TIME-VARYING IMAGE PROCESSING AND MOVING OBJECT RECOGNITION, 4
TIME-VARYING IMAGE PROCESSING AND MOVING OBJECT RECOGNITION, 4
Proceedings of the 5th International Workshop, Florence, Italy, September 5-6, 1996
Edited by V. CAPPELLINI
Department of Electronic Engineering, University of Florence, Florence, Italy
1997
ELSEVIER
Amsterdam - Lausanne - New York - Oxford - Shannon - Tokyo
ELSEVIER SCIENCE B.V., Sara Burgerhartstraat 25, P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ISBN: 0 444 82307 7

© 1997 Elsevier Science B.V. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands. Special regulations for readers in the U.S.A. - This publication has been registered with the Copyright Clearance Center Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including photocopying outside of the U.S.A., should be referred to the copyright owner, Elsevier Science B.V., unless otherwise specified. No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. pp. 69-76, 184-189, 190-196: Copyright not transferred. This book is printed on acid-free paper. Printed in The Netherlands.
PREFACE

The area of Digital Image Processing is of great current importance for research and applications. Through interaction and cooperation with the neighbouring areas of Pattern Recognition and Artificial Intelligence, the specific area of "Time-Varying Image Processing and Moving Object Recognition" has attracted increasing interest. This area is contributing to impressive advances in several fields, such as communications, radar-sonar systems, remote sensing, biomedicine, moving vehicle tracking-recognition, traffic monitoring and control, automatic inspection and robotics. This book represents the Proceedings of the Fifth International Workshop on Time-Varying Image Processing and Moving Object Recognition, held in Florence, September 5-6, 1996. The extended papers reported here provide an authoritative and permanent record of the scientific and technical lectures, presented by selected speakers from 10 nations. Some papers are more theoretical or of a review nature, while others contain new implementations and applications. They are conveniently grouped into the following fields:

A. Digital Processing Methods and Techniques
B. Pattern Recognition
C. Computer Vision
D. Image Coding and Transmission
E. Remote Sensing Data and Image Processing
F. Digital Processing of Biomedical Images
G. Motion Estimation
H. Tracking and Recognition of Moving Objects
I. Application to Cultural Heritage
New digital image processing and recognition methods, implementation techniques and advanced applications (television, remote sensing, biomedicine, traffic, inspection, robotics, etc.) are presented. New approaches (e.g. digital filters, source coding, neural networks) for solving 2-D and 3-D problems are described. Many papers concentrate on motion estimation and the tracking-recognition of moving objects. The increasingly important field of Cultural Heritage is also covered. Overall, the book presents the state of the art (theory, implementation, applications) of the above outlined area, together with near-future trends. This work will be of interest not only to researchers, professors and students in university departments of engineering, communications, computers and automatic control, but also to engineers and managers of industries concerned with computer vision, manufacturing, automation, robotics and quality control.

V. Cappellini
WORKSHOP CHAIRMAN
V. CAPPELLINI, University of Florence, Florence, Italy
STEERING COMMITTEE
J.K. AGGARWAL, University of Texas, Austin, U.S.A.
M. BELLANGER, Conservatoire National des Arts et Métiers, Paris, France
J. BIEMOND, University of Delft, The Netherlands
M. BRACALE, University of Naples, Italy
A.G. CONSTANTINIDES, Imperial College, London, England
T.S. DURRANI, University of Strathclyde, Glasgow, Scotland
G. GALATI, II University of Rome, Italy
G.H. GRANLUND, University of Linköping, Sweden
T.S. HUANG, University of Illinois at Urbana-Champaign, U.S.A.
G. IMMOVILLI, University of Modena, Italy
M. KUNT, Ecole Polytechnique Fédérale de Lausanne, Switzerland
A.R. MEO, Polytechnic of Turin, Italy
S.K. MITRA, University of California, Santa Barbara, U.S.A.
F. ROCCA, Polytechnic of Milan, Italy
A. ROVERI, University of Rome "La Sapienza", Italy
G.L. SICURANZA, University of Trieste, Italy
A.N. VENETSANOPOULOS, University of Toronto, Canada
G. VERNAZZA, University of Cagliari, Italy
Sponsored by:
European Association for Signal Processing (EURASIP)
IEEE Central & South Italy Section
European Association of Remote Sensing Laboratories (EARSeL)
International Center for Signal & Image Processing (ICESP), Florence
Centro d'Eccellenza Optronica (CEO)
Dipartimento di Ingegneria Elettronica, University of Florence
Istituto di Ricerca sulle Onde Elettromagnetiche (IROE) "Nello Carrara" - C.N.R., Florence
Fondazione Ugo Bordoni
Fondazione IBM ITALIA
Fondazione per la Meteorologia Applicata
Associazione Italiana di Telerilevamento (AIT)
Gruppo Nazionale Telecomunicazioni e Teoria dell'Informazione (T.T.I.) - C.N.R.
Sezione di Firenze dell'A.E.I.
Associazione Italiana di Ingegneria Medica e Biologica (A.I.I.M.B.)
CESVIT - Agenzia per l'Alta Tecnologia
Regione Toscana - Giunta Regionale
Co-sponsored by:
Alenia Spazio
Alinari
AXIS
Esaote Biomedica
Nuova Telespazio
OTE
SAGO
S.M.A. - Sistemi per la Meteorologia e l'Ambiente
Syremont
Telecom Italia
Telesoft
Ente Cassa di Risparmio di Pistoia e Pescia
CONTENTS

A. DIGITAL PROCESSING METHODS AND TECHNIQUES

A.1 "On 3-D Space-time Interpolation and Data Compression of Digital Image Sequences Using Low-order 3-D IIR Filters"
H.-L.M. CHENG and L.T. BRUTON

A.2 "Flicker Reduction in Old Film Sequences" (Invited)
P.M.B. VAN ROOSMALEN, R.L. LAGENDIJK and J. BIEMOND

A.3 "Multichannel Filters in Television Image Processing"  19
K.N. PLATANIOTIS, S. VINAYAGAMOORTHY, D. ANDROUTSOS and A.N. VENETSANOPOULOS

B. PATTERN RECOGNITION  25

B.1 "Blotch and Scratch Detection in Image Sequences based on Rank Ordered Differences" (Invited)  27
M.J. NADENAU and S.K. MITRA

B.2 "Feature Matching by Optimization using Environmental Constraints"  36
A. BRANCA, E. STELLA, G. ATTOLICO and A. DISTANTE

B.3 "System Identification for Fuzzy Controllers"  42
G. CASTELLANO, G. ATTOLICO, T. D'ORAZIO, E. STELLA and A. DISTANTE

C. COMPUTER VISION  49

C.1 "Computer Vision for Autonomous Navigation: from Research to Applications"  51
G. GARIBOTTO, P. BASSINO, M. ILIC and S. MASCIANGELO

C.2 "An Optimal Estimator of Camera Motion by a Non-Stationary Image Model"  57
G. GIUNTA and U. MASCIA

C.3 "A Simple Cue-Based Camera Calibration Method for Digital Production of Moving Images"  63
Y. NAKAZAWA, T. KOMATSU and T. SAITO

C.4 "Exploration of the Environment with Optical Sensors Mounted on a Mobile Robot"  69
P. WECKESSER, A. VON ESSEN, G. APPENZELLER and R. DILLMANN

D. IMAGE CODING AND TRANSMISSION  77

D.1 "Time-Varying Image Processing for 3D Model-Based Video Coding" (Invited)  79
T.S. HUANG, R. LOPEZ and A. COLMENAREZ

D.2 "A New Arbitrary Shape DCT for Object-Based Image Coding"  87
M. TANIMOTO and M. SATO

D.3 "Picture Coding Using Splines"  93
M. BUSCEMI, R. FENU, D.D. GIUSTO and G. LIGGI

D.4 "A 10 kb/s Video Coding Technique Based on Spatial Transformation"  99
S. BONIFACIO, S. MARSI and G.L. SICURANZA

D.5 "Image Communications Projects in ACTS" (Invited)  105
F. BIGI

D.6 "Conveying Multimedia Services within the MPEG-2 Transport Stream"  115
L. AZTORI, M. DI GREGORIO and D.D. GIUSTO

D.7 "A Subband Video Transmission Coding System for ATM Network"  121
M. EYVAZKHANI

D.8 "A High Efficiency Coding Method"  127
K. KAMIKURA, H. JOZAWA, H. WATANABE, H. KOTERA and K. SHIMAMURA

D.9 "A Sequence Analysis System for Video Databases"  133
M. CECCARELLI, A. HANJALIC and R.L. LAGENDIJK

D.10 "Subjective Image Quality Estimation in Subband Coding: Methodology and Human Visual System Application"  139
Z. BOJKOVIC, A. SAMCOVIC and B. RELJIN

E. REMOTE SENSING DATA AND IMAGE PROCESSING  145

E.1 "Neural Networks for Multi-Temporal and Multi-Sensor Data Fusion in Land Cover Classification"  147
A. CHIUDERI

E.2 "Influence of Quantization Errors on SST Computation Based on AVHRR Images"  153
P.F. PELLEGRINI, F. LEONCINO, E. PIAZZA and M. DI VAIA

E.3 "Study of Ecological Condition Based upon the Remote Sensing Data and GIS"  159
M. ZHANG, J. BOGAERT and I. IMPENS

E.4 "PEICRE PROJECT: a Practical Application of Remote Sensing Techniques for Environmental Recover and Preservation"  165
M. BENVENUTI, C. CONESE, C. DI CHIARA and A. DI VECCHIA

E.5 "A Wavelet Classification Chain for Rain Pattern Tracking from Meteorological Radar Data"  171
P. GAMBA, A. MARAZZI and A. MECOCCI

E.6 "Frequency Locked Loop System for Doppler Centroid Tracking and Automatized Raw Data Correction in Spotlight Real-Time SAR Processors"  176
F. IMPAGNATIELLO and A. TORRE

E.7 "Use of Clutter Maps in the High Resolution Radar Surveillance of Airport Surface Movements"  184
G. GALATI, M. FERRI and M. NALDI

E.8 "Simulation of Sequences of Radar Images for Airport Surveillance Applications"  190
F. MARTI, M. NALDI and E. PIAZZA

E.9 "Data Fusion and Non Linear Processing of E.L.F. Signal for the Detection of Tethered Satellite System"  197
S. MONTEVERDE, R. RUGGERONE, D. TRAVERSO, S. DELLEPIANE and G. TACCONI

F. DIGITAL PROCESSING OF BIOMEDICAL IMAGES  203

F.1 "A Simple Algorithm for Automatic Alignment of Ocular Fundus Images"  205
L. BALLERINI, G. COPPINI, G. GIACOMELLI and G. VALLI

F.2 "Automatic Vertebrae Recognition throughout a Videofluoroscopic Sequence for Intervertebral Kinematics Study"  213
P. BIFULCO, M. CESARELLI, R. ALLEN, J. MUGGLETON and M. BRACALE

F.3 "An Evaluation of the Auditory Cortex Response to Simple Non-Speech Stimuli through Functional MRI"  219
A. PEPINO, E. FORMISANO, F. DI SALLE, C. SAULINO and M. BRACALE

G. MOTION ESTIMATION  225

G.1 "Temporal Prediction of Video Sequences Using a Region-Based Image Warping Technique" (Invited)  227
N. HERODOTOU and A.N. VENETSANOPOULOS

G.2 "High Performance Gesture Recognition Using Probabilistic Neural Networks and Hidden Markov Models"  233
G. RIGOLL, A. KOSMALA and M. SCHUSTER

G.3 "Image Segmentation Using Motion Estimation"  238
K. ILLGNER and F. MOLLER

G.4 "A Phase Correlation Technique for Estimating Planar Rotations"  244
L. LUCCHESE, G.M. CORTELAZZO and M. RIZZATO

G.5 "Tracking by Cooccurrence Matrix"  250
L. FAVALLI, P. GAMBA, A. MARAZZI and A. MECOCCI

G.6 "Robust Pose Estimation by Marker Identification in Image Sequences"  256
L. ALPARONE, S. BARONTI, A. BARZANTI, A. CASINI, A. DEL BIMBO and F. LOTTI

G.7 "Markov Random Field Image Motion Estimation Using Mean Field Theory"  262
A. CHIMIENTI, R. PICCO and M. VIVALDA

G.8 "Moving Object Detection in Image Sequences Using Texture Features"  268
F. MOLLER, M. HOTTER and R. MESTER

G.9 "Determining Velocity Vector Fields from Sequential Images Representing a Salt-Water Oscillator"  274
A. NOMURA and H. MIIKE

H. TRACKING AND RECOGNITION OF MOVING OBJECTS  281

H.1 "'Long-Memory' Matching of Interacting Complex Objects from Real Image Sequences"  283
A. TESEI, A. TESCHIONI, C.S. REGAZZONI and G. VERNAZZA

H.2 "Spatial and Temporal Grouping for Obstacle Detection in a Sequence of Road Images"  289
S. DENASI and G. QUAGLIA

H.3 "Attitude of a Vehicle Moving on a Structured Road"  295
A. GUIDUCCI and G. QUAGLIA

H.4 "An Algorithm for Tracking Pedestrians at Road Crossing"  301
M. LORIA and A. MACHI

I. APPLICATION TO CULTURAL HERITAGE  307

I.1 "Cultural Heritage: The Example of the Consortium Alinari 2000-SOM"  309
A. DE POLO, E. SESTI and R. FERRARI

I.2 "Color Certification"  313
A. ABRARDO, V. CAPPELLINI, A. MECOCCI and A. PROSPERI

I.3 "Image Retrieval by Contents with Deformable User-Drawn Templates"  319
A. DEL BIMBO and P. PALA

I.4 "Synthesis of Virtual Views of Non-Lambertian Surface through Shading-Driven Interpolation and Stereo-Matched Contours"  325
F. PEDERSINI, A. SARTI and S. TUBARO

AUTHOR INDEX  331
A DIGITAL PROCESSING METHODS AND TECHNIQUES
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.)
© 1997 Elsevier Science B.V. All rights reserved.
On 3-D Space-time Interpolation and Data Compression of Digital Image Sequences Using Low-order 3-D IIR Filters

H.-L. Margaret Cheng and Leonard T. Bruton
Department of Electrical and Computer Engineering, The University of Calgary, Calgary, Alberta, Canada

Abstract -- A method is proposed for the data compression and spatio-temporal interpolation of temporally sub-sampled digital image sequences using a first-order 3-D Linear Trajectory (LT) IIR filter.
1. INTRODUCTION

Data compression of image sequences can be achieved by spatio-temporal sub-sampling. In this contribution, we propose a method for recovering a sequence of digital images from the temporally sub-sampled version using a low-order spatio-temporal 3-D IIR (infinite impulse response) filter to perform the required spatio-temporal interpolation. A first-order 3-D Linear Trajectory (LT) IIR filter [1] is employed for this purpose, followed by a smoothing operation performed in the direction of the motion vector. Experimental results suggest that high compression ratios may be possible. We assume for simplicity that, in each spatio-temporal sub-image sequence, the 3-D spatio-temporal signal contains only one object moving with a constant velocity. This assumption is valid for many practical situations and is the underlying assumption of MPEG-2 and other compression methods.
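As a concrete setup for the discussion that follows, the short sketch below (our illustration, not part of the original paper; all names and parameter values are invented) builds a synthetic constant-velocity sequence and keeps only every M-th frame. The task of the proposed interpolator is then to recover the discarded frames.

```python
import numpy as np

def make_lt_sequence(frames=60, size=64, vx=0.75, vy=0.5, side=8):
    """Synthetic Linear Trajectory (LT) signal: a bright square moving at
    a constant velocity of (vx, vy) pixels/frame over a dark background."""
    seq = np.zeros((frames, size, size))
    for t in range(frames):
        x0 = int(round(10 + vx * t))
        y0 = int(round(10 + vy * t))
        seq[t, y0:y0 + side, x0:x0 + side] = 100.0
    return seq

M = 4                                 # temporal sub-sampling factor
full = make_lt_sequence()
subsampled = full[::M]                # keep every M-th frame only
print(full.shape, '->', subsampled.shape)   # data reduced by a factor of M
```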
2. REVIEW OF SPATIO-TEMPORAL SUB-SAMPLING OF IMAGE SEQUENCES

A 3-D LT signal p_c(x, y, t), (x, y, t) ∈ R^3, is defined as a continuous-domain space-time signal having a value that is everywhere constant in the direction of the motion vector v = v_x ê_x + v_y ê_y + v_t ê_t, where ê_x, ê_y, ê_t are the unit basis vectors in the spatial and temporal directions, respectively. The region of support (ROS) of the 3-D Fourier transform of an LT signal is the plane passing through the origin and perpendicular to v, i.e. ω_x v_x + ω_y v_y + ω_t v_t = 0. The 2-D spectrum on this plane represents the spatial frequency components of the intersection of the 3-D signal with the plane perpendicular to v [1]. We assume that this continuous-domain LT signal p_c(x, y, t) is 3-D rectangularly sampled at a sufficiently high 3-D sampling frequency that aliasing is negligible. However, temporal sub-sampling of p_c(x, y, t) by M introduces aliased replicated 3-D frequency planes (referred to as replica hereafter) at locations ω_x v_x + ω_y v_y + ω_t v_t = ±2π v_t j / M, j ∈ [1, ..., M-1]. These replica must be completely eliminated by an ideal interpolator. To achieve close-to-ideal interpolation, we employ motion-compensated (MC) interpolation (lower part of Figure 1), where the orientation of the interpolator's passband is adapted to that of the spectrum of the sub-sampled signal.
Figure 1: Spectral representation of temporal (upper) and motion-compensated (lower) interpolation, shown for the 2-D case. Dashed lines show aliased replicated signal planes under temporal sub-sampling by M = 2. Shaded regions represent passbands of interpolators. The problem of aliasing is shown in (a), and its solution (i.e. pre-filtering) is shown in (b). Interpolation of properly pre-filtered signals is shown in (c). Adapted from [2].

In Figure 1 we review the advantage of using this method by comparing it with temporal (upper part of Figure 1) interpolation [2]. For ease of illustration, a 2-D signal that has been temporally sub-sampled by M = 2 is used. Its spectrum is shown in Figure 1(a), where the solid line represents the original spectrum of the signal prior to sub-sampling, the dashed lines represent replica introduced by sub-sampling, and the shaded regions represent the passbands of the interpolators. Clearly, the temporal interpolator transmits the undesirable replica and, therefore, fails. To avoid such aliasing in the case of the temporal interpolator, the high-frequency components may be eliminated by separably pre-filtering the signal prior to sub-sampling (Figure 1(b)). This seriously attenuates the 3-D planar spectrum of the signal, causing spatio-temporal blurring. However, MC interpolation ideally eliminates the replica and, therefore, does not require pre-filtering. In Figure 1(c) we show the two interpolators operating on appropriately pre-filtered and sub-sampled sequences. Aliasing is avoided in both cases. However, because MC interpolation is performed in the direction of the motion vector v, it does not attenuate the 3-D planar spectrum of the signal and is, therefore, much more effective than simple temporal (or spatial) interpolation.
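To make the replica structure concrete, the following sketch (ours, for illustration only) up-samples a single pixel's sinusoidal temporal trace by zero insertion and locates the replicated spectral lines, spaced at multiples of 2π/M, that an ideal interpolator must reject:

```python
import numpy as np

N, M = 256, 4
t = np.arange(N)
trace = np.cos(2 * np.pi * (8 / N) * t)   # temporal trace of one pixel

up = np.zeros(N * M)                      # temporal up-sampling by M:
up[::M] = trace                           # insert M-1 zero frames

spec = np.abs(np.fft.rfft(up))
lines = np.sort(np.argsort(spec)[-4:])    # four strongest spectral bins
print(lines)   # [8 248 264 504]: the baseband line plus aliased replicas,
               # spaced at multiples of 2*pi/M as predicted in Section 2
```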
3. A DESIGN TECHNIQUE TO OBTAIN THE 3-D LT IIR DISCRETE-DOMAIN FILTER FOR MC INTERPOLATION

To achieve motion-compensated interpolation, we wish to design a stable 3-D IIR discrete-domain LT filter having a 3-D passband that is approximately planar, where this passband closely surrounds the planar ROS of the 3-D LT signal. The design process commences with a suitable continuous-domain 3-D frequency-planar filter [1] having a 3-D Laplace transform transfer function of the form [1]
T(s_x, s_y, s_t) = R / (R + s_x L_x + s_y L_y + s_t L_t)   (1)
The passband of T(s_x, s_y, s_t) closely surrounds a 3-D plane [1] passing through the origin and having a normal n̂ = ±(L_x ê_x + L_y ê_y + L_t ê_t) / ‖L‖₂. The parameters R, L_x, L_y, L_t determine the orientation of the passband, and the "thickness" of the passband is determined by its 3-D bandwidth B₃ = 2R / ‖L‖₂ (Figure 2) [1].

Figure 2: Resonant plane of the first-order 3-D LT IIR filter

The proposed 3-D discrete-domain interpolating filter is obtained from the above continuous-time prototype by applying the triple s-to-z domain transform [3],
s_i = ((1 + a_i) / 2) · (z_i - 1) / (z_i + a_i),   0 < a_i ≤ 1,   i = x, y, t   (2)
The case where a_x = a_y = a_t = 1 corresponds to the triple bilinear transform (BLT). This discrete filter has the 3-D Z-transform transfer function [1]

H(z_x, z_y, z_t) = [ Σ_{i=0}^{1} Σ_{j=0}^{1} Σ_{k=0}^{1} a_ijk z_x^{-i} z_y^{-j} z_t^{-k} ] / [ Σ_{i=0}^{1} Σ_{j=0}^{1} Σ_{k=0}^{1} b_ijk z_x^{-i} z_y^{-j} z_t^{-k} ]   (3)

where the coefficients a_ijk and b_ijk are real.
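The passage from the continuous prototype (1) to the coefficients of (3) can be checked symbolically. The sketch below (ours; R, L_x, L_y, L_t and a_i are example values of our own choosing, not taken from the paper) substitutes transform (2) into (1) and expands the result into a ratio of polynomials in z_x, z_y, z_t, from which a_ijk and b_ijk can be read off after normalization:

```python
import sympy as sp

zx, zy, zt = sp.symbols('z_x z_y z_t')

# Example continuous-domain prototype parameters (arbitrarily chosen):
R = sp.Rational(1, 25)
L = {'x': 3, 'y': 2, 't': 4}
a = sp.Rational(9, 10)        # a_i < 1 reduces BLT warping artifacts

def s(z):
    # Generalized bilinear transform (2), same a_i in every dimension
    return (1 + a) / 2 * (z - 1) / (z + a)

T = R / (R + L['x'] * s(zx) + L['y'] * s(zy) + L['t'] * s(zt))
num, den = sp.fraction(sp.together(T))
print(sp.expand(num))   # numerator polynomial   -> coefficients a_ijk in (3)
print(sp.expand(den))   # denominator polynomial -> coefficients b_ijk in (3)
```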
The corresponding first-order 3-D recursive equation is [1]

q_{x,y,t} = (1 / b_000) [ Σ_{i,j,k=0}^{1} a_ijk p_{x-i,y-j,t-k} - Σ_{i,j,k=0, i+j+k≠0}^{1} b_ijk q_{x-i,y-j,t-k} ]   (4)
where p_{x,y,t} and q_{x,y,t} are the respective discrete input and output sequences. Three advantages of this filter, relative to a non-recursive (i.e. FIR) 3-D filter, are evident. First, computational requirements are low: only 16 multiplies and 14 adds are needed to compute each output pixel. Second, memory requirements are small, since as few as one frame store is needed for input or output. Third, the low order allows rapid adaptation to velocity changes.

3.1. Consequence of the Warping Effect of the 3-D BLT

Application of the BLT causes 3-D warping of the planar passband of the continuous-domain LT filter, such that the passband of the discrete-domain filter is given by L_x tan(ω_x/2) + L_y tan(ω_y/2) + L_t tan(ω_t/2) = 0, |ω_i| ≤ π, i = x, y, t. This 3-D warping can be shown to cause high-frequency "speckles" to appear inside and outside of the interpolated object in the corresponding space-time dimensions. However, the transformation (2), with a_i < 1, reduces the passband gain in the high-frequency regions, where warping is most severe, and in the dimensions in which it is applied. This has two effects. First, signal components from replicated planes that pass
through warped regions of the passband are further attenuated. Second, and more importantly, high-frequency components of the baseband signal spectrum are reduced, thereby eliminating much of the texture and artifacts appearing both within and without the object.

Figure 3: 3-D space-time interpolation scheme. The input p(x, y, t) is temporally up-sampled by inserting M-1 zero frames, filtered by the 3-D LT filter to give q(x, y, t), and post-processed for temporal intensity scaling to give the spatio-temporally interpolated output sequence r(x, y, t).

4. THE PROPOSED 3-D SPATIO-TEMPORAL MOTION-COMPENSATED INTERPOLATION SCHEME
Assuming a highly temporally sub-sampled digital video sequence has been obtained by means of an appropriate frame sub-sampling strategy, we focus here on the problem of reconstructing an approximation to the original video sequence by means of 3-D spatio-temporal interpolation. That is, the interpolator operates on an image sequence that has been sub-sampled temporally by a factor M. A priori knowledge of the corresponding motion vector v is assumed (v may be found using motion estimation techniques [4]). The proposed interpolation scheme consists of conventional temporal up-sampling of the image frames followed by 3-D spatio-temporal filtering to obtain the interpolated values in space-time. The proposed 3-D filtering is performed in two steps (Figure 3).

4.1. Obtaining the first-level approximation of the original image sequence

We apply the 3-D discrete-domain LT filter to the temporally up-sampled 3-D signal in order to obtain a first-level approximation to the original signal by recovering the missing frames. By orienting the passband of the filter such that n̂ = v̂, we achieve lowpass filtering in the space-time direction corresponding to v. So, the main signal plane in the baseband is retained while the replica introduced by sub-sampling are attenuated. However, due to the low order of the filter, the replica are not sufficiently attenuated. As a result, the intensity of the interpolated output sequence sustains a ripple whose period equals M and whose rate of decay depends on both B₃ and v̂ (see Figure 4(a)).

4.2. Eliminating intensity variations due to temporal ripple

A second stage is employed to smooth out the temporal ripple. Either one of two proposed methods may be utilized. The first method involves using an oriented 1-D FIR filter that performs a moving-average operation in the direction of v. The order of the filter must equal M to ensure that intensity fluctuations are almost completely eliminated. Here, the difference equation of the 1-D FIR filter is
r(x, y, t) = (1 / M) Σ_{i=0}^{M-1} q(x - i (v_x / v_t), y - i (v_y / v_t), t - i)   (5)
When the pixel locations x - i(v_x/v_t) and y - i(v_y/v_t) are non-integers, the nearest pixel is used. Hence, non-linearities arise. If, however, the quantities v_x/v_t and v_y/v_t are integers, we may take the Fourier transform to obtain the 3-D frequency response

S(e^{j(Ω_x + Ω_y + Ω_t)}) = (1 / M) · sin(M Φ / 2) / sin(Φ / 2) · e^{-j ((M-1)/2) Φ},   where Φ = (v_x / v_t) Ω_x + (v_y / v_t) Ω_y + Ω_t   (6)
This is a good approximation even in the case of non-integer pixel locations. In the 3-D Fourier domain, S(e^{j(Ω_x + Ω_y + Ω_t)}) is a 3-D sinc-like function in the direction of v and has equi-gain planes perpendicular to v. For integer pixel locations, the 3-D planes where the gain is zero correspond exactly to the locations of the temporally replicated signal planes. This is a very effective 3-D lowpass filter for removing the replica introduced by temporal sub-sampling.

The second method is to scale each frame of the output of the LT filter by a pre-determined corrective intensity-scaling factor. Since intensity variations are only a function of the sub-sampling factor M, the bandwidth B₃ of the LT filter, and its orientation n̂, we can pre-determine the temporal intensity ripple fluctuations and thereby obtain the normalizing scale factor required for each frame. Although this method does not further attenuate the replicated planes, it is computationally efficient when compared with the first method.
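A direct rendering of the first method, the oriented moving average (5) with nearest-pixel rounding, might look like the sketch below (our code, not the authors' implementation; boundary samples simply wrap around, which a practical version would handle explicitly):

```python
import numpy as np

def oriented_fir(q, M, vx, vy, vt=1.0):
    """Smooth along the motion direction: average M samples of q taken at
    displacements -i*(vx/vt, vy/vt, 1), i = 0..M-1, as in eq. (5)."""
    r = np.zeros_like(q, dtype=float)
    for i in range(M):
        dx = int(round(i * vx / vt))   # nearest pixel for non-integer
        dy = int(round(i * vy / vt))   # displacements
        # roll moves sample q[t-i, y-dy, x-dx] to position [t, y, x]
        r += np.roll(q, shift=(i, dy, dx), axis=(0, 1, 2))
    return r / M
```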
5. EXAMPLE

We present an example to demonstrate the capability of the proposed system to interpolate an up-sampled sequence involving an object that moves at constant velocity, and also compare the results obtained using the two methods for removing temporal intensity variations. We use a 3-D digital image sequence in which a 40×40-pixel square, of value 100, moves at a velocity of (0.75, 0.5) pixels per frame. The frame has dimensions 256×256 pixels. A temporal sub-sampling factor M = 20 is used, implying a data reduction of 20. The other parameters are B₃ = 0.04 and a factor of a_i = 0.9 for i = x, y, t. In Figure 4(a) we show the average intensity of the interpolated square object as a function of image frames. Evidently, both methods remove the temporal intensity fluctuation present at the output of the LT filter, shown by the dotted line. However, differences occur in the spatial characteristics of the output, as seen in Figures 4(c) and 4(d), where we show frame 50 of the interpolated output. Comparison of the FIR and scaling methods shows that the former removes aliased textural artifacts (ripples) due to sub-sampling that the scaling method does not eliminate. However, the difference between the two results is small, and for computational efficiency, scaling is preferred to FIR filtering. For comparison, Figure 4(b) shows output frame 50 when the BLT is used in conjunction with intensity scaling. Artifacts are evident both within and without the square. However, these artifacts can be mostly removed by using a_i < 1.

6. CONCLUSION

A 3-D space-time interpolation system is proposed that uses information about the motion of an object to recover the missing frames in a temporally sub-sampled digital image sequence. The first-order 3-D LT IIR filter [1] is proposed for performing MC interpolation.
Readjustment of the output intensity is performed by further filtering or by scaling. It is shown that the system works well for interpolating objects moving at constant velocity and, though not shown here, those undergoing sudden velocity changes. This method is effective for data reductions up to 20, implying the potential for compression ratios much larger than those achieved by the MPEG-2 method.
Figure 4: (a) Comparison of average intensities of the output image. (b) Frame 50 of the output of the LT filter obtained by using the BLT. Frame 50 of the post-processed output obtained by (c) FIR filtering and (d) intensity scaling, for M = 20, a_i = 0.9, i = x, y, t.

REFERENCES
[1] L.T. Bruton and N.R. Bartley. The Enhancement and Tracking of Moving Objects in Digital Images Using Adaptive Three-Dimensional Recursive Filters. IEEE Transactions on Circuits and Systems, CAS-33(6), June 1986.
[2] M. Ibrahim Sezan and Reginald L. Lagendijk. Motion Analysis and Image Sequence Processing. Kluwer Academic Publishers, Norwell, Massachusetts, USA, 1993.
[3] P. Agathoklis, L.T. Bruton, and N.R. Bartley. The Elimination of Spikes in the Magnitude Frequency Response of 2-D Discrete Filters by Increasing the Stability Margin. IEEE Transactions on Circuits and Systems, CAS-32(5):451-458, May 1985.
[4] Didier Le Gall. MPEG: a Video Compression Standard for Multimedia Applications. Communications of the ACM, 34(4), April 1991.
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.)
© 1997 Elsevier Science B.V. All rights reserved.
Flicker Reduction in Old Film Sequences*

P.M.B. van Roosmalen, R.L. Lagendijk and J. Biemond
Delft University of Technology, Department of Electrical Engineering, Information Theory Group, P.O. Box 5031, NL-2600 GA Delft, The Netherlands.
Image flicker, undesirable fluctuations in image intensity not originating from the original scene, is a common artifact in old film sequences. After describing possible causes of image flicker, this paper models the effects of flicker as local phenomena. Unfortunately, estimation of the model parameters from the degraded sequence is hampered by the presence of noise, dirt and motion. In the latter cases the model parameters cannot be estimated directly from local data and are interpolated from the model parameters found in nearby regions. Once the model parameters have been estimated, the film sequence can be corrected, taking care that no blocking artifacts occur. The application of this technique in combination with other restoration techniques is discussed.
1. INTRODUCTION

Unique records of the historic, artistic and cultural developments of every aspect of the 20th century are stored in huge stocks of moving picture archive material. Many of these historically significant items are in a fragile state and are in desperate need of restoration. However, the high cost and lengthy processing time required to restore archive material limit the preservation of these records on a large scale. The aim of the AURORA project (AUtomated Restoration of ORiginal film and video Archives) is the development of technology that significantly reduces the cost and processing time of the restoration processes. Areas of interest within AURORA include Noise Reduction [1], Blotch Detection and Removal [2], Scratch Removal [3], Film Unsteadiness Correction [4], Flicker Reduction, Line Registration Correction [5] and Color Correction. There are several reasons why the artifacts covered by these areas are to be addressed.
* This work is funded by the European Community under the ACTS contract AC072 (AURORA).
The first is the explosive growth in the number of television broadcasters: in the near future the home viewer will be able to choose from a hundred or more channels, and all of them require programming. The costs of creating new, high-quality programs are tremendous. Recycling old programs forms a good alternative, provided the image (and audio) quality expectations of the modern viewer are met. The second reason for image restoration is that preservation implies storage. The presence of artifacts, and noise in particular, causes compression algorithms to dedicate many bits to irrelevant information. After processing, image sequences of higher quality can be stored using fewer bits.

In this paper we concentrate on the reduction of flicker artifacts. Image flicker is a common artifact in old film sequences. It is defined as unnatural temporal fluctuations in perceived image intensity (globally or locally) not originating from the original scene. Image flicker can have a great number of causes, e.g. aging of film, dust, chemical processing, copying and aliasing (e.g. when transferring film to VCR using a twin-lens telecine). To our knowledge very little research has been done on this topic. Neither equalizing the intensity histograms nor equalizing the mean frame values of consecutive frames, as suggested in [6], forms a general solution to the problem. These methods do not take changes of scene contents into account and they do not appreciate the fact that flicker can be a spatially localized effect.
2. A MODEL FOR IMAGE FLICKER

Due to the lack of detailed knowledge of how the various mechanisms mentioned above cause image flicker, it is difficult to derive models for image flicker from these mechanisms. Even if such models were known, there would still be the problem of selecting one of them for correcting the film sequence: often only the degraded sequence is available, and it is not known what mechanism caused the image flicker. What can be said about flicker is that in any case it causes unnatural changes in image intensity (locally and/or globally) in time. Our approach models image flicker as a local effect independent of the scene contents. We want to limit fluctuations in image intensity in time by locally preserving the intensity mean and the intensity variance. The following model is assumed:
Y(x, y, t) = α(t) (I(x, y, t) + γ(x, y, t)) + β(t) + η(x, y, t),   with α(t) and β(t) constant for x, y ∈ Ω   (I)
where Y(x, y, t) and I(x, y, t) indicate the observed and real image intensities respectively, α(t) and β(t) are flicker gain and offset parameters, and Ω indicates a small image region, so that flicker is modeled as a local effect. In the ideal case (no fading, no flicker) α(t) = 1 and β(t) = 0. Both flicker-dependent noise γ(x, y, t) and flicker-independent noise η(x, y, t) add to the overall amount of noise, which can be estimated, for example, as in [7]. An example of flicker-dependent noise is granular noise already on the film before flicker is introduced. Flicker-independent noise can be thermal noise due to electronic processing.
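For intuition, the sketch below (ours; the block size and noise levels are made-up values) degrades a clean sequence according to model (I), with gain and offset constant on each block Ω but varying from block to block and frame to frame:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_flicker(I, block=32, noise_std=2.0):
    """Apply model (I): Y = alpha*(I + gamma) + beta + eta, with alpha(t)
    and beta(t) constant on each block Omega."""
    T, H, W = I.shape
    Y = np.empty(I.shape, dtype=float)
    for t in range(T):
        for y in range(0, H, block):
            for x in range(0, W, block):
                alpha = rng.normal(1.0, 0.1)   # flicker gain
                beta = rng.normal(0.0, 5.0)    # flicker offset
                patch = I[t, y:y + block, x:x + block].astype(float)
                gamma = rng.normal(0, noise_std, patch.shape)  # flicker-dependent noise
                eta = rng.normal(0, noise_std, patch.shape)    # flicker-independent noise
                Y[t, y:y + block, x:x + block] = alpha * (patch + gamma) + beta + eta
    return Y
```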
3. ESTIMATION OF FLICKER PARAMETERS

Flicker correction requires estimation of the flicker parameters α(t) and β(t). The estimates resulting from the initial approach (section 3.1) are optimal for stationary scenes. The estimation of image statistics in non-stationary scenes is usually influenced by the presence of motion. To avoid this one would like to apply some form of motion compensation. Unfortunately the presence of flicker hampers motion estimation, as motion estimators usually have a constant luminance constraint, i.e. pel-recursive methods and all motion estimators that make use of block matching in one stage or another. For this reason we choose to merely detect the presence of motion (section 3.2). For regions in which motion was detected the flicker parameters are then interpolated using the flicker parameters of nearby regions not containing motion (section 3.3).
3.1. Flicker parameter estimation in the motion-free case

For the moment a stationary scene is assumed; let I(x, y, t) = I(x, y). It is also assumed that the distribution of γ(x, y, t) does not change in time. This is acceptable under the assumption that the physical quality of the film is constant and, as mentioned before, the scene is stationary. Taking the expected value and the variance of Y(x, y, t) in (I), in a spatial sense, gives for x, y ∈ Ω:
E(Y(x, y, t)) = α(t) E(I(x, y) + γ(x, y, t)) + β(t) + E(η(x, y, t))   (II)
var(Y(x, y, t)) = var(α(t) (I(x, y) + γ(x, y, t)) + β(t) + η(x, y, t)) = α²(t) var(I(x, y) + γ(x, y, t) + η(x, y, t)) + (1 - α²(t)) var(η(x, y, t))   (III)
When assuming zero-mean noise, rewriting these equations gives α(t) and β(t) for x, y ∈ Ω:

β(t) = E(Y(x, y, t)) - α(t) E(I(x, y))   (IV)
α(t) = √[ (var(Y(x, y, t)) - var(η(x, y, t))) / (var(I(x, y) + γ(x, y, t) + η(x, y, t)) - var(η(x, y, t))) ]   (V)
Following [8] it can be shown that these estimates for α(t) and β(t) are optimal in the sense that they result in the linear minimal mean squared error between the real image intensity and the estimated image intensity. If the variance of the flicker-independent noise is small compared to the variance of the observed signal and/or α(t) ≈ 1, (V) can be approximated by:
α(t) ≈ √[ var(Y(x, y, t)) / var(I(x, y) + γ(x, y, t) + η(x, y, t)) ]   (VI)
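In a block-based implementation, (IV) and (VI) reduce to a few lines per block. A minimal sketch (ours), assuming the temporal reference estimates of E(I) and var(I + γ + η) (developed below as (VII)-(X)) are already available:

```python
import numpy as np

def estimate_flicker_params(Y_block, ref_mean, ref_var):
    """Estimate alpha(t) and beta(t) for one block Omega.
    Y_block:  observed pixels of the block in the current frame
    ref_mean: temporal estimate of E(I(x, y)) for this block
    ref_var:  temporal estimate of var(I + gamma + eta) for this block"""
    alpha = np.sqrt(Y_block.var() / ref_var)    # approximation (VI)
    beta = Y_block.mean() - alpha * ref_mean    # equation (IV)
    return alpha, beta
```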
In order to solve (IV) and (VI) in a practical situation, estimates in a temporal sense of the expected means and variances at frame t can be used:

E(I(x, y))_t = E_T(E(I(x, y))) = E_T( (E(Y(x, y, t)) - β(t)) / α(t) ) = (1 / (N-1)) Σ_{n=t-N}^{t-1} (E(Y(x, y, n)) - β(n)) / α(n)   (VII)

var(I(x, y) + γ(x, y, t) + η(x, y, t))_t = E_T( var(Y(x, y, t)) / α²(t) ) = (1 / (N-1)) Σ_{n=t-N}^{t-1} var(Y(x, y, n)) / α²(n)   (VIII)

To reduce memory requirements and computational load, first-order IIR filters are used instead of (VII) and (VIII) in a practical situation:

E(I(x, y))_t = K E(I(x, y))_{t-1} + (1 - K) (E(Y(x, y, t-1)) - β(t-1)) / α(t-1)   (IX)

var(I + γ + η)_t = K var(I + γ + η)_{t-1} + (1 - K) var(Y(x, y, t-1)) / α²(t-1)   (X)
where K signifies the importance of the previous estimate. Depending on the value for K this method allows the estimates of the original image mean and variance to be adapted to changes in scene lighting (e.g. during a fade or when a light is switched on). Low frequency image flicker is not removed in that case.
3.2. Motion detection in image sequences containing flicker

A number of motion detection mechanisms that can be applied to image sequences containing image flicker are described in this section. As these mechanisms rely on detecting changes in image statistics, not only motion but also dirt, drop-outs and scene changes trigger the motion detectors. Where motion is detected the recursive filters for estimating the mean and variance have to be reset.

3.2.1. Motion detection using the flicker parameters

Motion causes local changes in temporal statistics: significant changes in intensity variance and/or mean result in large deviations from 1.0 for α(t) and/or from 0 for β(t)/α(t), respectively. Regions containing motion can be detected by comparing all α(t) and β(t)/α(t) to threshold values ±T_α and ±T_β. Motion is flagged when either flicker parameter surpasses its threshold value (typical values for T_α and T_β are 0.3 and 20 respectively).
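A sketch of this detector (our code; the threshold defaults follow the typical values quoted above):

```python
def motion_from_flicker_params(alpha, beta, T_alpha=0.3, T_beta=20.0):
    """Section 3.2.1: flag a block as containing motion when alpha deviates
    from 1 by more than T_alpha, or beta/alpha deviates from 0 by more
    than T_beta (alpha is assumed positive)."""
    return abs(alpha - 1.0) > T_alpha or abs(beta / alpha) > T_beta
```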
3.2.2. Motion detection using frame differences

A different method for detecting the presence of motion is the following. For each block in the current frame, α(t) and β(t) are estimated using (IV) and (VI). The corrected frame is generated using (XI) (see section 4). In the absence of motion the variance of local frame differences between the corrected frame and the previous corrected frame should be twice the total noise variance. Where this is not the case, motion is detected.
3.2.3. A hybrid motion detection system

The method in section 3.2.2 has the disadvantage that it is very sensitive to film unsteadiness. Slight movements of textured areas may lead to large frame differences and thus to "false" detection of motion. The method in section 3.2.1 is robust against film unsteadiness. The drawback in comparing the flicker parameters α(t) and β(t)/α(t) to threshold values is that it is difficult to find good threshold values: false alarms and misses will always occur. Combining the two methods leads to a robust algorithm. First, the motion detection algorithm from section 3.2.1 is applied, where T_α and T_β are chosen relatively small, leading to relatively many false alarms and few misses. Second, the algorithm from section 3.2.2 is applied to those regions for which motion was detected: the correctness of the found flicker parameters is verified.
3.3. Interpolation of unreliable flicker parameters

Where motion is detected, the flicker parameters α(t) and β(t) computed according to (IV) and (VI) are unreliable. They are to be interpolated using the flicker parameters found in nearby regions. This approach leans on the assumption that the flicker parameters vary slowly (are correlated) in a spatial sense, and, as stated before, are independent of image contents. One pitfall is to be avoided. For uniform regions corrupted by image flicker it is difficult to tell what part of the image flicker is due to variations in gain and what part is due to variations in offset. These regions should not be included in the interpolation process. Moreover, from section 4 it will become clear that the estimated flicker parameters for these regions should be marked unreliable. In the case of the restoration of old film sequences no problems are to be expected, as granular noise is always present (we implicitly assume that granular noise is affected by flicker in a similar manner as the original scene intensities).

The iterative interpolation process is as follows. Consider the matrix containing the values of all α(t) for a certain image. Figure 1a shows an example of such a matrix. The gray area indicates the image blocks for which α(t) are known; the white area indicates the image blocks in which motion was detected. For blocks in the latter region the values α(t) can be estimated at the boundary of the two regions, by taking the average value of the α(t) in adjacent blocks in the still region (Fig. 1b). By repeating this dilation process an estimate for α(t) can be assigned to each image block in regions where motion was detected (Fig. 1c,d). The procedure for estimating the unknown β(t) is similar.
Figure 1. (a) gray indicates known parameter values, white indicates the unknown values. (b), (c) and (d) indicate what parameters have been estimated after 1, 2 and 3 steps of the dilation operation.
This method is not optimal in the sense that jumps might occur between the values for α(t) and β(t) in adjacent image blocks near the center of the dilated region (e.g. when the values in the top-left hand side of the still region are very different from the values in the bottom-right hand side). This can be resolved by smoothing the found results using, for instance, a Laplacian kernel (see section 4). As the region containing motion becomes larger, more steps are required for the dilation process. This implies more uncertainty about the correctness of the interpolated values. Applying biases towards unity for α(t) and towards zero for β(t) that grow with each step reduces the probability that flicker is enhanced due to incorrect estimation of the flicker parameters.
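The dilation process can be sketched as follows (our rendering; `known` marks blocks where no motion was detected, and the averaging uses the 4-neighbourhood):

```python
import numpy as np

def dilate_params(param, known):
    """Iteratively fill unreliable blocks (known == False) with the average
    of already-known 4-neighbours, as in the dilation of Figure 1."""
    param, known = param.astype(float).copy(), known.copy()
    while not known.all():
        new_known = known.copy()
        for y, x in zip(*np.where(~known)):
            vals = [param[ny, nx]
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                    if 0 <= ny < param.shape[0] and 0 <= nx < param.shape[1]
                    and known[ny, nx]]
            if vals:                          # at least one known neighbour
                param[y, x] = np.mean(vals)
                new_known[y, x] = True
        if (new_known == known).all():        # nothing left to fill
            break
        known = new_known
    return param
```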
4. CORRECTING IMAGE FLICKER

Once the flicker parameters have been estimated, the sequence can be corrected. But first an extra step is required. As the flicker parameters are computed on a block-by-block basis, blocking artifacts will be introduced if the found flicker parameters are applied for correction without preprocessing. This preprocessing consists of upsampling the matrices containing the flicker parameters to full image resolution, followed by smoothing using a low-pass filter. As mentioned before, when sources other than film are used, the contributions of changes in gain and offset to the flicker cannot be determined for uniform regions using (IV) and (VI). It is necessary that the flicker parameters in the uniform regions are estimated using the interpolation scheme in section 3.3. If not, smoothing would have the unreliable flicker parameters of these regions influence the reliable flicker parameters of neighboring regions. Now the new flicker-free image can be estimated according to:

Î(x, y, t) = (Y(x, y, t) - β(x, y, t)) / α(x, y, t)   (XI)
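Given gain and offset maps upsampled and smoothed to full image resolution, the correction itself is pointwise. A minimal sketch (ours; the small `eps` guard against near-zero gains is our addition):

```python
import numpy as np

def correct_frame(Y, alpha_map, beta_map, eps=1e-3):
    """Apply (XI) pixelwise: I_hat = (Y - beta) / alpha."""
    return (Y - beta_map) / np.maximum(alpha_map, eps)
```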
5. EXPERIMENTS AND RESULTS
Figure 2. Clips of original and corrected frames.

In our experiments we used a test sequence of 50 frames containing image flicker and motion (introduced by a man entering the scene through a tunnel). When viewing this sequence it can clearly be seen that the amount of flicker varies locally. Also the presence of granular noise is
clearly visible. The signal-to-noise ratio was estimated to be 21 dB. Equalizing the mean field intensities did not lead to a reduction in image flicker. Figure 2 shows clips of frames 13 and 15, which contain excessive amounts of flicker, before and after correction. Figure 3 shows the field means and variances of the original and the processed sequence.

Figure 3. (a), (b) Mean frame intensities and variances of the original sequence (mean intensity 68.6, std. dev. 2.5; intensity variance 1397.9, std. dev. 77.5). (c), (d) Mean frame intensities and variances of the corrected sequence (mean intensity 67.2, std. dev. 1.0; intensity variance 1353.7, std. dev. 52.0).

The smoother curves resulting from the processed sequence in Figure 3 imply that the amount of image flicker has been reduced. Subjective evaluation confirms this. A (very) small amount of low-frequency flicker remained, which can be explained by keeping the last paragraph of section 3.1 in mind. No blocking artifacts are visible and no blurring occurred. No new artifacts were visible.
6. DISCUSSION

Figure 4. Flicker correction as part of an automatic image restoration system (block diagram: input -> image stabilization -> 2D noise reduction -> compute flicker parameters -> flicker correction -> output).
In practical situations the proposed scheme for flicker correction will be applied in combination with other restoration techniques, as in many old films combinations of various artifacts are present simultaneously. Two common types of artifacts are noise and image unsteadiness. An example of the place of flicker correction in an automatic restoration system is shown in Figure 4. Here the flicker parameters α(t) and β(t) are estimated from a noise-reduced, stabilized sequence. The simultaneous image flicker correction and image stabilization is applied to the original sequence. The output of this system forms the input for subsequent stages of the restoration system, where noise, dirt and dropouts are removed making use of motion estimation and motion compensation.

The flicker correction scheme can easily be extended to include camera panning, as the panning vectors can be estimated from the image stabilization vectors. Including camera zoom is more troublesome. A major problem is that the characteristics of observed texture change depending on the distance to the camera and on camera parameters such as aperture and focal point. It is difficult to adjust for these. Including scene rotation (perpendicular to the camera) is possible. The first frame of a sequence is chosen as a reference, and later frames are compensated for their rotation with respect to the reference frame. Flicker can then be corrected for and the result is rotated back again. Note that aliasing caused by correction for rotation may well influence the results. As the rotation angle becomes larger, less of the frames corrected for rotation overlaps with the reference frame. It is then necessary to pick a new reference frame. This can be the current frame, with the disadvantage that the overall brightness of this frame may be noticeably different from the overall brightness of the corrected preceding frame. Another possibility is to choose the corrected preceding frame as a reference (in doing so the loop is closed and the system might become unstable). Fortunately, old film sequences seldom contain zoom and rotation.
REFERENCES

[1] P.M.B. van Roosmalen, S.J.P. Westen, R.L. Lagendijk and J. Biemond, "Noise Reduction for Image Sequences using an Oriented Pyramid Thresholding Technique", Proceedings of ICIP-96, Vol. I, pp. 375-378, Lausanne, Switzerland, IEEE 1996.
[2] A.C. Kokaram, R.D. Morris, W.J. Fitzgerald and P.J.W. Rayner, "Interpolation of Missing Data in Image Sequences", IEEE Transactions on Image Processing, Vol. 4, No. 11, pp. 1496-1508, 1995.
[3] R.D. Morris, W.J. Fitzgerald and A.C. Kokaram, "A Sampling Based Approach to Line Scratch Removal from Motion Picture Frames", Proceedings of ICIP-96, Vol. I, pp. 801-804, Lausanne, Switzerland, IEEE 1996.
[4] T. Vlachos and G. Thomas, "Motion Estimation for the Correction of Twin-Lens Telecine Flicker", Proceedings of ICIP-96, Vol. I, pp. 109-111, Lausanne, Switzerland, IEEE 1996.
[5] A.C. Kokaram, P.M.B. van Roosmalen, P.J.W. Rayner and J. Biemond, "Line Registration of Jittered Video", submitted to ICASSP 97, Munich 1997.
[6] P. Richardson and D. Suter, "Restoration of Historic Film for Digital Compression: A Case Study", Proceedings of ICIP-95, Vol. II, pp. 49-52, Washington D.C., USA, IEEE 1995.
[7] J.B. Martens, "Adaptive Contrast Enhancement through Residue-Image Processing", Signal Processing, Vol. 44, pp. 1-18, 1995.
[8] K.S. Shanmugan and A.M. Breipohl, "Random Signals", pp. 529-534, J. Wiley & Sons, New York, 1988.
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.)
© 1997 Elsevier Science B.V. All rights reserved.
Multichannel Filters in Television Image Processing

K.N. Plataniotis, S. Vinayagamoorthy, D. Androutsos and A.N. Venetsanopoulos
Digital Signal & Image Processing Laboratory, Department of Electrical and Computer Engineering, University of Toronto, M5S 3G4, Toronto, Canada. E-mail: [email protected], URL: http://www.comm.toronto.edu/dsp/dsp.html

A novel multichannel filtering approach is introduced in this paper. The new filter, which is perfectly suitable for real-time implementation, can be used to remove impulsive noise and other impairments from color TV signals. The principles behind the new filter are explained in detail. Simulation results indicate that the new filter offers some flexibility and has excellent performance. Due to its inherent parallel structure and high regularity, the new filter can be implemented using array processors on VLSI hardware. With the advent of the all-digital TV system, such filters can lead to systems which would retain accurate image reproduction fidelity despite possible transmission problems.

1. INTRODUCTION

Image filtering refers to the process of noise reduction in an image. As such, it utilizes the spatial properties of the image and is characterized by memory. Filtering is an important part of any image processing system, whether the final image is utilized for visual interpretation or for automatic analysis [1]. Filtering of multichannel images has recently received increased attention due to its importance in the processing of color images. It is widely accepted that color conveys information about the objects in a scene and that this information can be used to further refine the performance of an imaging system. Thus, the generation of high-quality color images is of great interest.

Noise in an image may result from sensor malfunction, electronic interference, or flaws in the data transmission procedure. In considering the signal-to-noise ratio over practical mediums, such as microwave or satellite links, there would be a degradation in quality due to the weak received signal. Degradation of the broadcasting quality can also be a result of processing techniques, such as aperture correction, which amplifies both high-frequency signals and noise. The appearance of the noise and its effect is related to its characteristics. Noise signals introduced during the transmission process are random in nature, resulting in abrupt local changes in the image data. These noise signals cannot be adequately described in terms of the commonly used Gaussian noise models [1]. Rather, they can be characterized as 'impulsive' sequences which occur in the form of short-duration, high-energy spikes attaining large amplitudes with probability higher than the probability predicted by a Gaussian density model. There are various sources that can generate impulsive noise: man-made phenomena, such as car ignition systems, industrial machines in the vicinity of the receiver, switching transients in power lines and various unprotected electric switches; in addition, natural phenomena, such as lightning in the atmosphere and ice cracking in the antarctic region, also generate impulsive noise. Impulsive noise is frequently encountered during the transmission of TV signals through UHF, VHF, terrestrial microwave links and FM satellite links. It is therefore important to develop a digital signal processing technique that can remove such image impairment in real time and thus guarantee the quality of service delivered to the consumers. Such a system is proposed here. A new two-stage multidimensional color filter is developed. The color filter is applied on-line on the digitized image frames in order to remove image noise.

A number of digital techniques have been applied to the problem, aiming to smooth out impulsive noise and restore TV images. In [2], [3] a multi-shell median filter has been introduced. The approach introduced in [3] is applicable only to gray-scale images. Since the TV signal is a color signal, such an approach can be applied only to the luminance component of the transmitted signal, without any reference or association to the corresponding chrominance signals. However, there is some indication that noise correlation among the different image channels exists in real color images. Particularly, in the case of the NTSC television broadcast signal, if there is any degradation of the chrominance signal that is broadcast, both the I and Q components would be affected simultaneously [4]. Therefore, noise removal operations on only one channel are not adequate, and a multichannel filter is necessary to remove the noise and restore the originally transmitted signal.

2. A MULTICHANNEL FILTER FOR IMPULSIVE NOISE REDUCTION

Impulsive noise can be classified as a short-duration, high-energy spike, which results in the alteration of the digital value of the image pixel. After the effect of the noise, the altered value of the image pixel usually differs from the corresponding values of the neighboring pixels. However, in TV signals, all kinds of scenes, pictures or images are transmitted. Thus, it is important for the filter to differentiate between impulsive noise and other image features, such as intended dots or thin lines in the image, which may resemble this kind of noise. For the removal of impulsive noise the class of median filters is considered the most appropriate [1]. However, repeated applications of a median filter in a filtering window centered around a pixel of the image will probably remove the noise but will also reduce the resolution of the image by filtering out thin lines and details. Similarly, using a larger filtering window (e.g., 5×5 instead of 3×3) might result in better noise removal, but will blur the fine details of the image. Thus, to filter out noise and preserve image details a different approach is necessary.

A two-stage adaptive median filter is introduced. As with any other nonlinear filter, a working area (window or template) is centered around an image pixel [1], [5]. To prevent thin lines and intended spots in the image from being altered through the nonlinear
filtering process, we apply directional median filters inside the processing window. In other words, instead of a combined median filter applied to the whole window, four different median filters are applied across the four main directions at 0°, 45°, 90°, 135° (see Fig. 1). Be aware that the pixel at the window center (the pixel under consideration) belongs to all four sets. If the pixel under consideration has considerably larger or smaller values than those of the other pixels along a specific direction, it will be treated as an outlier and replaced by the median value across this specific direction. Otherwise the value remains unchanged during this operation. Thus, by employing filtering across the main directions, lines and other fine details will be preserved. In a second stage, another median operates on the four filtered results to generate the final output. This directional vector processing median can be considered as an extension of the different multistage medians [6]-[8] to vector processing.

The mathematical description of the filter can be summarized as follows. Let y(x): Z^l → Z^m represent a multichannel signal and let W ∈ Z^l be a window of finite size n × n (square window with filter length n²), where n is generally an odd number. The pixel under consideration x_i,j is at the window center. The noisy vectors (n² in total) inside the window W are noted as:

x_{i+k, j+l},   k, l = 0, ±1, ±2, ..., ±(n-1)/2   (1)
The median filter applied along the 0° direction operates on the horizontal pixels, across and including the center pixel x_i,j, noted as (see Fig. 1):

x_{i, j+l},   l = 0, ±1, ±2, ..., ±(n-1)/2   (2)
For simplification and clarity, let these vectors be h_1, ..., h_n (h stands for the horizontal direction). Now, according to the standard vector median operation, a scalar distance d_p can be defined for vector h_p, p = 1, ..., n, as d_p = Σ_{q=1}^{n} ‖h_p - h_q‖_{L1}, where ‖h_p - h_q‖_{L1} is the L1 norm or city-block distance between the vectors h_p and h_q. An ordering of the d_p's as d_(1) ≤ d_(2) ≤ ... ≤ d_(n) implies the same ordering of the corresponding h_p's: h_(1) ≤ h_(2) ≤ ... ≤ h_(n), where h_(p) is the p-th order statistic [1]. The vector median y_1 along the 0° direction is defined as y_1 = h_(1). Similarly, the process is repeated for the other three directions. The vectors f_p, p = 1, ..., n (f stands for the 45° direction), representing those pixels along the 45° direction, are (see Fig. 1):

x_{i-k, j+k},   k = 0, ±1, ±2, ..., ±(n-1)/2   (3)
The vector median y_2 along the 45° direction is then defined as y_2 = f_(1). For the 90° direction, the corresponding vectors v_p, p = 1, ..., n (v stands for vertical, i.e. the 90° direction), are (see Fig. 1):

x_{i-k, j},   k = 0, ±1, ±2, ..., ±(n-1)/2   (4)
The vector median y₃ along the 90° direction is given as y₃ = v_(1). Finally, the vectors r_p, p = 1, …, n (r stands for the reverse 45°, i.e. 135°, direction) representing the pixels along the 135° direction are (see Fig. 1):

x_{i+k,j+k},   k = 0, ±1, ±2, …, ±(n−1)/2.   (5)
The vector median y₄ along the 135° direction is thus defined as y₄ = r_(1). In the second stage, a vector median filter is applied to the four vector median outputs y₁, y₂, y₃ and y₄ obtained in the directional filtering of the previous stage. Hence, the final output x_DVMF of this Directional Vector Median Filter (DVMF) is derived as:

x_DVMF = y_(1),   (6)

where y_(1) is the first order statistic of the ordered sequence of vectors y_p, p = 1, …, 4.
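The two-stage computation of Eqs. (1)-(6) is compact enough to sketch directly. The following Python fragment is an illustrative sketch, not the authors' implementation; the array layout, helper names and the absence of boundary handling are our assumptions:

```python
import numpy as np

def vector_median(vectors):
    """Return the sample minimizing the sum of L1 distances to all
    other samples in the set: the vector median (h_(1), f_(1), ...)."""
    v = np.asarray(vectors, dtype=float)                          # (n, channels)
    d = np.abs(v[:, None, :] - v[None, :, :]).sum(axis=(1, 2))    # d_p of the text
    return v[np.argmin(d)]

def dvmf_pixel(img, i, j, n=3):
    """Two-stage directional vector median (Eq. 6) at interior pixel
    (i, j) of an H x W x 3 image; n is the odd window size."""
    r = (n - 1) // 2
    ks = range(-r, r + 1)
    h = [img[i, j + k] for k in ks]           # 0 degrees (horizontal), Eq. (2)
    f = [img[i - k, j + k] for k in ks]       # 45 degrees, Eq. (3)
    v = [img[i - k, j] for k in ks]           # 90 degrees (vertical), Eq. (4)
    rv = [img[i + k, j + k] for k in ks]      # 135 degrees, Eq. (5)
    y = [vector_median(d) for d in (h, f, v, rv)]   # first stage
    return vector_median(y)                          # second stage, Eq. (6)
```

Applying dvmf_pixel over all interior pixels reproduces the 3 × 3 filtering used in the experiments described next.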
This new Directional Vector Median Filter (DVMF) was applied to different color images, namely Lenna, Pepper and Lake, to assess its performance qualitatively. First, the original images were corrupted with 4% impulsive noise and 50% noise correlation between the red, green and blue channels using an appropriate noise generator [1]. Then the DVMF, with a window size of 3 × 3 (a small window), was applied to the corrupted images, and the filtered output images were displayed and compared visually with the originals. In all three images, no impulsive noise is visible. In addition, all the edge information, thin lines and fine details are well preserved.
3. REMOVAL OF MISSING LINES
An additional motivation for directional filtering is the problem of missing lines in TV signals, which, in addition to impulsive noise, has been observed to be a common problem in TV signals transmitted over satellite or microwave links. Usually the signal along a horizontal line of one pixel width is lost and appears as either a white or a black line across the image. In other words, these lines appear as continuous impulsive noise along the horizontal direction. Normally, one or more such lines can appear in a single frame. Since such lines are horizontal, and most of the time have a width of one pixel, the horizontal direction is not considered in the filtering window W in such cases. Therefore, the DVMF for images with missing lines is re-defined as:
x_DVMF = y_(1),   (7)
where f_(1), v_(1) and r_(1) are the first order statistics of the ordered vector sequences f_p, v_p and r_p respectively, as before, and y_(1) is the first order statistic of the ordered sequence of vectors y_p, p = 1, …, 3. Two types of simulations were made to assess the performance of this modified DVMF in removing missing lines. When missing lines were inserted at random positions in the original image, the proposed directional multichannel filter was able to remove them perfectly, and nothing atypical could be visually detected at the locations where the lines had been introduced. To examine the robustness of the proposed filter, an extreme case was investigated in another simulation experiment by adding 4% impulsive noise on top of the random missing lines. Again, the filter performed well in removing
the impulsive noise as well as the missing lines. However, at some positions where missing lines existed, the filter failed to remove the noise completely and a few pixels with impulsive noise remain visible. This can be attributed to the fact that both missing lines and noisy pixels are contained within the filter window at those locations, so that more than 50% of the pixel values are outliers. Since the breakdown point ε* of the median filter is 0.5 [1], the directional median filter fails to remove the noise when the filter window contains more than 50% outliers. Nevertheless, the results seem to be fairly acceptable for viewing (Figs. 3-4). If the impulsive noise percentage is approximately 2%, a typical figure for most real systems, the filter performance improves and almost no noise can be detected visually. The proposed methodology can be applied on-line to any of the existing TV systems. Since it is a digital image processing technique, analog-to-digital (A/D) converters are necessary to transform the incoming analog TV signal to digital form. After that, a real-time digital signal processor board can be designed to implement the method. Due to its inherent parallelism and high regularity, the filter has a regular computational structure and can be implemented using array processors in VLSI hardware. Alternatively, a network of dedicated multiple microprocessors can be devised for its implementation.
4. CONCLUSIONS
A new adaptive filter was introduced in this paper. The new filter, which is well suited to real-time implementation, was used to remove impulsive noise and other impairments from color TV signals. Experimental results have been used to illustrate our discussion and to demonstrate the effectiveness of our method. In addition, we have outlined its hardware implementation, which makes the proposed solution particularly attractive. With the advent of all-digital TV systems, such filters can lead to systems with accurate image reproduction fidelity despite unforeseen transmission impairments.
REFERENCES
1. I. Pitas, A.N. Venetsanopoulos, Nonlinear Digital Filters: Principles and Applications, Kluwer Academic, Norwell, MA, 1990.
2. J. Siu, J. Li, S. Luthi, 'A real-time 2-D median based filter for video signals', IEEE Trans. on Consumer Electronics, vol. 39, no. 2, pp. 115-121, 1993.
3. C.J. Juan, 'Modified 2D median filter for impulse noise suppression in a real-time system', IEEE Trans. on Consumer Electronics, vol. 41, pp. 73-80, 1995.
4. B. Grob, Basic Television and Video Systems, McGraw-Hill, N.Y., 1984.
5. A.N. Venetsanopoulos, K.N. Plataniotis, 'Colour Image Software', Proceedings of the European Conference on Circuits and Systems Design, pp. 247-251, 1995.
6. G.R. Arce, 'Multistage order statistic filters for image sequences', IEEE Trans. on Signal Processing, vol. 39, pp. 1146-1163, 1991.
7. X. Wang, 'Adaptive multistage median filter', IEEE Trans. on Signal Processing, vol. SP-40, pp. 1015-1017, 1992.
8. X. Yang, P.S. Toh, 'Adaptive fuzzy multilevel median filter', IEEE Trans. on Image Processing, vol. 4, pp. 680-683, 1995.
Figure 1. The New Filter
Figure 2. Parallel Implementation
Figure 3. Corrupted image (4% impulsive noise and missing line)
Figure 4. Filtered result of the image in Figure 3
B PATTERN RECOGNITION
Blotch and Scratch Detection in Image Sequences based on Rank Ordered Differences
M. J. Nadenau and S. K. Mitra
Department of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106
The film material of old movies is very often degraded by randomly located blotches and scratches. These artifacts are caused by damage to the film surface or by dust covering small areas of the surface. Digitizing this kind of film material results in images having patches with gray level values that are uncorrelated with the pixel neighborhood. To avoid distorting the unaffected parts of the image, the locations of the blotches and scratches must be detected before the restoration algorithm is applied. These locations are especially characterized by a discontinuity in the sequence. In this paper we present an algorithm that works much more effectively and with a lower computational load than other known algorithms [1]. The proposed algorithm is based on rank ordered differences (ROD), which are calculated from the data of the current image frame and the preceding and succeeding motion compensated frames. The new algorithm is compared to existing detection algorithms in the form of probability plots and images indicating the correct, false and missing detections.
1. INTRODUCTION
Old movies are often valuable historical records, but most of them progressively deteriorate in visual quality over the years, decreasing their usefulness. The main visual defects are dust attached to the film, abrasions of the thin emulsion layer, and scratches caused by foreign bodies in the camera or the projector. Usually two types of distortion can occur. The first one, called a scratch, appears as a thin line of pixels of arbitrary shape with nearly equal gray level values. The second one, called a blotch, is a block or a small coherent area of pixels with very similar gray level values. Individual pixels in a scratch or a blotch are a kind of impulsive noise distortion, but they are specifically characterized by two properties. First, these distortions constitute a discontinuity, because they appear randomly in the image sequence and the probability that scratches or blotches in two succeeding image frames are located in the same place is very low. Second, these distortions are coherent areas of almost the same gray level. As a result, traditional impulse noise removal algorithms, like median filters, do not work very well in removing these distortions. We propose a new algorithm that takes advantage of the distortion characteristics mentioned above. To reduce the reconstruction errors at uncorrupted sites we have to separate the process of restoration from the process of detection. After the corrupted locations have been detected in a first stage, the missing data is restored in a second stage by an interpolation process. In this paper we only present an algorithm for efficient scratch and blotch detection. Using both blotch and scratch properties for the detection, we make use of data from the current frame, which allows us to detect the coherence of the blotches, but we also use data from the preceding and succeeding frames to detect a discontinuity of the gray levels in the temporal sequence of the
frames. The latter is only possible in combination with motion estimation; otherwise, every motion in the movie would also be detected as a temporal discontinuity, confusing the detector. The use of temporal filtering would then not represent any improvement over a filter using only the data of the current frame. The motion estimation we used is based on a hierarchical block matching algorithm [2], which is very stable and sufficient for our purposes. It is likely that another, more complex motion estimation algorithm would give slightly better results, but the additional computational load is not justified by the associated improvement. In this paper we present a detection algorithm which is attractive on the one hand for its low computational load and on the other hand for its efficiency, which is better than that of any other comparable algorithm. The detector uses image data from the spatial and temporal pixel neighborhood and calculates a set of rank ordered differences (ROD). In a second step, three threshold tests are used to indicate whether the examined pixel is corrupted or not. Three different detectors are compared and applied to a sequence of 64 black and white images that have been distorted by artificial blotches. These blotches represent real blotches quite well, and the knowledge of the blotch positions makes it possible to calculate detection error rates. The generation of these blotches is described in [1]. The results of the three detectors, applied to a single image frame and to the whole image sequence, are presented here along with error rate plots. Images indicating the correct, false and missing detections provide a visual verification of the effectiveness of the algorithms.
2. MOTION ESTIMATION
In our approach we use a hierarchical block matching motion estimation similar to the motion estimator described in [2]. In a movie an object is sometimes covered or uncovered from one frame to the next, but this kind of "natural discontinuity" usually appears only from the preceding to the current frame or from the current to the succeeding frame. To avoid false detections of these discontinuities as corrupted areas, we make use of all three frames, the current, preceding and succeeding image frames, for detection. First, the algorithm builds image pyramids of each of the three frames. Starting with the original image, the next level is generated by filtering with a circularly symmetric Gaussian filter with a kernel size of 7 × 7 and downsampling by a factor of 2 in both the horizontal and vertical directions. This procedure, repeated recursively, results in representations of the original image at different resolutions. In our case we use a four-level image pyramid. The original frame is level 0 with a size of 256 × 256, level 1 represents the same image at 128 × 128, and so forth.
Figure 1: Image pyramid
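A minimal sketch of the pyramid construction just described; the 7 × 7 kernel size comes from the text, while the Gaussian width and the use of scipy are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_pyramid(frame, levels=4):
    """Four-level pyramid: smooth each level with a 7-tap circular
    Gaussian, then subsample by 2 in both directions."""
    pyr = [np.asarray(frame, dtype=float)]
    for _ in range(levels - 1):
        # sigma and truncate chosen so the kernel spans ~7 x 7 taps (assumed)
        smooth = gaussian_filter(pyr[-1], sigma=1.5, truncate=2.0)
        pyr.append(smooth[::2, ::2])          # factor-2 downsampling
    return pyr   # pyr[0]: 256 x 256, pyr[1]: 128 x 128, ...
```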
Next we begin the estimation of the motion at the highest level of the pyramid. These estimated motion vectors are used as initial vectors at the next lower level. This kind of hierarchical process ensures that all the different magnitudes of motion in a movie can be detected.
Figure 2: Seeking process
At every level the same seeking procedure is applied. The image data of the particular level is segmented into rectangular macroblocks; in our approach we use a size of 4 × 4 pixels. Each of these macroblocks is shifted around in a certain seeking area to find the motion vector that delivers the best match for the macroblock. This is done by thresholding the summed absolute difference (SAD) between the pixels in the current block and those in the block at the shifted position in the previous or succeeding frame. The seeking procedure is started only when the SAD for the non-shifted macroblock is greater than a certain threshold (here 120). The search space is defined by setting the maximum expected displacement to ±4 pixels. The displacement with the lowest SAD (E_min) is stored as a possible motion vector at this level. This value E_min is compared to the SAD of "no motion", E_0. If the ratio r = E_0/E_min is greater than a threshold T_0 (here T_0 = 2.3), the motion vector is accepted at the next level as an initial vector; otherwise it is treated as a spurious match [3]. In the latter case the motion vector of this macroblock is generated by interpolation: it is set to the average of the motion estimates of the three neighbours with the lowest SAD. After estimation at the lowest (256 × 256) level, the vector estimates are compared with the difference picture, which represents the pixel differences between the motion-uncompensated previous image frame and the current one. At those locations where the difference is less than a certain threshold, the motion vector is set to zero. This reduces the errors accumulated from one level to the next. Finally, the vector estimate for every pixel of the frame is obtained by interpolation of the macroblock motion vectors.
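The seeking step at one level can be sketched as follows; the block size, the ±4-pixel search range, the start threshold of 120 and T_0 = 2.3 are taken from the text, while the function boundaries and names are our own assumptions:

```python
import numpy as np

def match_block(cur, ref, bi, bj, init, bs=4, srange=4,
                t_start=120.0, t0=2.3):
    """One seeking step: SAD search around the initial vector `init`
    for the bs x bs macroblock at (bi, bj) of `cur` inside `ref`."""
    block = cur[bi:bi + bs, bj:bj + bs].astype(float)
    e0 = np.abs(block - ref[bi:bi + bs, bj:bj + bs]).sum()  # SAD of "no motion"
    if e0 <= t_start:                      # below threshold: do not search
        return (0, 0)
    best, e_min = (0, 0), e0
    for dy in range(-srange, srange + 1):
        for dx in range(-srange, srange + 1):
            y, x = bi + init[0] + dy, bj + init[1] + dx
            if 0 <= y <= ref.shape[0] - bs and 0 <= x <= ref.shape[1] - bs:
                sad = np.abs(block - ref[y:y + bs, x:x + bs]).sum()
                if sad < e_min:
                    e_min, best = sad, (init[0] + dy, init[1] + dx)
    # Ratio test r = E0 / E_min: reject spurious matches
    return best if e0 / max(e_min, 1e-9) > t0 else None
```

A None return marks a spurious match, to be replaced by the average of the motion estimates of the three neighbours with the lowest SAD, as described above.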
3. DETECTORS
3.1. ROD Detector
The rank-ordered difference (ROD) detector presented in this paper is a modified form of the signal-dependent rank ordered mean filter (SD-ROM) used for restoration of an impulse noise corrupted image [4]. While the SD-ROM filter works exclusively in the spatial area of one image frame and is only able to remove one or two pixel wide distortions, the ROD filter is designed to work on image sequences. The ROD filter processes in both spatial and temporal
domains; that is, it uses data from the current frame and from the preceding and succeeding ones. It is able to detect both thin scratches and blotches. The aim of the algorithm is to determine whether a pixel x(k) in the current frame is corrupted or not. The vector k = (x₀, y₀, n) describes the position of the pixel being tested. In the first step we define a six-element vector p(k):

p(k) = [p₁(k), p₂(k), p₃(k), p₄(k), p₅(k), p₆(k)]
     = [x(x₀, y₀−1, n−1), x(x₀, y₀, n−1), x(x₀, y₀+1, n−1),
        x(x₀, y₀−1, n+1), x(x₀, y₀, n+1), x(x₀, y₀+1, n+1)]   (1)

p(k) contains three pixels of the preceding frame n−1 and three pixels of the succeeding frame n+1. The center pixel x(k) and the six elements of p(k) form the input data for the filter. Figure 3 clarifies the arrangement of the input pixels. If k points to a blotch, the value x(k) usually differs from the gray values of the elements of p(k). Depending on the motion, on gray value changes caused by lighting conditions in the scene, and on the gray value of the blotch itself, the difference between x(k) and the elements of p(k) varies between very small and large values.
Figure 3: Pixel arrangement
The values above and below k in the n-th frame are excluded because, in the case of blotches, their values are very similar to the value of x(k); including them would only reduce the possibility of detecting an unusual difference between x(k) and the neighborhood pixels. However, similar values for the pixels above and below k do not necessarily imply the presence of a blotch, because scratches can be one pixel wide. If k points to the border of a blotch, only the value above or below k would be similar to x(k). Taking all these different cases into account would unnecessarily increase the complexity of the algorithm. The six pixel values of p(k) are ordered by rank, which gives the vector r(k):

r(k) = [r₁(k), r₂(k), r₃(k), r₄(k), r₅(k), r₆(k)]   (2)

containing the values p₁ … p₆ in a rank ordered sequence. Next we define the rank-ordered mean m(k) as:

m(k) = (r₃(k) + r₄(k)) / 2   (3)

The rank-ordered differences are defined as d(k) = [d₁(k), d₂(k), d₃(k)], where:

d_i(k) = r_i(k) − x(k)        if x(k) ≤ m(k)
d_i(k) = x(k) − r_{7−i}(k)    if x(k) > m(k),    ∀ i = 1 … 3   (4)
The rank-ordered differences provide information about the likelihood that the pixel at location k is corrupted. Finally, comparison with preselected thresholds determines whether a pixel is corrupted or not. The location k is detected as corrupted if at least one of the following inequalities is true:

d_i(k) > T_i,    i = 1 … 3   (5)
T₁, T₂ and T₃ are preselected threshold values with T₁ < T₂ < T₃. The most important threshold value for detection is T₁; the thresholds T₂ and T₃ are necessary but of secondary importance, and can be used to optimize the rate of correct detection by varying both of these values. On the other hand, the rate of correct detection depends greatly on the appropriate selection of T₁. In the case of a restoration process for a commercial application, it could be useful to extend the algorithm with an adaptive setting of T₁, to automatically obtain an optimal detection for every frame.
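Pulling Eqs. (1)-(5) together, the per-pixel test can be sketched as below. It assumes the neighbouring frames are already motion compensated; T₂ = 39 and T₃ = 55 are the values used in Section 4, and T₁ = 8 is an arbitrary choice within the 1-38 range explored there:

```python
import numpy as np

def rod_is_corrupted(prev, cur, nxt, x0, y0, T=(8.0, 39.0, 55.0)):
    """ROD test of Eqs. (1)-(5) at pixel (x0, y0) of the current frame;
    prev and nxt are the motion-compensated neighbouring frames."""
    x = float(cur[x0, y0])
    p = [prev[x0, y0 - 1], prev[x0, y0], prev[x0, y0 + 1],   # Eq. (1)
         nxt[x0, y0 - 1], nxt[x0, y0], nxt[x0, y0 + 1]]
    r = np.sort(np.asarray(p, dtype=float))                  # Eq. (2)
    m = (r[2] + r[3]) / 2.0                                  # Eq. (3)
    if x <= m:                                               # Eq. (4)
        d = [r[i] - x for i in range(3)]                     # r1-x, r2-x, r3-x
    else:
        d = [x - r[5 - i] for i in range(3)]                 # x-r6, x-r5, x-r4
    return any(d[i] > T[i] for i in range(3))                # Eq. (5)
```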
3.2. SDIa Detector
The SDIa detector is similar to the spike detection index (SDI) algorithm presented in [5]. It is the simplest and earliest of the three detectors discussed in this paper. Like the ROD, the SDIa detector is based on a heuristic approach to detecting temporal discontinuities in an image sequence. The main idea of the detector is given by the following equations:

e_b = (I_n(r) − I_{n−1}(r + v_{n,n−1}(r)))²
e_f = (I_n(r) − I_{n+1}(r + v_{n,n+1}(r)))²

D_SDIa(r) = 1   if (e_b > e_t) ∧ (e_f > e_t)
D_SDIa(r) = 0   otherwise   (6)
where I_n(r) is the pixel intensity at location r in the n-th image frame, and the motion compensation vectors in the backward and forward directions are v_{n,n−1}(r) and v_{n,n+1}(r). An examined location is marked as corrupted if both the forward and backward squared frame differences (e_f, e_b) are greater than a certain threshold e_t.
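Equation (6) amounts to a two-sided threshold test, sketched here under the assumption that the neighbouring frames have already been motion compensated:

```python
def sdia_is_corrupted(prev_mc, cur, nxt_mc, r, e_t):
    """SDIa test of Eq. (6); prev_mc and nxt_mc are already motion
    compensated, r is a (row, col) tuple."""
    e_b = (float(cur[r]) - float(prev_mc[r])) ** 2   # backward difference
    e_f = (float(cur[r]) - float(nxt_mc[r])) ** 2    # forward difference
    return (e_b > e_t) and (e_f > e_t)
```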
3.3. MRF Detector
The MRF detector is based on a Markov random field (MRF) model. The model is not applied to the image itself; instead, it is used to model the blotches of an image by creating a blotch detection frame D. This can be considered as an additional virtual frame between two real image frames of the sequence, containing only the blotches and no real image information. For a possible configuration (D = d) of the complete detection frame D, the presence of a blotch at position r is indicated by d(r) = 1, while d(r) = 0 represents an uncorrupted location. In [1] the following a posteriori joint distribution for the detection frame D is given:

P(D = d | I = i) ∝ exp{ − Σ_{r∈S} [ α(1−d(r))(i(r) − i(N_c))² − β₁ f(d(r)) + β₂ δ(1−d(r)) ] }   (7)
where α, β₁ and β₂ are certain parameters, i(r) is the pixel intensity at location r in the current frame, i(N_c) is the pixel intensity of the motion-compensated other real image, the function f(d(r)) gives the number of the four neighbors of d(r) with the same value as d(r), δ( ) is the delta function, and S describes the possible area for r, namely the whole image frame. With Eq. (7) the probability of a certain configuration of D can be evaluated. The Gibbs sampler with annealing is used to find the maximum a posteriori (MAP) configuration of the detection frame D, given the data and the model for blotches. First, this technique is applied to the current and preceding frames; next, it is applied to the current and succeeding frames. Only at those sites where a discontinuity is estimated both times is the location classified as corrupted. The search for the MAP is carried out in an iterative manner; after approximately 5 iterations the algorithm is assumed to have converged.
4. SIMULATIONS
To compare the efficiency of the three detectors we use the same black and white image sequence, WESTERN, which was used in [1]. The images have a size of 256 × 256 pixels and contain gray values in the range between 0 and 255. The sequence is artificially corrupted with blotches of random gray values, quite realistic in size and shape. This makes it possible to compute the false alarm and correct detection rates. First, we discuss the detector efficiencies by applying all three methods to the whole sequence of 64 image frames. We then show a typical frame to provide a visual demonstration of the detection algorithms. The motion in the image sequence is estimated from the degraded frames using the four-level estimation process described above in Section 2. This is a different motion estimation process from the one used in [1]; therefore the results for the MRF and SDIa detectors might be slightly different. Figure 4 shows a plot of the correct detection rate versus the false alarm rate for the ROD, MRF and SDIa detectors, applied to the whole sequence. The probabilities of false alarm and correct detection are defined as:
P_fa = n_fa / N,    P_co = n_co / (n_co + n_mi)   (8)

where n_fa is the number of false detections, n_mi the number of missing detections, n_co the number of correct detections, and N the number of pixels per frame. We used 1 < T₁ < 38, T₂ = 39, T₃ = 55 as parameters for the ROD detector. For the MRF detector the best results out of the parameter range 6 < e₁ < 34, 14 < e₂ < 54 have been chosen, and the SDIa curve has been generated by measurements for 50 < e_t < 2000 in steps of 25.
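Given boolean masks of detected and true blotch pixels, the rates of Eq. (8) are computed directly; a sketch assuming numpy arrays:

```python
import numpy as np

def detection_rates(detected, truth):
    """P_fa and P_co of Eq. (8) from boolean masks of one frame."""
    n_fa = np.sum(detected & ~truth)    # false detections
    n_co = np.sum(detected & truth)     # correct detections
    n_mi = np.sum(~detected & truth)    # missing detections
    return n_fa / truth.size, n_co / (n_co + n_mi)
```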
Figure 4: Performance of detectors applied to the whole sequence WESTERN (correct detection rate versus probability of false alarm)
Figure 5: Performance of detectors applied to frame 49 (correct detection rate versus probability of false alarm)
Clearly the performances of the MRF detector and the SDIa detector are very similar. A slightly better result is obtained with the MRF detector, but the difference is marginal. The ROD curve shows the fundamental improvement of the new detector over the other approaches: for a correct detection rate of 80%, the ROD detector has about 2.5 times fewer false detections, which provides a much more feasible basis for the restoration process to be carried out in the second step. The MRF detector, although a very complex algorithm, does not perform better than the SDIa or ROD approaches. Figure 5 provides a comparison of the detector performances as applied to frame 49 of the sequence WESTERN, shown in Figure 6. This frame contains, on average, more poorly contrasted blotches than the others. In this special case the MRF detector is able to exploit its better capability of detecting poorly contrasted blotches through its spatial connectivity. In fact, the MRF detector performs perceptibly better than the SDIa detector, but the new ROD approach still provides superior performance. For a correct detection rate of 93%, the false alarm rates are: ROD 0.096%, MRF 0.81%, SDIa 0.95%. That means the performance of the ROD detector is 10 times better than that of the SDIa detector; the difference between the ROD and MRF detectors is also almost an order of magnitude. To visualize the differences in detection, Figures 7-9 show the detector results. Green pixels indicate correct detections, red pixels mark missing detections and brown pixels represent false detections. The chosen bias points of the algorithms provide the same correct detection rate, so the number of green marked areas should be almost equal. To compare the detector performances, attention should be focused on the number of brown pixels. All three detectors produce false alarms in the area of the white coat lining, because it appears only in frame 49; in the preceding and succeeding frames it is covered. For a three-frame based approach, this demonstrates the limit of an automatic detection process. In the same way, all detectors miss the blotch on the right shoulder of the main person; this blotch is too poorly contrasted for all the algorithms. While the ROD detector provides nearly perfect detection with the fewest false detections, the other two detectors sometimes produce small coherent areas of false detections.
A restoration process applied to these locations definitely causes a degradation of the fine details of the picture. From the implementation point of view, the computational complexity of these algorithms is of greater interest. We define the cost of an addition, subtraction, multiplication or division as 1, and the cost of an exp-function as 20, according to the numbers used in [1]. The SDIa detector uses only 6 operations, while the MRF approach needs about 60 operations in the forward and 60 operations in the backward direction; for 5 iterations, this results in about 600 operations per pixel, a difference of two orders of magnitude with respect to the SDIa approach. The ROD detector requires only 24 operations and is thus a very easily implementable algorithm.
5. CONCLUSIONS
In this paper we introduced a very efficient blotch and scratch detector of quite low computational complexity. The proposed detector delivers a very solid basis for the next steps in image restoration. If it becomes necessary to increase the efficiency of the ROD detector, even at the cost of a higher computational load, the spatial information of the current frame will have to be used in a sophisticated way, combined with the algorithm presented in this paper.
ACKNOWLEDGEMENT
This research is part of an ongoing Alexandria digital library project being carried out at the University of California, Santa Barbara under NSF Grant Number IR194-11330.
REFERENCES
[1] A. C. Kokaram, R. D. Morris, W. J. Fitzgerald and P. J. W. Rayner, "Detection of Missing Data in Image Sequences", IEEE Trans. Image Processing, Vol. 4, No. 11, pp. 1496-1508, Nov 1995
[2] W. Enkelmann, "Investigations of Multigrid Algorithms for the Estimation of Optical Flow Fields in Image Sequences", Computer Vision Graphics and Image Processing, Vol. 43, pp. 150-177, 1988
[3] J. Boyce, "Noise reduction of image sequences using adaptive motion compensated frame averaging", IEEE ICASSP, vol. 3, 1992, pp. 461-464
[4] E. Abreu and S. K. Mitra, "A Signal-Dependent Rank Ordered Mean (SD-ROM) Filter - A New Approach for Removal of Impulses from Highly Corrupted Images", IEEE ICASSP, Detroit, MI, USA, 9-12 May 1995, vol. 4, pp. 2371-2374
[5] A. C. Kokaram and P. J. Rayner, "A system for the removal of impulsive noise in image sequences", SPIE Visual Communication and Image Processing, 1990, pp. 122-133
[6] R. D. Morris, "Image sequence restoration using Gibbs distributions", Ph.D. thesis, University of Cambridge, UK, May 1995
Top left illustration - Figure 6: Corrupted frame 49 of sequence WESTERN
Top right illustration - Figure 7: SDIa detector applied to frame 49
Bottom left illustration - Figure 8: MRF detector applied to frame 49
Bottom right illustration - Figure 9: ROD detector applied to frame 49
Feature Matching by Optimization using Environmental Constraints
A. Branca, E. Stella, G. Attolico, A. Distante
Istituto Elaborazione Segnali ed Immagini - C.N.R., Via Amendola 166/5, 70126 Bari, ITALY
Phone (39) 80-5481969, Fax (39) 80-5484311, branca@iesi.ba.cnr.it
Matching is the capability to find correct correspondences among features extracted from two images of a scene acquired from different points of view or after TV camera motion. 3D stereo reconstruction and optical flow estimation are contexts of image understanding in which matching has a fundamental role. We describe a feature-based approach to the correspondence problem. Our goal is to correct initial matches, obtained by correlation, by minimizing an appropriate energy function using as constraint the invariance of the cross ratio evaluated among coplanar points.
1. Introduction
Time-varying images of real-world scenes can provide kinematical, dynamical, and structural information about the world. To estimate the 3D motion and the structure of objects from image sequences, it is often necessary to establish correspondences between images, i.e., to identify in the images the projections corresponding to the same physical part of the sensed scene. The existing techniques for general two-view matching roughly fall into two categories: continuous and discrete. In this work a general method to perform discrete feature matching between images acquired at different times or from two different views is proposed. Generally, the discrete matching techniques proposed in the literature are implemented either by direct methods, using local constraints on features [8], [1], or through optimization methods, using global constraints on features [5], [2], [10] to formulate an energy function to be minimized. While the direct methods are fast but more sensitive to noise, the optimization based techniques are more reliable, though they have the drawback of requiring burdensome processing. Energy minimization based approaches have been used extensively in the literature [6][9], and most of them formulate the energy functional using constraints determined from feature characteristics such as uniqueness, ordering and disparity continuity. An unexplored direction is to include projective geometric invariance constraints in the optimization process computing feature correspondences. In this paper an optimization method including the cross-ratio invariance constraint of five coplanar points is proposed to solve the correspondence problem. The geometric invariance of the cross-ratio of five coplanar
points has been used in the literature as a constraint for optimal match selection in tracking algorithms, planar region detection and object recognition using probabilistic analysis [3], [11], [4], [12]. The performance of probabilistic approaches depends on the choice of the rule for deciding whether five image points have a given cross-ratio [7]. In our method projective invariance constraints are included directly in the optimization process. We propose a feature-based approach that solves the correspondence problem by minimizing an appropriate energy function in which constraints on radiometric similarity and on the projective geometric invariance of coplanar points are defined. The method can be seen as a correlation based approach which takes into account the projective invariance of coplanar points in computing the optimal matches. In the following sections the algorithm used for optimal match selection (section 2) and the minimization technique implemented to correct all mismatches (section 3) are described. The experimental results (section 4) show that the approach provides good estimates of visual correspondences.
2. Raw Match Computation and Mismatch Selection
Our aim is to define a new optimization algorithm for solving the correspondence problem using the perspective invariance of the cross ratio. Displacement vectors should be estimated only for features of "high" interest (extracted using the algorithm proposed in [8]), salient points that can be matched more easily than other points. Initially, raw matches are computed by maximizing the radiometric similarity between windows in the first image centered on high variance features and candidate features in the second image. Such matches represent the initial guess that will be improved through an optimization process. Our idea is to use the geometric invariance of the cross ratio CR(P) of five coplanar points P = (p₁, p₂, p₃, p₄, p₅):
CR(P) = (sin(α₁₃) · sin(α₂₄)) / (sin(α₁₄) · sin(α₂₃))   (1)

(where sin(α_ij) is the sine of the angle subtended at p₅ by p_i and p_j) to verify the goodness of the matches estimated through radiometric similarity, and at the same time to correct all mismatches. This requires satisfying the constraint that, for all matches among neighboring points, given five points P_{ijklm} = {p_i, p_j, p_k, p_l, p_m} in the first image and the corresponding points Q_{ijklm} = {q_i, q_j, q_k, q_l, q_m} in the second image, the cross ratio computed for each subset must be the same. Previous works proposed in the literature use the value of the cross ratio computed on a single group of five points to verify their coplanarity or match correctness. Evaluation performed on a single group must involve the use of thresholds, since a small error in locating points can cause large variations in the cross ratio value. These problems can be overcome if many combinations of five image points are considered. A mismatch, or a point not coplanar with its neighborhood, will be easily identified by considering the cross-ratio similarity computed over all the groups containing it.
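Numerically, the cross ratio of Eq. (1) does not require evaluating any angles: each sine can be replaced by a signed 2-D cross product of the rays from p₅, since the ray lengths cancel between numerator and denominator. A sketch (the point ordering, with p₅ as the pencil center, is an assumption based on our reading of Eq. (1)):

```python
import numpy as np

def cross_ratio(points):
    """Cross ratio of five coplanar points (Eq. 1), with points[4] = p5
    acting as the pencil center; ray lengths cancel in the ratio."""
    p = np.asarray(points, dtype=float)          # shape (5, 2)
    u = p[:4] - p[4]                             # rays from p5 to p1..p4
    c = lambda a, b: a[0] * b[1] - a[1] * b[0]   # signed 2-D cross product
    return (c(u[0], u[2]) * c(u[1], u[3])) / (c(u[0], u[3]) * c(u[1], u[2]))
```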
3. Mismatch Correction through Optimization
Cross-ratio similarity computed on a large number of groups can be used as a constraint to correct the mismatches generated using radiometric similarity. We propose to solve the correspondence problem by imposing that the sum of all differences between the cross ratio computed for each considered subset of five features of the first image and the cross ratio computed for the matched points in the second image must be minimized. The energy function to be minimized to solve the correspondence problem is:

E = Σ_{n=1}^{NSub} ||CR(P_n) − CR(Q_n)||² + Σ_{i=1}^{NFeat} R_i   (2)
where P_n and Q_n are the n-th subsets of five points from the first and second image respectively, and the term R_i imposes that corresponding features in the first and the second image must have radiometric similarity. The norm E will be minimized only when its partial derivatives with respect to all points q_i in the second image equal zero. Satisfying this condition for each of the q_i generates a system of NFeat simultaneous equations in NFeat unknowns. Since the problem is nonlinear, we use an iterative approach to compute the optimal solution. Each match is updated iteratively by an amount given by the partial derivative of the energy function with respect to the same point, scaled by a parameter β determined at each iteration using the method of conjugate gradients:

∀i:  q_i ← q_i − β ∂E/∂q_i   (3)
Since the partial derivatives of the cross ratio estimated for a subset Q_n with respect to a point q_i ∈ Q_n depend on all points q_k, k = 1 … 5, of Q_n, the update of q_i depends on the radiometric similarity of the other points {q_k ∈ Q_n, q_k ≠ q_i}. Correct matches (with high radiometric similarity) influence the update positively; mismatches (with low radiometric similarity), on the other hand, prevent any update depending on them. Starting from approximate matches, the algorithm improves the solution until a predetermined convergence criterion is satisfied. Due to the nonlinearity of the system, more than one solution can exist. Success in reaching a global minimum, without being trapped in a local minimum, depends on having a good first guess for the solution. The goal is to reject the noise introduced by the correlation measurements. The approach we propose converges through iteration upon the desired correspondence points {q_i} by implementing gradient descent along the E(q_i) surface, which expresses the dependency of the quadratic cost function on all of the {q_i} points. The correct matches are not changed, because the computed adaptation signal is zero owing to the satisfaction of the geometrical constraint. On the other hand, mismatches are influenced by correct matches, which determines the noise rejection. When a stable state is reached, the energy function value in each subset provides useful information to identify coplanar features.
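A deliberately simplified sketch of the iterative correction follows: it minimizes only the cross-ratio term of Eq. (2) by plain gradient descent with numerical partial derivatives, whereas the method described above uses analytic derivatives, a conjugate-gradient step size β and the radiometric term R_i. It reuses cross_ratio() from the previous sketch:

```python
import numpy as np

def refine_matches(groups, P, Q, beta=0.05, iters=100, eps=0.5):
    """Gradient descent on the cross-ratio energy of Eq. (2) (radiometric
    term omitted). groups: 5-tuples of feature indices; P, Q: (N, 2)
    numpy arrays of matched point coordinates."""
    Q = np.asarray(Q, dtype=float).copy()

    def energy(q):
        return sum((cross_ratio(P[list(g)]) - cross_ratio(q[list(g)])) ** 2
                   for g in groups)

    for _ in range(iters):
        grad = np.zeros_like(Q)
        for i in range(len(Q)):
            for a in (0, 1):                 # numeric partial derivatives
                qp, qm = Q.copy(), Q.copy()
                qp[i, a] += eps
                qm[i, a] -= eps
                grad[i, a] = (energy(qp) - energy(qm)) / (2 * eps)
        Q -= beta * grad                     # descend the E surface
    return Q
```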
4. Experimental Results
The experimental results reported in this paper have been obtained from tests performed on time-varying image sequences. The tests have been performed by considering pairs of images of the same static scene acquired at different times while the TV camera moves forward along its optical axis (the resulting optical flow must have a radial topology). Once the highest-interest features are extracted from the first image of a sequence, the raw matches, computed by imposing radiometric similarity, are used to define a large number of five-point groups between neighboring features, in order to select all mismatches or to correct them through the optimization process. In the reported results we can observe that the performance of the approach is independent of the planarity of the scene. In fact, to satisfy the cross-ratio constraint it is sufficient that nearby features be coplanar; it is not necessary that all extracted features be coplanar. We can compare the results obtained from a sequence of coplanar image features in fig. (1) with those obtained from the image sequence in fig. (2), where features are extracted on different planes: the algorithm recovers the correct matches in both sequences. Finally, the ability of the system to select all correct matches from the raw measurements obtained through correlation, without applying the optimization process, is shown in fig. (3).
5. Conclusions
In this paper we have proposed a new approach to solve the correspondence problem between sparse features of images acquired at different times or from two different points of view. The approach is based on the cross-ratio similarity between five coplanar points. The cross-ratio invariance constraint, computed on a large number of combinations of five image points, provides a useful means to identify the mismatches generated by radiometric measurements and, at the same time, to correct them through an optimization process.
REFERENCES
1. N. Ayache, "Artificial Vision for Mobile Robots", MIT Press, 1991.
2. N. Ayache and B. Faverjon, "Efficient Registration of Stereo Images by Matching Graph Descriptions of Edge Segments", The Int. Journal of Comp. Vision, 1(2):107-131, April 1987.
3. H. Chabbi, M.O. Berger, "Using Projective Geometry to Recover Planar Surfaces in Stereovision", Pattern Recognition, Vol. 29, No. 4, pp. 533-548, 1996.
4. S. Carlsson, "Projectively Invariant Decomposition and Recognition of Planar Shapes", International Journal of Computer Vision, Vol. 17, No. 2, pp. 193-209, 1996.
5. Y. Ohta and T. Kanade, "Stereo by Intra- and Inter-Scanline Search", IEEE Trans. on Pat. Anal. and Mach. Intell., 7, No. 2:139-154, 1985.
6. J.J. Lee, J.C. Shim, Y.H. Ha, "Stereo correspondence using the Hopfield Neural Network of a new energy function", Pattern Recognition, Vol. 27, No. 11, 1994.
Figure 1. (a) Start image of the sequence with extracted features over-imposed. (b) Optical flow estimated after matching with correlation. (c) Optical flow estimated after matching with our optimization technique.
Figure 2. (a) Start image of the sequence with extracted features over-imposed. (b) Optical flow estimated after matching with correlation. (c) Optical flow estimated after matching with our optimization technique.
7. S.J. Maybank, "Probabilistic Analysis of the Application of the Cross Ratio to Model Based Vision", International Journal of Computer Vision, Vol. 16, pp. 5-33, 1995.
8. H.P. Moravec, "The Stanford Cart and the CMU Rover", Proc. IEEE, 1983.
9. J.P. Pascual Starink, E. Backer, "Finding point correspondences using simulated annealing", Pattern Recognition, Vol. 28, No. 2, 1995.
10. L. Robert and O.D. Faugeras, "Curve-based Stereo: Figural Continuity and Curvature", in CVPR91, pp. 57-62.
11. D. Sinclair, A. Blake, "Qualitative Planar Region Detection", International Journal of Computer Vision, Vol. 18, No. 1, pp. 77-91, 1996.
12. C.A. Rothwell, A. Zisserman, D.A. Forsyth, J.L. Mundy, "Planar Object Recognition using Projective Shape Representation", International Journal of Computer Vision, Vol. 16, pp. 57-99, 1995.
Figure 3. (a) Start image of the sequence with extracted features over-imposed. (b) Second image of the sequence with features computed through correlation over-imposed. (c) Second image of the sequence with features corrected through optimization over-imposed. (d) Optical flow estimated after matching with correlation. (e) Matches selected from the flow in (d).
System identification for fuzzy controllers
G. Castellano, G. Attolico, T. D'Orazio, E. Stella, A. Distante
Istituto Elaborazione Segnali ed Immagini - C.N.R., Via Amendola 166/5, 70126 Bari, ITALY
attolico@iesi.ba.cnr.it
Several robotic applications can be accomplished through a direct mapping between perceptual situations and control commands. Fuzzy logic is a useful tool for realizing such a mapping: it allows either explicit programming or automatic learning of control rules from suitable training data. A fuzzy control system for wall-following has been developed, studying the problem of the automatic extraction of rules from training data. Using a machine learning technique we build a compact rule base by estimating the relevance of each input signal to the control decisions. The derived fuzzy rules have successfully driven a TRC Labmate inside an indoor environment.
1. INTRODUCTION
Some tasks involved in autonomous mobile vehicle navigation can be solved efficiently without using plan-based methods, which need internal representations to be built and updated. Obstacle detection and avoidance, wall-following and door-crossing are examples of low-level strategies that can be realized by a direct mapping between the sensory input and the output control spaces, thus avoiding the delay introduced by updating an environment model and enabling the use of simple and cheap sensors, such as the ultrasonic ring used in our experiments. Describing mathematically the mapping between perceptual situations and control commands may be neither easy nor desirable; therefore techniques for designing good approximations of the desired behaviors are preferred. Both neural networks ([1], [2]) and fuzzy controllers ([3], [4]) have proved to give good results as function approximators in robot navigation applications. The learning process for a neural network is generally slow, and the final knowledge is represented in a way that is difficult to evaluate, integrate and refine. The linguistic representation of fuzzy rules, instead, is easily understandable, allowing validation and correction at any time using information provided by human experts. The learning process for a fuzzy controller is quicker and scales gracefully with the size of the problem. The design of fuzzy systems requires the choice of input and output data, the definition of linguistic values (with the associated membership functions) for each fuzzy variable, and finally the derivation of rules from the available knowledge (human experts and real data). Making these choices automatically improves the autonomy of the system in learning the initial strategy and in tuning it during on-the-job runs in a dynamic environment.
In [5] and [6] the rules are obtained by iteratively dividing the input and output spaces into regions to which numerical input-output data are assigned. Machine learning approaches [7], neural networks [8] and genetic-based learning algorithms [9] have also been used to derive feasible sets of rules. In [10] we developed a rule construction algorithm to automatically build a fuzzy wall-follower. The resulting rule base is efficient but contains a large number of fuzzy rules, depending on the number and the granularity of the input variables. This paper addresses the problem of automatically building a compact rule base by estimating the relevance of each input signal to the control decisions. Using a well known machine learning technique, the number of produced fuzzy rules is drastically reduced with respect to [10] without weakening the controller. Experimental results will be shown using a fuzzy controller for the wall-following task.
2. THE FUZZY WALL-FOLLOWER
2.1. Notations and definitions
Let us consider a fuzzy system with n inputs x₁, …, x_n and a single output y. Each input linguistic variable A_k ∈ U_x, k = 1, …, n, is characterized by N_k linguistic terms A_{k1}, A_{k2}, …, A_{kN_k}. The output linguistic variable B ∈ U_o is characterized by M linguistic terms B₁, B₂, …, B_M. A fuzzy set is associated with a crisp representative value, which we define as the modal point for triangular or Gaussian shaped sets; trapezoidal sets are instead represented by the midpoint of the range of points having membership value 1.0. We denote by a_{ki} and b_j the representative values of the sets A_{ki} and B_j respectively. Let the rule that maps the i-th multivariate fuzzy input variable Aⁱ to the j-th univariate output set be labelled r_ij, i.e.:

r_ij: IF (x₁ is A₁ⁱ) AND … AND (x_n is A_nⁱ) THEN (y is Bʲ)

where A_kⁱ (respectively Bʲ) is the linguistic value of the input fuzzy variable A_k (respectively the output fuzzy variable B) in rule r_ij.
2.2. Fuzzification, Rule Inference and Defuzzification
We have adopted a nonsingleton fuzzifier, which is more adequate than singleton fuzzification when dealing with noisy data [11]. Our fuzzifier maps an input crisp value x into the fuzzy set A* for which μ_{A*}(x) > 0.5. A product-max inference has been applied: rule evaluation by the product operator retains more input information than the min operator and generally gives a smoother output surface [11], a desirable attribute for a controller. Given an input x = (x₁, x₂, …, x_n), the firing strength of a rule is
μ_{r_ij}(x) = μ_{Aⁱ}(x) = Π_{k=1}^{n} μ_{A_kⁱ}(x_k)   (1)
while the final maximum membership value for the output set B_j, j = 1 … M, after the inference of all rules is given by:

μ̄_{B_j}(y) = min( μ_{B_j}(y), max_i( μ_{r_ij}(x) ) )   (2)

where y ranges over the support values of the output membership functions.
The center of area defuzzification method has been adopted, since it yields a better performance of the controller than the mean of maxima method [11]. However, to reduce the computational cost of the method, we have considered the crisp representative value b_j of the set B_j instead of its centroid. Thus the crisp control value is obtained as:
y* = ( Σ_{j=1}^{M} μ̄_{B_j}(b_j) · b_j ) / ( Σ_{j=1}^{M} μ̄_{B_j}(b_j) )   (3)
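The inference and defuzzification chain of Eqs. (1)-(3) can be sketched as follows. Since μ_{B_j}(b_j) = 1 at the representative value for the set shapes used here, the min of Eq. (2) reduces to the maximum firing strength; the rule encoding is an assumption made for illustration:

```python
import numpy as np

def fuzzy_control(rules, mu_in, b):
    """Product-max inference (Eqs. 1-2) and representative-value
    defuzzification (Eq. 3). rules: list of (labels, j) pairs, where
    labels[k] names the fuzzy set of input k in the rule antecedent;
    mu_in[k] maps label -> membership degree of the crisp input x_k;
    b: representative values b_j of the M output sets."""
    mu_out = np.zeros(len(b))
    for labels, j in rules:
        # Firing strength: product of input membership degrees (Eq. 1)
        strength = np.prod([mu_in[k][lab] for k, lab in enumerate(labels)])
        mu_out[j] = max(mu_out[j], strength)   # max over rules with output B_j
    # Crisp output: weighted average of representative values (Eq. 3)
    return float(np.dot(mu_out, b) / mu_out.sum())
```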
2.3. Rule Construction
Following the machine learning approach of [7], we have developed a rule construction method which applies the ID3 algorithm to build a compact rule base by selecting only the most relevant input variables. The rule construction algorithm builds a decision tree by recursively taking as the root of each subtree the input variable with the greatest information content and the least homogeneous branches (in terms of the values of that variable in the training set). Rules are obtained by traversing all possible branches of the tree from the main root to the leaf nodes, which represent the output fuzzy values. At each level of the decision tree we define:

n_t : the total number of training samples at that point of the tree;
n_{B_j} : the number of the n_t samples with y in B_j;
n_{A_ki} : the number of the n_t samples with x_k in A_ki;
n_{A_ki B_j} : the number of the n_t samples with x_k in A_ki and y in B_j.

In order to evaluate the importance of each input variable, Quinlan's Gain Ratio has been adopted as the information measure. For an input variable x_k this is defined as:

GR(x_k) = ( INF(y) − M_k ) / INF(x_k)

where

M_k = Σ_i (n_{A_ki}/n_t) ( −Σ_j (n_{A_ki B_j}/n_{B_j}) log₂ (n_{A_ki B_j}/n_{B_j}) )

is the information content if x_k is selected as the root of the current subtree, and

INF(x_k) = −Σ_{i=1}^{N_k} (n_{A_ki}/n_t) log₂ (n_{A_ki}/n_t),    INF(y) = −Σ_{j=1}^{M} (n_{B_j}/n_t) log₂ (n_{B_j}/n_t)

are the total information contents of the input variable x_k and the output variable y respectively. In addition, in order to avoid useless detail in the decision tree (which normally produces about 100 rules), at each step we create a subtree only if it produces a relevant reduction of the error rate, that is if
Σ_{i=1}^{N_k} e_{A_ki} ≤ e_t − s

where

e_{A_ki} = (n_{A_ki} − n_{A_ki B_max} + 0.5) / n_t,    e_t = (n_t − n_{B_max} + 0.5) / n_t,    s = 4 / √n_t,

with n_{A_ki B_max} = max_{j=1,…,M} n_{A_ki B_j} and n_{B_max} = max_{j=1,…,M} n_{B_j}.
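A sketch of the gain-ratio computation on label-encoded training data; it uses the textbook form of Quinlan's measure, with each branch entropy weighted by n_{A_ki}/n_t, which may differ in detail from the M_k above:

```python
import numpy as np

def gain_ratio(x_lab, y_lab):
    """Quinlan's gain ratio for one input; x_lab and y_lab are numpy
    arrays holding the fuzzy-set index of each training sample."""
    def entropy(lab):
        _, c = np.unique(lab, return_counts=True)
        p = c / c.sum()
        return -np.sum(p * np.log2(p))
    n_t = len(y_lab)
    # Class entropy remaining after splitting on this input (M_k)
    m_k = sum((np.sum(x_lab == v) / n_t) * entropy(y_lab[x_lab == v])
              for v in np.unique(x_lab))
    return (entropy(y_lab) - m_k) / entropy(x_lab)
```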
3. Experimental Results
The Sistema AUtonomo RObotizzato (AUtonomous RObotic System) SAURO, an autonomous vehicle oriented to transportation tasks in indoor environments (fig. 1), has been used for collecting the training data and for testing the fuzzy wall-follower. SAURO is a LABMATE mobile base provided with a VME bus processing system and a ring of 18 ultrasonic sensors.
Figure 1. The mobile robot SAURO.
Figure 2. Arrangement of ultrasonic sensors and sensor suits.
Inputs to the fuzzy controller are the ultrasonic sensor measures, grouped into suits according to the required spatial resolution, each suit providing a single value to the control system (fig. 2). The number of linguistic labels associated with each fuzzy variable also depends on the position (and relevance for the wall-following task) of the corresponding suit (fig. 3). The motion control of the mobile robot is realized by setting its steering velocity ω (fig. 4); for simplicity a constant forward speed has been assumed. Training data (sensory input and the corresponding control output) have been collected during navigation sessions in which the vehicle was driven by a human operator along a wall on its right-hand side. Fig. 5 shows the training environments and the corresponding trajectories of SAURO. The rule construction method used in [10] derived about 170 fuzzy rules, with a visible degree of redundancy. With the application of the ID3 method, only 15 rules have been extracted, without decreasing the performance of the controller. Each rule does not
Figure 3. Membership functions of the input variables (a) LeftBack, LeftFront, Front and (b) RightFront and RightBack.
Figure 4. Membership functions of the output variable.
necessarily use all 5 input values, thereby exploiting the real relevance of each input to the control commands. The final number of rules is comparable with the size of hand-written fuzzy rule bases for similar tasks. The compact controller has successfully driven SAURO along unknown configurations of walls in both simple (fig. 6) and complex (fig. 7) environmental situations. It can be noted that the robot is also able to avoid unexpected obstacles by correctly changing its trajectory while still following the wall.
Figure 5. Situations used for collecting the training data.
4. CONCLUSIONS
Fuzzy navigation controllers can be an effective solution for implementing navigation behaviors without the internal representations of the environment, so hard to acquire and update, that conventional plan-based techniques require. Automatic learning and continuous adaptation of the control strategy from representative real data can produce fuzzy rules that experts can then evaluate and tune with their skills. A first automatic derivation of the fuzzy rule base produced redundant rule bases. By estimating the relationship between input and output data, we have built a fuzzy wall-follower with a reduced number of rules. Simplifying the fuzzy controller is especially important in view of extending the control system to the complete set of behaviors required for safe navigation in indoor environments.
Figure 6. SAURO's trajectory in a simple environment.
Figure 7. SAURO's trajectory in a complex environment cluttered with obstacles.
REFERENCES
1. D.A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3, 1991.
2. H. Meng and P.D. Pincton. Neural network for local guidance of mobile robots. In Proc. of Inter. Conference on Automation, Robotics and Computer Vision, pages 1238-1242, Singapore, November 1994.
3. K.T. Song and J.C. Tai. Fuzzy navigation of a mobile robot. In Proc. of IEEE/RSJ Inter. Conference on Intelligent Robots and Systems, volume 1, pages 621-627, Raleigh, NC, July 1992.
4. W. Li. Fuzzy logic based robot navigation in uncertain environments by multisensor integration. In Proc. of the 1994 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI '94), pages 259-265, Las Vegas, NV, October 1994.
5. S. Abe and M. Lan. Fuzzy rules extraction directly from numerical data for function approximation. IEEE Transactions on Systems, Man and Cybernetics, 25(1):119-129, January 1995.
6. L.X. Wang and J.M. Mendel. Generating fuzzy rules by learning from examples. IEEE Transactions on Systems, Man and Cybernetics, 22(6):1414-1427, November 1992.
7. J.Y. Hsu, S.C. Hsu and I.J. Chiang. Automatic generation of fuzzy control rules by machine learning methods. IEEE Proc. of Int. Conference on Robotics and Automation, pages 287-292, 1992.
8. Y. Lin and G.A. Cunningham III. A new approach to fuzzy-neural system modeling. IEEE Transactions on Fuzzy Systems, 3(2):190-197, May 1995.
9. A. Homaifar and Ed McCormick. Simultaneous design of membership functions and rule sets for fuzzy controllers using genetic algorithms. IEEE Transactions on Fuzzy Systems, 3(2):129-138, May 1995.
10. G. Castellano, G. Attolico, E. Stella, and A. Distante. Learning the rule base for a fuzzy controller. In 4th IEEE Mediterranean Symposium on Control and Automation (MSCA '96), Crete, Greece, June 1996.
11. M. Brown and C. Harris. Neurofuzzy Adaptive Modelling and Control. Prentice Hall, 1994.
C COMPUTER VISION
Computer Vision for Autonomous Navigation: from Research to Applications
G. Garibotto, P. Bassino, M. Ilic, S. Masciangelo
Elsag Bailey - TELEROBOT, via Hermada 6, Genova, Italy
The results described in this paper represent the conclusion of a cycle of research carried out over the last few years in the field of Artificial Vision applied to mobile robotics. A first prototype system was tested in the field of service robotics, including light material transportation (mail, documents, medicines, clinical data) and museum guide services. This technology has recently been applied to the driving control of standard transportation vehicles for palletised goods and materials. A prototype system named ROBOLIFT has been developed and is presently in the stage of industrial exploitation. The reported results are a concrete demonstration of technology transfer from basic research to the application domain, and of the level of maturity of Computer Vision technology for industrial use in Service Robotics.
1. INTRODUCTION TO ROBOTIC VISION
In this section we briefly review the main steps of our research effort, starting at the beginning of the '80s with a strong industrial investment in robotics applications. At that time the driving force of the research effort was the development and integration of prototype Vision systems for a manipulating robot performing flexible assembly operations [1]. Besides model-based image recognition and object positioning and orientation, the main contribution was the effective integration of the system in a fully autonomous robotic assembly cell. The achieved performance was satisfactory in terms of flexibility and processing speed, but the high costs of the parallel implementation on the proprietary EMMA2 architecture [1] and the lack of standardisation prevented a wider industrial exploitation of these results. In the mid-'80s an international project was established to investigate 3D stereovision and motion analysis more deeply, and to realise a special hardware machine performing such image processing functions almost at video rate. The main result of this European project (ESPRIT P940) [2] was a multi-DSP parallel processing architecture based on a VME bus, which allowed 3D stereo reconstruction at 5 Hz, using a trinocular (three-camera) stereo arrangement, and tracking of linear segment features at a rate of 10 Hz. The limited market size, as well as the not yet proven reliability and robustness of the on-going research, did not allow the consolidation of the system into an industrial product. On the other hand, the P940 machine has represented for many years (since the end of the project in 1992) a strong competitive advantage for the European industry in Computer Vision, and a very powerful advanced research environment for real-time experiments in the field of Computer Vision [3]. We have successfully applied this technology in different contexts (quality inspection and control), and
in robotic metrology, using camera calibration [4] for the 3D reconstruction of surface patches of object models. Anyway, one of the most challenging problems for Computer Vision has been clearly identified, since the end of the '80s, in the development of intelligent sensors for autonomous navigation control. This is the context where almost all features of Vision research can be fully exploited, in terms of adaptability, dynamic response, visual servoing, learning and understanding of the environment, perception of global features (self-orientation) and local features (obstacle detection) [5]. From 1987 to 1992 our team participated in an international project, ESPRIT P2502 [6], aimed at developing vision technologies for mobile robotics, together with the most qualified European research centres in the field. The final demonstration in Genova was a combination of interactive teleguidance and stereo-based obstacle detection, using off-board processing with a special hardware workstation. Moreover, a strong experience was gained in monocular vision with the development of perspective inversion tools and geometric reasoning by 3D model based techniques. In 1993, to demonstrate the maturity of vision-based navigation using on-board low-cost PC-based processing hardware, a fully integrated system, SAM (Autonomous Mobile System), was realised to address a wide class of autonomous navigation and transport tasks in an indoor environment, in the presence of people. Section 2 gives a brief description of this first prototype system. More recently, at the beginning of 1994, an industrially oriented project was started, to put Vision technology into practice and achieve competitive results both in terms of performance and costs. The goal of this project was the automation of an existing, conventional fork-lift carrier using an intelligent on-board control system driven by Computer Vision. The reference application was the transportation of self-supporting palletised goods in a warehouse. Section 3 briefly recalls the Vision techniques which have been developed and used in this project, as well as the current experimental results of the engineered version of RoboLift. A more detailed description of the system can be found in [7].

2. DESCRIPTION OF THE MOBILE ROBOT SAM
The logic architecture of SAM was implemented as a series of almost independent layers of competencies, each one in charge of a single, well defined task, such as obstacle avoidance and global position maintenance. The obstacle avoidance strategy is reflexive, that is, the trajectory is heuristically determined on the basis of sensor readings rather than accurately planned starting from a reconstructed local environmental map. The suboptimality of the obtained trajectory is largely compensated by the fast response time, which allows the robot to navigate safely at an acceptable speed also in cluttered environments. The hardware solution is based on a PC platform as the main computational infrastructure, to reduce costs, minimise the development time and take advantage of the wide choice among a great variety of add-on boards which can be integrated to improve the system functionality. The navigation system needs a periodic position and orientation estimate coming from an external sensor in order to reset the drifts of odometry.
This is provided through a Vision system able to detect and recognise navigation landmarks placed in known positions along the robot routes [8], and to recover the robot position and orientation with respect to them.
Figure 1: The mobile robot SAM
Figure 2: The artificial landmark
The selected artificial landmark consists of a black annulus on a white background, as depicted in Fig.2. The 3D position and attitude of the camera with respect to the landmark reference system is obtained from a single image through model based perspective inversion.
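To make the landmark-based position reset concrete, here is a minimal Python sketch (not from the paper) of how a robot pose can be recovered once perspective inversion has returned the camera-to-landmark transform. The planar-transform helper, the function names and the numbers are illustrative assumptions; the paper's perspective inversion of the annulus is not reproduced here.

```python
import numpy as np

def se2(x, z, theta):
    """Planar rigid transform; the robot moves in the x-z ground plane."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s,  c, z],
                     [0,  0, 1]])

def robot_pose_from_landmark(T_world_landmark, T_robot_landmark):
    """Compose the known landmark pose with the measured robot-to-landmark
    transform: T_world_robot = T_world_landmark @ inv(T_robot_landmark)."""
    return T_world_landmark @ np.linalg.inv(T_robot_landmark)

# Hypothetical numbers: landmark at (5 m, 2 m) rotated 90 degrees; the
# vision system reports the landmark 1.5 m straight ahead of the robot.
T_wl = se2(5.0, 2.0, np.pi / 2)
T_rl = se2(1.5, 0.0, 0.0)
T_wr = robot_pose_from_landmark(T_wl, T_rl)
x, z = T_wr[0, 2], T_wr[1, 2]
heading = np.arctan2(T_wr[1, 0], T_wr[0, 0])   # used to reset odometric drift
```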
2.1. Summary of experimental results
Besides extensive laboratory experiments, the system SAM has been tested for two years in our office environment, for document transport and guest accompanying services, in normal operating conditions with a lot of people wandering around. Later, the robot SAM was equipped with a Soundblaster board for sound generation and a radio modem providing a link to a host computer at a remote control station. The remote computer is able to select the most appropriate sound or voice file according to the robot position or the navigation status (presence of obstacles, landmark search, and so on) as communicated by the robot navigation system. The robot in such a configuration was installed in a historical building during the Christmas '93 exhibitions, as reported in [9]. The results have been very encouraging in terms of performance and good acceptance by the people who visited the museum during the exhibition days.

3. ROBOLIFT: AUTONOMOUS FORK-LIFT IN LOGISTIC SERVICES
The problem of autonomous transport of material in workshops and warehouses has traditionally been approached through the use of automated guided vehicles (AGV). They were introduced into the market in the early '70s and have provided significant improvements in terms of efficiency, precision and co-ordination of material flow in manufacturing, as compared to conveyor belts, single track railways, etc. The main drawbacks of the consolidated AGV technology [10] come from the heavy installation requirements (inductive guides buried into the floor), the need of continuous central control, the rigidity of fixed navigation pathways, and the requirement to design specialised machines for each different application. Moreover there is a severe limitation in flexibility, and
the position of all the palletised loads in the warehouse is supposed to be known in advance with high precision (within the range of 10 mm). Our answer is RoboLift, the first Autonomous Fork Lift, developed jointly by Elsag Bailey Telerobot and Fiat OM Carrelli Elevatori SpA (patent pending). This system is based on Vision technology both for autonomous navigation and for the recognition of the pallets to be transported.

3.1. Main characteristics of the vehicle
The selected basic vehicle is the classical frontal fork-lift carrier (from the well known EU family of Fiat OM Carrelli Elevatori SpA, operating in the range of 1.2 to 1.5 ton), one of the most commonly used in the market. The kinematics of the vehicle is based on three wheels (two driving and one steering). A schematic drawing of the vehicle and the list of sensors which have been introduced for autonomous control are shown in figure 3.
Figure 3. Sensor arrangement in ROBOLIFT
Fig. 3. Model based vision and 3D recognition and positioning of the pallet
3.2. Computer Vision for Autonomous Navigation
Vision processing is performed primarily to support autonomous navigation. Through the recognition and 3D location of some artificial landmarks (H-shaped) placed on the floor along the planned navigation path, the Vision system is able to self-localise the vehicle in the scene, by integrating this information with the odometric values coming from sensors on the wheels (both drive and steering).
3D model based vision is used to identify and recognise the H-shaped landmarks placed on the floor, by exploiting all the a priori information available to simplify the image processing analysis. To avoid errors and ambiguities caused by other features in the scene or noise effects, geometric reasoning is performed directly in 3D, by reprojecting all features onto the 3D floor. Using extended geometric features, this process has proved to be extremely robust also when the landmark is partially occluded or damaged by stains. Computer Vision is performed on-line, during the motion of the vehicle passing over these landmarks along the navigation path. The success of Artificial Vision is strongly related to the accuracy of camera calibration, computed with respect to the odometry of the vehicle, to obtain a homogeneous data representation suitable for the navigation commands (steering and driving).

3.3. Computer Vision for the recognition of the pallet pose
A second fundamental vision function implemented in ROBOLIFT is pallet detection and recognition. It is performed by a camera placed within the forks and rigidly connected to them, so that it can move up and down, searching for the different positions of the palletised load. A model-based Vision algorithm has been implemented to search for the central cavities of the pallets and compute the size and shape of these holes. A prediction-verification paradigm has been implemented: it consists of projecting onto the image the geometry of the pallet model, from the expected position in the 3D world. An adaptive estimation of the contrast is performed at the expected hole position in the image, followed by a controlled region growing process aimed at propagating this grey level up to the border of the hole, with a constraint on the expected size and shape. A schematic example is shown in figure 3. Once the holes are correctly identified and localised, the current 3D position of the pallet is computed. If this new position is within the tolerance bounds to be carried by the fork-lift, the forks are properly shifted left or right by the appropriate amount to take the load in a centred position. The project has been developed by taking as a reference the standard Europallet of size 1200 × 800 mm, with loading side 1200. The Computer Vision system has proved able to recognise the presence of the pallet in a wide range of operating conditions, from intense sunlight to artificial light and shadows, and to compute the current distance and orientation of the pallet with respect to the vehicle.
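As a rough illustration of the prediction-verification idea described above, the following Python sketch grows a region of similar grey level from the predicted hole position and rejects candidates that grow past the expected size. The tolerance, the 4-connectivity choice and all names are assumptions, not the authors' implementation; the shape constraint mentioned in the text would be a further check on the returned region.

```python
import numpy as np
from collections import deque

def verify_hole(img, seed, max_size, tol=15):
    """Grow a region of similar grey level from the predicted hole pixel;
    reject if the region exceeds the model-predicted hole size."""
    h, w = img.shape
    ref = float(img[seed])                  # adaptive contrast estimate at the seed
    seen = np.zeros_like(img, dtype=bool)
    seen[seed] = True
    queue, region = deque([seed]), [seed]
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and not seen[rr, cc] \
               and abs(float(img[rr, cc]) - ref) < tol:
                seen[rr, cc] = True
                region.append((rr, cc))
                queue.append((rr, cc))
        if len(region) > max_size:          # grew past the expected hole size
            return None
    return region   # candidate hole pixels, still to be checked for shape
```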
4. RESULTS AND STATUS OF THE PROJECT
The ROBOLIFT project started in 1994, with the definition of the basic architecture and the selection of the necessary modifications implemented on a first prototype. During 1995 a first laboratory prototype was integrated and a first release of the control software was made available in July '95, followed by an extensive experimentation phase in a suitably equipped warehouse environment. A second engineered prototype was integrated at the beginning of '96, and it was officially presented at the Hannover Fair in April '96. Further engineering work is in progress to improve the system performance, its robustness and reliability, as well as the level of integration with the application field.
5. CONCLUSIONS
One of the most important objectives of advanced research institutions, including EEC programmes, is the promotion of industrial exploitation of research results, with particular attention to A.I. technologies which received significant research funds in the last few years. The paper describes our recent experience in exploiting Computer Vision technology in transport service automation, following the necessary fundamental steps of basic research development, laboratory prototype implementation, and the final acquisition of a strong integration knowledge and expertise. The achieved result consists in the integration of an autonomous fork-lift carrier, which can also be driven in the conventional way by a human operator. The system makes use of model based passive vision techniques, without any external active lighting support. Computer Vision represents the main sensory component for both autonomous navigation and pallet recognition. The possibility to use a standard PC-based multiprocessing architecture allows the implementation of a competitive industrial system. The extensive experimental results collected during many hours of tests demonstrate a high maturity of Vision technology in advanced mobile robotics, ready to come out of the research labs and be used as an established and accepted technology in Industrial Automation.

REFERENCES
1. L. Borghesi, et al., "A Modular Architecture for a flexible real-time Robot Vision System", Proc. of the Int. Conference on Digital Signal Processing, Firenze, Sept. 1987.
2. G. Garibotto, S. Masciangelo, "Depth and Motion Analysis P940: development of a real-time Computer Vision System", ESPRIT Workshop at ECCV'92, S. Margherita, May 1992.
3. O. Faugeras, "Three-Dimensional Computer Vision: a geometric viewpoint", The MIT Press, 1993.
4. E. Bruzzone, F. Mangili, "Calibration of a CCD Camera on a Hybrid Coordinate Measuring Machine for Industrial Metrology", International Symposium on Industrial Vision Metrology, Winnipeg, Manitoba, Canada, July 1991.
5. G. Garibotto, S. Masciangelo, "3D Computer Vision for Navigation/Control of Mobile Robots", in Machine Perception, AGARD Lecture Series, 185, 1992.
6. B. Buxton, et al., "The Transfer of Vision Research to Vehicle Systems and Demonstrations", Proc. of the ESPRIT Conference, 1991, Brussels.
7. G. Garibotto, "ROBOLIFT: Vision-guided Autonomous fork-lift", Service Robot, An International Journal, Vol. 2, n. 3, 1996, pp. 31-36.
8. G. Garibotto, M. Ilic, S. Masciangelo, "An Autonomous Mobile Robot Prototype for Navigation in Indoor Environments", Proc. of the Int. Symposium on Intelligent Robotic Systems '94, Grenoble (France), July 1994.
9. G. Garibotto, S. Masciangelo, M. Ilic, "Vision Based Navigation in Service Robotics", pp. 313-318, Image Analysis and Processing, Lecture Notes in Computer Science, Springer, 1995.
10. Warnecke, C. Schaeffer, J. Luz, "A driverless and free-ranging fork-lift carrier", Proc. of the 24th ISIR, session E1, Nov. 1993.
An Optimal Estimator of Camera Motion by a Non-Stationary Image Model
G. Giunta and U. Mascia
INFO-COM Department, University of Rome "La Sapienza", Rome, Italy
e-mails: [email protected], [email protected]

Camera zooming can be regarded as a 2D Doppler effect. Techniques for Doppler estimation from 1D signals, based on data partition and linear regression on a set of time-delay measurements, were presented in the literature. This basic idea is here extended to fine motion estimation. The devised algorithms, estimating four global motion parameters (viz: horizontal and vertical translation, zooming, and rotation), are based on a non-stationary model. They have been validated by both synthetic and experimental tests.

1. INTRODUCTION
The analysis and estimation of motion are very important in time-varying image processing. Many algorithms have been developed to estimate 2D motion for different applications [1]-[12], such as object tracking, image segmentation, environment sensing for autonomous vehicle navigation, image sequence coding, object-oriented analysis and synthesis coding, TV standard conversion, frame rate conversion, bandwidth compression for HDTV, very low bit-rate video coding for audiovisual services, 2D motion parameter acquisition from image sequences, camera motion estimation and compensation, etc. Fast discrete-time techniques for time-delay estimation with sub-sample accuracy, based on a parabolic interpolation of estimated cross-correlation samples, were devised and analysed for random 1D signals [13]. Camera zooming (or radial motion) causes an isotropic change of scale in the whole image, which can be regarded as a 2D Doppler effect on the magnitude of the spatial polar coordinates. Moreover, any rotation can be modeled as a 2D Doppler effect on the phase of the spatial polar coordinates. Techniques for Doppler estimation of 1D signals were proposed in the literature [14]-[16]. Among them, an indirect estimation method based on data partition and linear regression on a set of time-delay measurements (linearly related to the actual Doppler coefficient) was devised [17]. This basic idea is here extended to a 2D fast estimator of spatial position. Motion compensation brings about a saving in bit-rate, due to the smaller prediction error as well as to the reduction in motion-vector coding. A fine estimation with sub-pixel accuracy can take more information into account, giving better results in the prediction of the picture. Stationarity is widely assumed in image modeling for the sake of simplicity, while it is well known that this assumption is far from reality. A motion estimation procedure with sub-pixel accuracy is presented here. It is based on a non-stationary model depending on local properties. This approach can be usefully employed to reach our aim, by extending the method devised in [17], for estimating the four global motion parameters (viz: horizontal and vertical translation, zoom, and rotation). In particular, a 2D paraboloid is used for interpolating the inter-pixel cross-correlation estimates. This method can be applied to 2D Doppler estimation, after a block partition of the whole image, by a linear regression in the complex domain of the spatial displacements. Such measurements can also be weighted according to a proper error function, derived from a confidence measure, accounting for the local statistics of each block. Such a non-stationary method is then based on the minimization of the weighted mean square error.
2. IMAGE MODEL AND MOTION ESTIMATOR
2.1. Time-varying image model
Let z = x + jy be the Cartesian coordinates, expressed in the complex domain. Let us consider a pair of sequential image frames, say R and P. Let us assume the following model of instantaneous motion:

R(x,y) = R(z) = S(z) \qquad (1) \quad (reference picture)
P(x,y) = P(z) = S[(z - \delta)/\alpha] + E(z) \qquad (2) \quad (moved picture)

where E(z) is the model error image, which also accounts for the two noises. In particular, Re{\delta} and Im{\delta} represent the horizontal and vertical displacements, while the term \alpha = \rho \exp[j\theta] accounts for both the zoom factor (\rho) and the rotation angle (\theta). It is interesting to point out that any rotation can be directly included in the MSE estimation in the complex domain: in fact, a complex change of scale takes zoom as well as rotation into account (i.e. the modulus represents zoom, and the phase represents the angle of rotation).
2.2. Displacement estimation
Our method performs a fine (sub-pixel) estimation by means of a fast digital algorithm. The whole picture is divided into small blocks and the relative position displacements are extracted by a conventional matching algorithm based on cross-correlation measurements. The estimated displacements are linearly related in the complex domain and the four parameters can be derived by performing a complex linear regression. The relationships so obtained can be weighted according to their accuracy, which depends on the contents of each block of data. In particular, we divide the whole reference image into small blocks. For each block, we search for the best matching block in the moved picture (as accomplished in several widely used standards), by evaluating the magnitude of the displaced frame difference, i.e.:

\mathrm{MDFD}(\tau) = \sum_k | R(z_k) - P(z_k + \tau) | \qquad (3)

where the sum extends over all the pixels of the considered block. We then estimate the linear motion between the considered pair of blocks as the displacement \hat\mu that minimizes the magnitude of the displaced frame difference, i.e.:

\hat\mu = \arg\min_\tau \mathrm{MDFD}(\tau) \qquad (4)

by performing a sub-pixel parabolic interpolation of the square of the estimated MDFD:

\mathrm{MDFD}^2(\tau) = \mathrm{MDFD}^2(\xi,\eta) \cong a\xi^2 + b\eta^2 + c\xi\eta + d\xi + e\eta + f \qquad (5)

with \tau = \xi + j\eta. The displacement \hat\mu is determined as the minimum argument of the parabolic function, fitted by six samples of the squared MDFD chosen around its coarse minimum.
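A minimal Python sketch of this block-matching step follows. It simplifies the paper's six-sample 2-D paraboloid fit of eq. (5) to a separable 1-D parabolic refinement per axis; all names and the search range are assumptions, and the search window is assumed to stay inside the moved frame.

```python
import numpy as np

def mdfd(ref_block, moved, top, left, dy, dx):
    """Magnitude of displaced frame difference (eq. 3) for integer shift (dy, dx)."""
    h, w = ref_block.shape
    cand = moved[top + dy : top + dy + h, left + dx : left + dx + w]
    return np.abs(ref_block.astype(float) - cand.astype(float)).sum()

def block_displacement(ref_block, moved, top, left, search=4):
    """Integer-pel search (eq. 4) followed by sub-pel parabolic refinement
    of the squared MDFD around its coarse minimum (cf. eq. 5)."""
    best, by, bx = np.inf, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            v = mdfd(ref_block, moved, top, left, dy, dx)
            if v < best:
                best, by, bx = v, dy, dx

    def refine(axis):
        # Fit a parabola through three squared-MDFD samples on one axis
        d0 = mdfd(ref_block, moved, top, left, by, bx) ** 2
        if axis == 0:
            dm = mdfd(ref_block, moved, top, left, by - 1, bx) ** 2
            dp = mdfd(ref_block, moved, top, left, by + 1, bx) ** 2
        else:
            dm = mdfd(ref_block, moved, top, left, by, bx - 1) ** 2
            dp = mdfd(ref_block, moved, top, left, by, bx + 1) ** 2
        den = dm - 2 * d0 + dp
        return 0.0 if den == 0 else 0.5 * (dm - dp) / den  # parabola vertex

    # Complex displacement, following the paper's z = x + jy convention
    return (bx + refine(1)) + 1j * (by + refine(0))
```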
2.3. MSE global motion estimation
Let z_i be the center of the i-th reference block. We may write a (usually overdetermined) set of linear equations in the complex domain:

\alpha z_i + \delta = \hat\mu_i \qquad (6)

In fact, there are 2 complex unknowns (\delta and \alpha) accounting for 4 real parameters (viz: horizontal and vertical displacement, zoom factor and rotation), while the number of equations is equal to the number of considered blocks in the whole picture.

If we choose the origin of the coordinates at the exact center of the reference frame and we take a symmetric arrangement of N blocks into account, a standard pseudo-inverse solution, based on the mean square error (MSE), can be employed:

\hat\delta = \frac{1}{N} \sum_{i=1}^{N} \hat\mu_i \qquad (7)

\hat\alpha = \frac{\sum_{i=1}^{N} z_i^* \hat\mu_i}{\sum_{i=1}^{N} z_i^* z_i} \qquad (8)
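Assuming the block centres and matched positions are held as complex numpy arrays, eqs. (7)-(8) reduce to a few lines; this is a sketch with assumed names, not the authors' code, and it relies on the symmetric block layout stated above (sum of the z_i equal to zero).

```python
import numpy as np

def global_motion_mse(z, mu):
    """Least-squares fit of alpha*z_i + delta = mu_i in the complex plane
    (eq. 6).  z: block centres, mu: matched block positions, both complex."""
    delta = mu.mean()                          # eq. 7 (valid when sum(z) == 0)
    alpha = np.vdot(z, mu) / np.vdot(z, z)     # eq. 8: sum(z* mu) / sum(z* z)
    zoom, angle = abs(alpha), np.angle(alpha)  # modulus = zoom, phase = rotation
    return delta, alpha, zoom, angle
```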
2.4. WMSE global motion estimation
If we have no particular symmetries or if we wish to take the available equations into account with different weights, a weighted mean square error (WMSE) can be defined:

\hat\delta = \Delta^{-1} \left\{ \left[\sum_{i=1}^{N} w_i^2 z_i^* z_i\right] \left[\sum_{i=1}^{N} w_i^2 \hat\mu_i\right] - \left[\sum_{i=1}^{N} w_i^2 z_i\right] \left[\sum_{i=1}^{N} w_i^2 z_i^* \hat\mu_i\right] \right\} \qquad (9)

\hat\alpha = \Delta^{-1} \left\{ \left[\sum_{i=1}^{N} w_i^2\right] \left[\sum_{i=1}^{N} w_i^2 z_i^* \hat\mu_i\right] - \left[\sum_{i=1}^{N} w_i^2 z_i^*\right] \left[\sum_{i=1}^{N} w_i^2 \hat\mu_i\right] \right\} \qquad (10)

with

\Delta = \left[\sum_{i=1}^{N} w_i^2 z_i^* z_i\right] \left[\sum_{i=1}^{N} w_i^2\right] - \left[\sum_{i=1}^{N} w_i^2 z_i^*\right] \left[\sum_{i=1}^{N} w_i^2 z_i\right]

In fact, while the MSE criterion minimizes (6), the solutions (9)-(10) minimize the set of equations:

w_i \left[ \alpha z_i + \delta \right] = w_i \hat\mu_i \qquad (11)

We have employed a parabolic approximation of the squared MDFD (5) near the minimum in the displacement estimation. The dependence of time-delay estimation performance on such a parameter is well known (its second derivative provides the asymptotic error variance). The curvature of the same squared MDFD function is the simplest local confidence measure that we may use. In particular, we have employed the curvature of the estimated squared MDFD (depending on the cross-correlation function) along the direction of the local motion \hat\mu_i estimated for each block, i.e.:

w_i^2 = \left. \frac{\partial^2 \mathrm{MDFD}^2(\gamma_i)}{\partial \lambda_i^2} \right|_{\gamma_i = \hat\mu_i} \qquad (12)

and the normalized MDFD (depending on the cross-correlation coefficient), divided by the variances \sigma_R^2 and \sigma_P^2 of the reference and the moved blocks, i.e.:

w_i^2 = \frac{1}{\sigma_R \sigma_P} \left. \frac{\partial^2 \mathrm{MDFD}^2(\gamma_i)}{\partial \lambda_i^2} \right|_{\gamma_i = \hat\mu_i} \qquad (13)

with \gamma_i = \lambda_i \exp[j\varepsilon_i], evaluated along the same direction (i.e. \varepsilon_i = \mathrm{angle}\{\hat\mu_i\}).
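The weighted solution (9)-(10) is equally direct in code; again a sketch with assumed names, where w2 holds the squared weights computed from eq. (12) or (13).

```python
import numpy as np

def global_motion_wmse(z, mu, w2):
    """Weighted least-squares solution of w_i(alpha*z_i + delta) = w_i*mu_i
    (eq. 11), i.e. the closed forms of eqs. 9-10."""
    s_w  = w2.sum()
    s_z  = (w2 * z).sum()
    s_zz = (w2 * np.conj(z) * z).sum()
    s_m  = (w2 * mu).sum()
    s_zm = (w2 * np.conj(z) * mu).sum()
    det  = s_zz * s_w - np.conj(s_z) * s_z            # the paper's Delta
    delta = (s_zz * s_m - s_z * s_zm) / det           # eq. 9
    alpha = (s_w * s_zm - np.conj(s_z) * s_m) / det   # eq. 10
    return delta, alpha
```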
3. SYNTHETIC AND EXPERIMENTAL RESULTS
3.1. Synthetic tests with known global motion parameters
A number of standard still images (viz: "Airplane", "Barbie", "Baboon", "Boats", and "Gold") have been considered. The test images were reduced by a factor of 4 from the original size of 512 × 512 pixels, becoming 128 × 128 pixels. Each image was compared to a copy of itself, deformed by a known set of global motion parameters. The parameter values were randomly chosen, each one with a uniform probability density function, in the range [-0.5, 0.5] pixels for the horizontal and vertical displacements, in [0.95, 1.05] for the zoom factor, and in [-3, 3] degrees for the rotation angle. Each image was tested 500 times, and a total of 2500 test images was then obtained. The motion vectors were estimated from 8 × 8 pixel blocks. Three different weights were used: NOW = no weight; DCCF = directional convexity of the cross-correlation function; DCCC = directional convexity of the cross-correlation coefficient. The numerical results on the accuracy of the four estimates are reported in Tables 1-4 for the cases of few (16) and many (195) blocks used in the performed tests.

Table 1. Horizontal displacement (pels)
                   few (16) blocks                   many (195) blocks
            Bias·10   Var·10²   MSE·10²       Bias·10   Var·10²   MSE·10²
NOW           0.13      1.46      1.48          0.04      0.17      0.17
DCCF         -0.09      0.27      0.27          0.01      0.33      0.33
DCCC         -0.12      0.29      0.30          0.02      0.30      0.30

Table 2. Vertical displacement (pels)
            Bias·10   Var·10²   MSE·10²       Bias·10   Var·10²   MSE·10²
NOW           0.63      1.78      2.17          0.08      0.15      0.16
DCCF          0.14      0.65      0.67          0.04      0.14      0.14
DCCC                                            0.04      0.12      0.12

Table 3. Zoom factor
            Bias·10³  Var·10⁶   MSE·10⁶       Bias·10³  Var·10⁶   MSE·10⁶
NOW           4.88      68.1      91.9          0.11      1.02      1.03
DCCF          1.37      21.0      22.9          0.17      1.54      1.57
DCCC          1.31      21.1      22.8          0.11      1.78      1.79

Table 4. Rotation angle (degrees)
            Bias·10   Var·10²   MSE·10²       Bias·10   Var·10²   MSE·10²
NOW           0.97      43.5      44.4         -0.05      0.31      0.31
DCCF          0.36       9.8       9.9          0.01      0.22      0.22
DCCC          0.37       9.7       9.9         -0.05      0.23      0.23

Tables 1-4. Estimation accuracy of the synthetic tests for a small and a large number of blocks.
3.2. Experimental tests with unknown global motion parameters
Pairs of frames from two standard test image sequences (viz: "Foreman" and "Table-tennis") were extracted because of their global motion characteristics. The sub-sequences were alternately cut forward and back. The motion vectors were estimated from 8 × 8 pixel blocks and collected to simultaneously estimate the four motion parameters. Some significant results of the three estimates (NOW, DCCF, and DCCC) of the rotation angle from the "Foreman" sequence (320 blocks per frame, 141 frames) and of the zoom factor from the "Table-tennis" sequence (1176 blocks per frame, 93 frames) are shown in Figs. 1-2.
Fig. 1. Estimates of the rotation angle (NOW, DCCF, DCCC) from an experimental sub-sequence (frames 90-130 of "Foreman"; rotation angle in degrees vs. frame number).

Fig. 2. Estimates of the zoom factor (NOW, DCCF, DCCC) from an experimental sub-sequence (frames 50-90 of "Table-tennis"; zoom factor vs. frame number).

4. CONCLUDING DISCUSSION
The results of the synthetic tests, performed on actual standard images to validate the devised algorithm, have shown that the WMSE-based method is suited to the presence of a small number of blocks, while its accuracy is comparable to the simpler MSE-based method for a larger number of available data. No significant difference appears between the results obtained with weights derived from the local cross-correlation function or from the local cross-correlation coefficient. The method has also been applied to standard test image sequences. It visually appears that the WMSE criterion enhances the dynamic properties of the algorithm (this can be useful for fast tracking of the camera motion), while MSE-based estimates are usually smoother.
The specific criterion (namely: uniform MSE, CCF-based WMSE, CCC-based WMSE) should be chosen in practice according to the particular application. As a general remark, a simple MSE-based technique is suited for estimating the camera motion of high resolution image sequences, while both the examined WMSE-based methods should be preferred on small images, such as a region of interest extracted by a segmentation algorithm and containing a moving object. Future research investigations will include the case of multiple objects moving on a background. The mathematical problem then becomes a multi-linear regression, which can be solved after a proper clustering of the available measurements. REFERENCES
[1] J.K. Aggarwal and N. Nandhakumar, "On the computation of motion from sequences of images - A review", Proc. IEEE, Vol. 76, No. 8, 1988, pp. 917-935.
[2] G.J. Keesman, "Motion estimation based on a motion model incorporating translation, rotation and zoom", in Signal Processing IV: Theory and Applications, 1988, pp. 31-34.
[3] S.F. Wu and J. Kittler, "A differential method for simultaneous estimation of rotation, change of scale and translation", Signal Processing: Image Communication, Vol. 2, 1990, pp. 69-80.
[4] J.H. Moon and J.K. Kim, "On the accuracy and convergence of 2D motion models using minimum MSE motion estimation", Signal Processing: Image Communication, Vol. 6, 1994, pp. 319-333.
[5] Z. Eisips and D. Malah, "Global motion estimation for image sequence coding applications", Proc. 17th Conv. of Elec. and Electronics Eng., Israel, 1991, pp. 186-189.
[6] Y.T. Tse and R.L. Baker, "Global zoom/pan estimation and compensation for video compression", Int. Conf. Acoust. Speech Signal Proc., 1991, Vol. 4, pp. 2725-2728.
[7] G. Giunta, T.R. Reed and M. Kunt, "Image sequence coding using oriented edges", Signal Processing: Image Communication, Vol. 2, No. 4, 1990, pp. 429-440.
[8] M. Bierling and R. Thoma, "Motion compensating field interpolation using a hierarchically structured displacement estimator", Signal Processing, Vol. 11, 1986, pp. 387-404.
[9] Y. Ninomiya and Y. Ohtsuka, "A motion compensated interframe coding scheme for television pictures", IEEE Trans. Commun., Vol. COM-30, 1982, pp. 201-211.
[10] A. Amitay and D. Malah, "Global-motion estimation in image sequences of 3-D scenes for coding applications", Signal Processing: Image Communication, Vol. 6, 1995, pp. 507-520.
[11] M. Hoetter, "Differential estimation of the global motion parameters zoom and pan", Signal Processing, Vol. 16, 1989, pp. 249-265.
[12] P. Migliorati and S. Tubaro, "Multistage motion estimation for image interpolation", Signal Processing: Image Communication, Vol. 7, 1995, pp. 187-199.
[13] G. Jacovitti and G. Scarano, "Discrete time techniques for time delay estimation", IEEE Trans. on Signal Proc., Vol. 41, No. 2, 1993, pp. 525-533.
[14] C.H. Knapp and G.C. Carter, "Estimation of time delay in the presence of source or receiver motion", J. Acoust. Soc. Am., Vol. 61, No. 6, 1977, pp. 1545-1549.
[15] J.W. Betz, "Comparison of the deskewed short-time correlator and the maximum likelihood correlator", IEEE Trans. on Acoust. Speech Signal Proc., Vol. ASSP-32, No. 2, 1984, pp. 285-294.
[16] J.W. Betz, "Effects of uncompensated relative time companding on a broad-band cross correlator", IEEE Trans. on Acoust. Speech Signal Proc., Vol. ASSP-33, No. 3, 1985, pp. 505-510.
[17] E. Weinstein and D. Kletter, "Delay and Doppler estimation by time-space partition of the array data", IEEE Trans. on Acoust. Speech Signal Proc., Vol. ASSP-31, No. 6, 1983, pp. 1523-1535.
A simple cue-based camera calibration method for digital production of moving images
Y. Nakazawa, T. Komatsu and T. Saito
Department of Electrical Engineering, Kanagawa University, 3-27-1 Rokkakubashi, Kanagawa-ku, Yokohama, 221, Japan
One of the keys to new-generation digital image production applicable even to domestic uses is to construct simple methods for estimating the camera's motion, position and orientation from a moving image sequence observed with a single domestic video camera. For that purpose, we present a method for camera calibration and estimation of focal length. The method utilizes four definite coplanar points, e.g. the four corner points of an A4-size paper, as a cue. Moreover, we apply the cue-based method to the digital image production task of mixing real and CG moving image sequences. The cue-based method works well for the digital image mixing task.
1. INTRODUCTION - BACKGROUND AND MOTIVATION -
Recently some research institutes have started studying digital production of a panoramic image sequence from an observed moving image sequence, construction of a virtual studio with 3-D CG technology and so on, with the intent to establish the concept and the schema of the new-generation digital image production technology. Such an image production technology, utilizing information about the camera's motion, position, orientation and so on, integrates consecutive image frames to produce such an enhanced image as a high-resolution panorama, or mixes a synthetic 3-D CG image sequence and a real moving image sequence taken with a video camera. The key to the new-generation digital image production applicable even to domestic uses is to develop simple methods for estimating the camera's motion, position and orientation from a real moving image sequence observed with a single video camera [1]-[4]. In this paper, to render it feasible to perform such 3-D estimation of the camera's motion, position and orientation when we use a single domestic handy video camera whose camera parameters are not given in advance, we present a method for performing camera calibration along with accurate estimation of the focal length of the camera, by using four definite coplanar points, which usually correspond to the four vertices of a certain quadrilateral plane object such as an A4-size paper, as a cue. The practical computational algorithms for the cue-based method of camera calibration are composed of simple linear algebraic operations and arithmetic operations, and hence they work well enough to provide accurate estimates of the camera's motion, position and orientation stably. Furthermore, in this paper, we apply the cue-based camera calibration method to the image production task of mixing a synthetic 3-D CG image sequence and a real moving image sequence taken with a video camera according to the recovered estimates of the camera's motion, position and orientation.
2. CUE-BASED CAMERA CALIBRATION
In this paper, we assume the following situation: while moving the single video camera arbitrarily by hand, we image a scene which includes not only the objects of interest but also four definite coplanar points P_1-P_4, whose relative positions are known in advance and which usually correspond to the four vertices of a certain quadrilateral plane object with known shape; these are used as a cue for camera calibration. We perform camera calibration, that is to say, determination of the camera's position and orientation at each image frame, from the 2-D spatial image coordinates of the four definite coplanar cue points, which are detected with our recently presented active line-contour model [5] and tracked temporally over consecutive image frames. Under such conditions, we perform camera calibration and estimate the focal length f of the camera at the same time.
2.1. Image Coordinate System
Here for each image frame we define the 3-D viewing coordinate system o'-x'y'z' which is associated with the 2-D image coordinate system O-XY as shown in figure 1. We represent the 3-D viewing coordinates and the 2-D image coordinates with (x', y', z') and (X, Y) respectively. We represent the 3-D viewing coordinates of the four coplanar cue points P_1-P_4 with { p'_i = (x'_i, y'_i, z'_i)^t ; i = 1, 2, 3, 4 }, and we represent the 2-D image coordinates of the imaged coplanar cue points, perspectively projected onto the image plane, with { P_i = (X_i, Y_i)^t ; i = 1, 2, 3, 4 }.

Figure 1. Coordinate systems (3-D world coordinate system with basis vectors m_1, m_2, m_3).

2.2. Camera Calibration
The problem of camera calibration is to recover the geometrical transformation of the 3-D world coordinates of an arbitrary point in the imaged scene into its corresponding 2-D image coordinates, from given multiple pairs of 3-D world coordinates and their corresponding 2-D image coordinates. The camera calibration problem is concisely formulated with homogeneous coordinate systems. Given both the 4-D homogeneous world coordinates a = (x, y, z, 1)^t of an arbitrary point in the imaged scene and the corresponding 3-D homogeneous image coordinates b = h \cdot (X, Y, 1)^t, the foregoing transformation is represented as the linear transformation defined as follows:

(x'\; y'\; z')^t = M \cdot (x\; y\; z\; 1)^t = (m_1\; m_2\; m_3\; m_4) \cdot (x\; y\; z\; 1)^t = x\,m_1 + y\,m_2 + z\,m_3 + m_4 \qquad (1)

X = \frac{x'}{z'}\,f\,, \qquad Y = \frac{y'}{z'}\,f \qquad (2)
where the focal length f is explicitly handled. Here the camera calibration problem is defined as the problem of recovering the 3 × 4 matrix M and the focal length f of equation 1 from given multiple pairs of homogeneous world coordinates and their corresponding homogeneous image coordinates. Equation 1 means that the 3-D viewing coordinates (x', y', z') are expressed as a linear combination of the three vectors {m_1, m_2, m_3}, and hence we may regard the three vectors {m_1, m_2, m_3} as the basis vectors of the 3-D world coordinate system o-xyz. On the other hand, the vector m_4 is the displacement vector shifting from the origin of the 3-D viewing coordinate system to that of the 3-D world coordinate system. Here we imagine a plane quadrilateral whose four vertices are given by the four definite coplanar cue points, and we refer to it as the cue quadrilateral. As a common coordinate system for all image frames, we define the 3-D world coordinate system o-xyz whose x-y cross section contains the cue quadrilateral, that is to say, whose z-axis is normal to the cue quadrilateral. Moreover, without loss of generality, we put the origin of the 3-D world coordinate system o-xyz at one of the coplanar cue points, e.g. P_1. In this case, we can represent the 3-D world coordinates of the four coplanar cue points P_1-P_4 with
p_1 = (x_1\; y_1\; z_1)^t = 0\,, \qquad p_i = (x_i\; y_i\; z_i)^t = (x_i\; y_i\; 0)^t\,;\; i = 2, 3, 4 \qquad (3)
Assuming that the focal length f of the camera is accurately estimated in some way or other, which will be described in the next section, we can easily recover the 3 × 4 transformation matrix M of equation 1 from the four pairs of the 3-D world coordinates p_i = (x_i, y_i, 0)^t of each cue point P_i and its corresponding image coordinates P_i = (X_i, Y_i)^t. Substituting the four coordinate pairs into equation 1, we reach the simultaneous equations:
(x'_i\; y'_i\; z'_i)^t = N \cdot (x_i\; y_i\; 1)^t = \begin{pmatrix} n_{11} & n_{12} & n_{14} \\ n_{21} & n_{22} & n_{24} \\ n_{31} & n_{32} & n_{34} \end{pmatrix} (x_i\; y_i\; 1)^t\,, \quad X_i = \frac{x'_i}{z'_i}\,, \;\; Y_i = \frac{y'_i}{z'_i}\,;\; i = 1, 2, 3, 4 \qquad (4)
where the focal length f is implicitly included in the expression of the 3 × 3 matrix N, and the matrix
N is related to the matrix M as follows:

\begin{pmatrix} m_{11} & m_{12} & m_{14} \\ m_{21} & m_{22} & m_{24} \\ m_{31} & m_{32} & m_{34} \end{pmatrix} = \begin{pmatrix} n_{11}/f & n_{12}/f & n_{14}/f \\ n_{21}/f & n_{22}/f & n_{24}/f \\ n_{31} & n_{32} & n_{34} \end{pmatrix} \qquad (5)
The simultaneous equations given by equation 4 are linear with respect to the nine unknown matrix components n, and we can easily solve them. However, their solution is expressed up to one scale factor, and hence here we set the value of the matrix component n_{34} to one. Moreover, given the focal length f of the camera, we can recover the column vectors {m_1, m_2, m_4} of the matrix M by applying the relation of equation 5. With regard to the column vector m_3 of the matrix M, we should employ a vector which is normal to both column vectors {m_1, m_2}, e.g.

m_3 = \frac{|m_1|}{|m_1 \times m_2|}\,(m_1 \times m_2) \qquad (6)
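For illustration, the eight remaining components of N can be solved with a small linear system. The Python sketch below (names assumed, n_{34} fixed to one as above) is one possible implementation under those assumptions, not the authors' code.

```python
import numpy as np

def solve_n(world_xy, image_XY):
    """Solve eq. 4 for the eight unknown components of N (n34 = 1), from the
    four world-plane cue points (x_i, y_i) and their images (X_i, Y_i).
    Each point pair yields two linear equations:
      n11 x + n12 y + n14 - X (n31 x + n32 y) = X
      n21 x + n22 y + n24 - Y (n31 x + n32 y) = Y
    """
    A, b = [], []
    for (x, y), (X, Y) in zip(world_xy, image_XY):
        A.append([x, y, 1, 0, 0, 0, -X * x, -X * y]); b.append(X)
        A.append([0, 0, 0, x, y, 1, -Y * x, -Y * y]); b.append(Y)
    n = np.linalg.solve(np.array(A, float), np.array(b, float))
    # Rows correspond to (n11 n12 n14), (n21 n22 n24), (n31 n32 n34)
    return np.array([[n[0], n[1], n[2]],
                     [n[3], n[4], n[5]],
                     [n[6], n[7], 1.0]])
```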
Thus we can recover the 3 × 4 transformation matrix M of equation 1.

2.3. Estimation of Focal Length
Once we recover the foregoing transformation matrix N of equation 4, we can estimate the relative depth z'_i of each coplanar cue point P_i as follows:

z'_i = m_{31} x_i + m_{32} y_i + m_{34} = n_{31} x_i + n_{32} y_i + n_{34} \qquad (7)

Thus we get an estimate of the 3-D viewing coordinates p'_i = (x'_i, y'_i, z'_i)^t of each coplanar cue point P_i as follows:

p'_i = \left( \frac{X_i\,z'_i}{f}\,,\; \frac{Y_i\,z'_i}{f}\,,\; z'_i \right)^t \qquad (8)

The lengths of the four sides of the cue quadrilateral are assumed to be known in advance; furthermore, taking account of the fact that the ratio of the lengths of two sides arbitrarily chosen out of the four sides is invariant irrespective of the definition of the 3-D coordinate system, we get the relation:

\frac{|p'_2 - p'_1|^2}{|p'_4 - p'_1|^2} = r \qquad (9)

Substituting equation 8 along with equation 7 into equation 9, we obtain a quadratic equation with respect to the focal length f. The solution is given by

f = \sqrt{\frac{r\,C - A}{B - r\,D}} \qquad (10)

where

A = (X_2 z'_2 - X_1 z'_1)^2 + (Y_2 z'_2 - Y_1 z'_1)^2\,, \quad B = (z'_2 - z'_1)^2\,,
C = (X_4 z'_4 - X_1 z'_1)^2 + (Y_4 z'_4 - Y_1 z'_1)^2\,, \quad D = (z'_4 - z'_1)^2\,.
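Given N, equations (7)-(10) translate directly into code. The sketch below assumes the cue points are ordered P_1..P_4 and that r is the known squared side-length ratio of the cue quadrilateral; it is an illustration, not the authors' implementation.

```python
import numpy as np

def focal_length(N, world_xy, image_XY, r):
    """Eq. 10: recover f from the relative depths of cue points 1, 2 and 4."""
    def depth(x, y):                       # eq. 7 (third row of N)
        return N[2, 0] * x + N[2, 1] * y + N[2, 2]
    z = [depth(x, y) for (x, y) in world_xy]
    (X1, Y1), (X2, Y2), (X4, Y4) = image_XY[0], image_XY[1], image_XY[3]
    A = (X2 * z[1] - X1 * z[0]) ** 2 + (Y2 * z[1] - Y1 * z[0]) ** 2
    B = (z[1] - z[0]) ** 2
    C = (X4 * z[3] - X1 * z[0]) ** 2 + (Y4 * z[3] - Y1 * z[0]) ** 2
    D = (z[3] - z[0]) ** 2
    return np.sqrt((r * C - A) / (B - r * D))
```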
3. DIGITAL MOVING IMAGE MIXING
We have imaged the scene in our laboratory while moving an 8-mm domestic handy video camera arbitrarily by hand, and then we have applied the foregoing cue-based camera calibration method to the moving image sequence, each image frame of which is composed of 720 × 486 pixels. In the imaged scene we have put an A4-size paper on the table, and we have used the four corner points of the A4-size paper as the four coplanar cue points. Moreover, we have put a book on the table as a still obstacle. We have detected the four edges of the A4-size paper with our recently presented active line-contour model [5], identified its four corner points as their intersections, and thus obtained estimates of the image coordinates of the four corner points.
Figure 2. Image frames chosen from the resultant mixed moving image sequence.
We have performed the digital image production task of mixing a synthetic 3-D CG image sequence of two moving toy cars and a moving toy robot with the real moving image sequence of our laboratory, according to the recovered estimates of the camera's motion, position and orientation. Figure 2 shows some image frames chosen from the resultant mixed moving image sequence. As shown in figure 2, we can hardly identify any artificial distortions in the mixed image sequence, which demonstrates that the cue-based camera calibration method works well for the foregoing digital moving image mixing task.

4. CONCLUSIONS
In this paper, we have presented a method for performing camera calibration along with accurate estimation of the focal length of the camera, by using four definite coplanar points as a cue. The practical computational algorithms for the cue-based method of camera calibration are composed of simple linear algebraic operations and arithmetic operations, and hence they work well enough to provide accurate estimates of the camera's motion, position and orientation stably. Moreover, we have applied the cue-based camera calibration method to the digital moving image production task of mixing a synthetic 3-D CG image sequence and a real moving image sequence taken with a domestic video camera according to the recovered estimates of the camera's motion, position and orientation. Experimental simulations demonstrate that the cue-based camera calibration method works well for the digital moving image mixing task. The key to accurate cue-based camera calibration is how to detect the feature points used as a cue in an input image highly accurately. Sub-pixel accuracy will possibly be required for the detection task. To detect the feature points with sub-pixel accuracy, we should first enhance the spatial resolution of the image region containing the feature points. It seems that we can apply our recently presented temporal-integration resolution-enhancement method [6] to this purpose. Moreover, to complete a practical image processing algorithm for the digital moving image mixing task, in addition to the camera calibration, we should take account of many other points, that is to say, occlusion between real objects and synthetic CG objects, the 3-D shape of real objects, and so on. Further studies on these points will be required.

REFERENCES
1. K. Deguchi, "Image of 3-D Space: Mathematical Geometry of Computer Vision", Shoukodo Press, Tokyo, Japan, 1991.
2. H.C. Longuet-Higgins, "A Computer Algorithm for Reconstructing a Scene from Two Projections", Nature, 293 (1981) 133.
3. R. Horaud, et al., "An Analytic Solution for the Perspective 4-Point Problem", Computer Vision, Graphics, and Image Processing, 47 (1989) 33.
4. C.J. Poelman and T. Kanade, "A Paraperspective Factorization Method for Shape and Motion Recovery", Lecture Notes in Computer Science, 801 (1994) 97.
5. Y. Nakazawa, T. Komatsu and T. Saito, "A Robust Object-Specified Active Contour Model for Tracking Smoothly Deformable Line-Features and Its Practical Application to Outdoor Moving Image Processing", IEEE 1996 International Conference on Image Processing, 17P8.13, 1996.
6. Y. Nakazawa, T. Komatsu and T. Saito, "Temporal Integration Method for Image-Processing-Based Super-High-Resolution Image Acquisition", IEE Proc. Vis. Image Signal Process., 143 (1996), in press.
Exploration of the environment with optical sensors mounted on a mobile robot
P. Weckesser, A. von Essen, G. Appenzeller, R. Dillmann
Institute for Real-time Computer Systems & Robotics, Prof. Dr. U. Rembold, Prof. Dr. R. Dillmann, University of Karlsruhe, Department for Computer Science, Karlsruhe, Germany

The exploration of unknown environments is an important task for the new generation of mobile service robots. These robots are supposed to operate in dynamic and changing environments together with human beings and other static or moving objects. Sensors that are capable of providing the quality of information that is required for the described scenario are optical sensors like digital cameras and laserscanners. In this paper sensor integration and fusion for such sensors is described. Complementary sensor information is transformed into a common representation in order to achieve a cooperating sensor system.

1. Introduction
In this paper an approach to fuse sensor information from complementary sensors is presented. The mobile robot PRIAMOS (figure 1) was used as an experimental testbed. A multisensor system supports the vehicle with odometric, sonar, visual and laserscanner information. This work is part of a large project with the goal of making robot navigation safer, faster, more reliable and more stable under changing environmental conditions. An architecture for active and task-driven processing of sensor data is presented in [10]. With this architecture it is possible to control the sensor system according to environmental conditions, perceived sensor information, a priori knowledge and the task of the robot. The system's performance is demonstrated for the task of exploring an unknown environment and incrementally building up a geometrical model of it.
Figure 1. PRIAMOS
Sensor fusion is performed by matching the local perception of a laserscanner and a camera system with a global model that is being built up incrementally. The Mahalanobis distance is used as matching criterion and a Kalman filter is used to fuse matching features. A common representation including the uncertainty and the confidence is used for all scene features.

1.1. Mobile robot navigation
Navigation tasks of a mobile robot can be subdivided into three subproblems.
1. collision avoidance: this is the basic requirement for safe navigation. The problem of collision avoidance is solved for dynamic environments with different kinds of sensors like sonars or laserscanners.
2. mobile robot positioning: if geometrical a priori information about the environment is available to the robot, the following questions can be asked: 'Where am I?', 'Where am I going?' and 'How do I get there?' [7]. With today's sensors these questions can be answered for static environments [6], though the problem is not solved in general for dynamic and changing environments.
3. exploration and environmental modelling: the problem of exploring an unknown environment was approached by various groups [4,1,9] but is by far not solved. Most approaches aim at building up a 2-dimensional map of a static environment. In this paper a 3-dimensional map of the environment is built up with an integrated use of a laserscanner and a trinocular vision system. The laserscanner only provides 2-dimensional information. The vision system is capable of perceiving the environment 3-dimensionally.

The goal of this paper is to develop and to apply sensor fusion techniques in order to improve the system's performance for 'mobile robot positioning' and 'exploration of unknown environments'. The approach is able to deal with static as well as dynamic environments. On different levels of processing, geometrical, topological and semantical models are generated (exploration) or can be used as a priori information (positioning). The system's performance is demonstrated for the task of building a geometrical model of an unknown environment.

2. Obtaining 3D descriptions of the scene
In this section the reconstruction of the 3-dimensional scene with the trinocular vision system and the laserscanner is described. As scene features linear edge segments are used. These edge segments are represented by midpoint, direction-vector and half-length. The uncertainty of the segments is represented by a covariance matrix [3]. The xz-plane is the ground plan of the coordinate system and the y-axis represents the height. As the laserscanner only provides 2-dimensional data, the y-coordinate is always equal to the height at which the sensor is mounted on the robot.
2.1. 3D reconstruction from trinocular stereo
The process to reconstruct scene features from camera images is relatively complex, but it is possible to derive a 3-dimensional description of the scene. The first step of stereo imaging is the calibration of the cameras. In [12] a photogrammetric approach to highly accurate camera calibration of zoom lenses is developed. The result of the calibration is the matrix M_{DLT} which describes the transformation from scene- to image-coordinates for a camera. In homogeneous coordinates this transformation is given by

(w^i u_1^i \;\; w^i u_2^i \;\; w^i)^t = M_{DLT}\,(x \;\; y \;\; z \;\; 1)^t . \qquad (1)
In the presented system linear edge segments, which are extracted from the camera images in real-time, are used as image and scene features. It is possible to reconstruct scene features if corresponding image features in at least two camera images are known. This means that the stereo correspondence problem has to be solved. In [8] a trinocular stereo-matching algorithm using the epipolar constraint combined with a local normalized cross-correlation technique has been developed. The stereo matching algorithm provides corresponding image points (u^i, v^i) in the camera images. For the presented system it was experimentally proved that the uncertainty of the matches can generally be estimated to be below one pixel. For the stereo-reconstruction of a scene point the following overconstrained linear system has to be solved:

A\,p = b\,, \qquad p = (x \;\; y \;\; z)^t = (A^T A)^{-1} A^T b . \qquad (2)
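A least-squares triangulation in the spirit of eq. (2) can be written compactly. The row construction below is a standard rearrangement of the DLT projection of eq. (1) and an assumption about the paper's exact A and b; all names are illustrative.

```python
import numpy as np

def triangulate(dlt_matrices, image_points):
    """Stack the projection constraints of eq. 1 for two or more views and
    solve the overconstrained system A p = b of eq. 2 by least squares.
    Each 3x4 M_DLT and matched pixel (u, v) contributes two rows."""
    A, b = [], []
    for M, (u, v) in zip(dlt_matrices, image_points):
        # From w*u = M[0]·(x,y,z,1) and w = M[2]·(x,y,z,1):
        A.append(M[0, :3] - u * M[2, :3]); b.append(u * M[2, 3] - M[0, 3])
        A.append(M[1, :3] - v * M[2, :3]); b.append(v * M[2, 3] - M[1, 3])
    A, b = np.array(A), np.array(b)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)   # p = (A^T A)^-1 A^T b
    return p
```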
In [5] it is shown that the uncertainty of the reconstruction of a scene point can be written in a first order approximation by a covariance matrix

\Sigma_p = J\,\Sigma_{uv}\,J^T \quad \text{with} \quad J = \frac{\partial\left((A^T A)^{-1} A^T b\right)}{\partial(u^i, v^i)} . \qquad (3)
Are, asonable estimation for E~,,~ is given by Eu,,, = 1. In order to reconstruct a line segment the endpoints of the lille Pl and P2 are reconstructed. The equations 4 to 8 (lescril)e the representation of a line segment by midpoint, normalized dire, ctiou-vector and halflength and tile correspon(ling covariance matrices for the representation of the m~certainty. For a minimal reI)resentation there is no uncertainty ret)resente(1 for the halflength. m -
midpoint
(4)
r -
m-v2 [[PI --O2[[
normalized direction
(5)
l =
[[Pl-P2112
ha lflength
(6)
E m ---
Epl +Ep:~.4
(:()variance of midI)oint
(7)
Er -
r..~ +r.p22 [[Pl -p2ll
covariance of direction
(8)
-
-
p~+p22
Tlle state vector for a line segment in rol)ot coordinates is given by k r - (m, r, l) T
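Equations 4-8 map directly to code; a minimal sketch with assumed names follows.

```python
import numpy as np

def segment_state(p1, p2, cov1, cov2):
    """Eqs. 4-8: midpoint / direction / halflength representation of a 3-D
    edge segment with first-order covariances of midpoint and direction."""
    d = p1 - p2
    length = np.linalg.norm(d)
    m = 0.5 * (p1 + p2)                   # eq. 4
    r = d / length                        # eq. 5
    l = 0.5 * length                      # eq. 6
    cov_m = 0.25 * (cov1 + cov2)          # eq. 7
    cov_r = (cov1 + cov2) / length ** 2   # eq. 8
    return m, r, l, cov_m, cov_r
```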
2.2. 3D descriptions obtained by a laserscanner
Figure 2 shows a laserscan acquired in a corridor environment. The sensor data provided by the laserscanner are 2-dimensional (ground plan), so the y-coordinate is always 0. In order to show the quality of the laserscan the CAD-model of this environment is overlayed in grey lines.

Figure 2. Raw data from a laserscan.
Figure 3. Example of edge extraction by iterative end point fit combined with least-square approximation.
An experimental evaluation of the accuracy of the laserscanner measurements was carried out, with the result that within a distance of 10 meters the variance of the distance measurement and the variance perpendicular to the measuring direction can be estimated to be

\sigma_{\parallel}^2 = (2\,\mathrm{cm})^2 = 4\,\mathrm{cm}^2\,, \qquad \sigma_{\perp}^2 = d^2 \tan^2(0.25°) . \qquad (9)

From the scanner's polar representation of the measurements a cartesian representation is computed, which results in the following covariance matrix for the uncertainty of a single scan point of the laserscanner:

\Sigma = \begin{pmatrix} \sigma_{\parallel}^2 \cos^2\alpha + \sigma_{\perp}^2 \sin^2\alpha & (\sigma_{\parallel}^2 - \sigma_{\perp}^2)\cos\alpha\,\sin\alpha \\ (\sigma_{\parallel}^2 - \sigma_{\perp}^2)\cos\alpha\,\sin\alpha & \sigma_{\parallel}^2 \sin^2\alpha + \sigma_{\perp}^2 \cos^2\alpha \end{pmatrix}

The next step of processing is the extraction of linear edge segments from the laserscan. This is done by using the iterative end-point fit algorithm for the determination of points belonging to a line segment. A least square solution is applied to compute a symbolic representation as defined by equations 4 to 8 (see also [11]). This is displayed in figure 3.
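A compact recursive form of the iterative end-point fit is sketched below; the split threshold, the 2-D point layout and all names are assumptions, and degenerate chords (coincident endpoints) are not handled.

```python
import numpy as np

def iterative_end_point_fit(points, tol=0.05):
    """Recursive split of an ordered chain of 2-D scan points: if the point
    farthest from the end-to-end chord exceeds tol (metres), split there;
    otherwise accept the chain as one linear segment candidate."""
    p0, p1 = points[0], points[-1]
    chord = p1 - p0
    n = np.array([-chord[1], chord[0]]) / np.linalg.norm(chord)  # chord normal
    dist = np.abs((points - p0) @ n)
    k = int(np.argmax(dist))
    if dist[k] > tol and 0 < k < len(points) - 1:
        return (iterative_end_point_fit(points[: k + 1], tol)
                + iterative_end_point_fit(points[k:], tol))
    return [points]   # each accepted chain is then least-squares fitted
```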
2.3. Transformation to world coordinates
The line segments are so far represented in robot coordinates. The state vector of the robot is given by x_r = (x_r, z_r, \phi)^T. For a common robot-independent representation the transformation to world coordinates is necessary. This transformation is given by the following rotation and translation:

R = \begin{pmatrix} \cos\phi & 0 & \sin\phi \\ 0 & 1 & 0 \\ -\sin\phi & 0 & \cos\phi \end{pmatrix}\,, \qquad T = (x_r \;\; 0 \;\; z_r)^t . \qquad (10)
The transformation of the state vector in robot coordinates k^r = (m, r, l) to the state vector in world coordinates k^w becomes k^w = (Rm + T, Rr, l). The propagation of the uncertainty of a random vector x with covariance matrix \Sigma_x under the transformation y = f(x) is in a first order approximation given by

\Sigma_y = \left.\frac{\partial f(x)}{\partial x}\right|_{x=\hat{x}} \Sigma_x \left.\frac{\partial f(x)}{\partial x}\right|_{x=\hat{x}}^T . \qquad (11)
With this equation the uncertainty of the state vector in world coordinates [5] becomes

\Sigma_m^w = R\,\Sigma_m\,R^T + \left(\frac{\partial R}{\partial\phi}\,m\right) \Sigma_R \left(\frac{\partial R}{\partial\phi}\,m\right)^T + \Sigma_T \qquad (12)

and

\Sigma_r^w = R\,\Sigma_r\,R^T + \left(\frac{\partial R}{\partial\phi}\,r\right) \Sigma_R \left(\frac{\partial R}{\partial\phi}\,r\right)^T \qquad (13)

with \Sigma_R and \Sigma_T being the uncertainties in R and T.

3. Exploration
In order to explore an environment the robot is provided with a topological model and a certain mission (direction and distance to travel) is specified. The geometrical world model is built up incrementally. The local perception is matched with the global model and, if possible, fused according to the following section.

3.1. Fusion of symbolic edge segments
In this work linear edge segments are represented by midpoint, normalized direction vector and halflength because this representation is advantageous for the fusion of segments. This means an edge segment is defined by a state vector k = (m_x, m_y, m_z, r_x, r_y, r_z, l)^T and the uncertainty is given by the covariance matrix \Sigma_k. In order to find corresponding scene features (nearest neighbor matching) in the local perception and the global model, the Mahalanobis distance is applied. The Mahalanobis distance is a distance criterion for two state vectors normalized by the sum of their covariance matrices. The squared Mahalanobis distance is given by

d^2(k_0, k_1) = (k_0 - k_1)^T (\Sigma_{k_0} + \Sigma_{k_1})^{-1} (k_0 - k_1) . \qquad (14)
The squared Mahalanobis distance defines a \chi^2-distribution [2] for the means k_0 = k_1. For a segment in the local perception the nearest neighbor in the global model is defined by the minimal Mahalanobis distance. The two segments can be fused using a Kalman filter:

K = \Sigma_0 (\Sigma_0 + \Sigma_1)^{-1} \qquad (Kalman gain) \qquad (15)
k = k_0 - K (k_0 - k_1) \qquad (fused segment) \qquad (16)
\Sigma = (I - K)\,\Sigma_0 \qquad (covariance matrix) \qquad (17)

Every time an edge segment is fused with a segment from the local perception its confidence is incremented by one [3]. With multiple observations it is thus possible to reject segments from the world model which have only been observed once and could be dynamic objects or artefacts.
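Putting eqs. (14)-(17) together, a matching-and-fusion loop might look as follows; the chi-square gate value, the data layout and all names are assumptions made for illustration.

```python
import numpy as np

def mahalanobis2(k0, S0, k1, S1):
    """Squared Mahalanobis distance of eq. 14."""
    d = k0 - k1
    return d @ np.linalg.inv(S0 + S1) @ d

def fuse(k0, S0, k1, S1):
    """Kalman-filter fusion of two matched segments (eqs. 15-17)."""
    K = S0 @ np.linalg.inv(S0 + S1)       # eq. 15
    k = k0 - K @ (k0 - k1)                # eq. 16
    S = (np.eye(len(k0)) - K) @ S0        # eq. 17
    return k, S

def match_and_fuse(local, model, gate=9.0):
    """Nearest-neighbour matching against the global model; fuse when the
    squared distance passes the gate, otherwise insert the segment as a
    new, low-confidence entry (the paper's confidence counter)."""
    for k0, S0 in local:
        d2 = [mahalanobis2(k0, S0, k1, S1) for k1, S1, _ in model]
        if d2 and min(d2) < gate:
            j = int(np.argmin(d2))
            k1, S1, conf = model[j]
            kf, Sf = fuse(k0, S0, k1, S1)
            model[j] = (kf, Sf, conf + 1)
        else:
            model.append((k0, S0, 1))
    return model
```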
3.2. Ground plan exploration
In the following example (figure 4) the robot has driven about 50 meters through a hallway environment and incrementally generated a geometrical map using the laserscanner and the above described techniques.
Figure 4. Sensor fusion with laserscanner.
The accuracy of the generated map is basically defined by the odometry of the robot. This means that this map could be used as a priori information for later navigation tasks in order to enable position correction.
3.3. 3-dimensional map building
In this second example an approach to fusing 3-dimensional stereo-reconstructions is presented. In this case the robot is located in a hallway corner. In figure 6 the reconstructions from a sequence of stereo triples (figure 5) are displayed, again overlayed with the exact CAD-model of the environment. Here the 3-D reconstruction is computed fully automatically, with an accuracy of ± … for the vertical edges.
Figure 5. Camera image from a sequence of stereo-triples.
Figure 6. Reconstruction from a sequence of stereo-triples.
The stereo vision system is the most powerful sensor currently used on PRIAMOS. It requires more effort to reconstruct scenes by stereo than with a laserscanner, though.
4. Conclusion
In this paper it was shown that stereo-reconstructions and planar laserscans can be represented in a common format. This enables the fusion of complementary sensor information. Experimental results were presented for 3-dimensional map building of an unknown environment. More mathematical details can be found in [11]. The accuracy of the generated model enables using this model for position correction during later navigation tasks. It will also be the goal to derive topological and semantic knowledge from the geometrical reconstructions using generic object models. The planar laserscanner will soon be replaced by a 3-D scanner that provides depth images with a resolution of 128 × 64 pixels at a rate of 5 Hz. This step will again improve the performance of the presented system, especially for 3-D modelling.
Acknowledgement
The authors would like to thank F. Abegg and C. Wetzel for their contributions to the implementation of this work. This work was performed at the Institute for Real-Time Computer Control Systems & Robotics, Prof. Dr.-Ing. U. Rembold and Prof. Dr.-Ing. R. Dillmann, Department of Computer Science, University of Karlsruhe, 76128 Karlsruhe, Germany.
REFERENCES
1. F. Arman and J.K. Aggarwal. Model-based object recognition in dense range images - a review. ACM Computing Surveys, 25(1):5-43, 1993.
2. J.L. Crowley. Principles and Techniques for Sensor Data Fusion, volume 99 of NATO ASI, Multisensor Fusion for Computer Vision, chapter 1, pages 15-36. J.K. Aggarwal, 1989.
3. J.L. Crowley, P. Stelmaszyk, T. Skordas, and P. Puget. Measurement and integration of 3-D structures by tracking edge lines. International Journal of Computer Vision, 8(1):29-52, 1992.
4. T. Edlinger and G. Weiß. Exploration, navigation and self-localization in an autonomous mobile robot. In Autonome Mobile Systeme, pages 142-151. Springer-Verlag, 1995.
5. O.D. Faugeras. Three-Dimensional Computer Vision. MIT Press, 1995.
6. L. Feng, J. Borenstein, and H.R. Everett. Where am I? Sensors and Methods for Autonomous Mobile Positioning. University of Michigan, 1994.
7. J.J. Leonard and H.F. Durrant-Whyte. Directed Sonar Sensing for Mobile Robot Navigation. MIT Press, 1992.
8. P. Steinhaus. Bin- und trinokulare Matchingalgorithmen basierend auf Korrelationsverfahren und epipolarer Geometrie, 1996. Studienarbeit, Institut für Prozeßrechentechnik und Robotik, Universität Karlsruhe, Deutschland.
9. E. Triendl and D.J. Kriegman. Stereo vision and navigation within buildings. In International Conference on Robotics and Automation. IEEE, 1987.
10. P. Weckesser, G. Appenzeller, A. von Essen, and R. Dillmann. Exploration of the environment with an active and intelligent optical sensor system. In International Conference on Intelligent Robots and Systems; Human Robot Interaction and Cooperative Robots. IROS, November 1996.
11. P. Weckesser and R. Dillmann. Sensor-fusion of intensity- and laser-range-images. In MFI, 1996.
12. P. Weckesser and G. Hetzel. Photogrammetric calibration methods for an active stereo vision system. In A. Borkowsky and J. Crowley, editors, Intelligent Robotic Systems (IRS), pages 326-333, 1994.
D. IMAGE CODING AND TRANSMISSION
Time-Varying Image Processing for 3D Model-Based Video Coding (1)
T. S. Huang, R. Lopez, A. Colmenarez
University of Illinois at Urbana-Champaign, 405 N. Mathews Ave., Urbana, IL 61801
Abstract This paper presents a framework for tracking moving features using a 3D model and an analysis-by-synthesis approach. First, the global pose is estimated with 2Dto-3D position correspondences of rigid features. Next, non-rigid features are tracked after compensating for the global pose. Our approach is tested in the context of facial feature and head pose tracking. Results confirm that the system is robust, accurate, and can be implemented in real-time. This framework finds applications ranging from model-based video coding to facial expression analysis to face recognition using video sequences.
1. Introduction
Time-varying imagery has been increasingly prevalent in computer vision and image processing. For computers to be able to interact with man, it is vital that they be able to analyze such data and extract higher level knowledge of the world they perceive. One of the most basic tasks, which has produced a wealth of research, is the "simple" act of recognizing and tracking objects in a time-varying scene. While this is intuitive to humans, it has been shown many times to be a formidable task for a computer in all but very limiting scenarios. Indeed, much of the work in automatic object tracking has sought to relax the constraints under which such systems will produce accurate results. Image understanding and motion analysis of video sequences are two aspects of computer vision that connect and relate to a number of applications ranging from video annotation to video coding to human computer interfaces.
(1) This work was supported in part by Joint Services Electronics Program Grant N00014-96-1-0129 and an AT&T Bell Labs Fellowship.
In most scenarios, objects are not rigid, because their motion with respect to the scene is combined with deformations and changes in lighting conditions. A framework for the automatic detection and tracking of moving objects in such complex conditions is very important in time-varying imagery, and would provide a tool for further analysis and understanding of the higher level aspects of the observed scene. In early work, detection and tracking of moving objects was carried out in limited scenarios and only superficial understanding of the scene was achieved [1-4]. Low level modeling such as snakes, active contours, and deformable templates introduced some improvements [5-7]. However, to obtain a robust and complete analysis of the objects and their motion, high level modeling is required. Note that in approaches such as analysis-by-synthesis, the scene understanding is limited by the model's capability to represent the scene. A particular example of major importance in time-varying imagery is the detection, tracking, and motion analysis of human faces. Two distinctive applications are: (i) very-low-rate video coding, and (ii) human-computer interfaces. In the former, motion analysis is used to improve the motion compensation used in compression schemes that code the displaced-frame-difference [9-11], and to drive the synthesis of the video sequences in pure model-based schemes [8]. In human-computer interfaces, the motion analysis of the face is useful for gesture recognition [12], lip reading [13], and other high level tasks. In this paper we present a scheme for model-based tracking of global and local motion that is robust, accurate, and can be implemented in real-time. The system consists of three modules acting in a feedback loop: (i) 3D object modeling, (ii) 2D-3D pose estimation, and (iii) synthesis-based template matching. Because the scheme performs analysis-by-synthesis using a 3D model, the system overcomes many common problems such as error accumulation over long sequences, lighting conditions, and occlusions. We demonstrate its use in the context of facial feature tracking with both synthetic and real image sequences. The synthetic sequences are helpful in examining the accuracy of the system with respect to ground truth. The real image sequences help test the robustness of the algorithm in practical applications.
2. System Overview
As indicated in the block diagram in Fig. 1, the system consists of three modules acting in a feedback loop: (i) 3D head modeling, (ii) feature detection, and (iii) pose estimation. Given a predicted head pose, the 3D head model provides a synthetic view from which templates are made. Features are then detected via template matching using the newly generated templates. Finally,
the 2D feature positions and their corresponding 3D positions in the head model are used to estimate the new head pose. Kalman filters are used to predict the head pose and the 2D feature locations using previous measurements. Note that if this procedure is applied repeatedly on the same frame, it can be considered an iterative approach to refining the current head pose estimation.
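The paper does not specify the predictor design; a constant-velocity Kalman filter is a common choice for this role. The sketch below, with assumed process and measurement noise levels q and r, predicts a pose or feature-position vector one frame ahead and updates it from each new measurement.

```python
import numpy as np

class ConstantVelocityPredictor:
    """One-step Kalman predictor for a pose or feature-position vector,
    under an assumed constant-velocity motion model (dt = 1 frame)."""
    def __init__(self, dim, q=1e-2, r=1e-1):
        self.x = np.zeros(2 * dim)             # state: [position, velocity]
        self.P = np.eye(2 * dim)
        self.F = np.eye(2 * dim)
        self.F[:dim, dim:] = np.eye(dim)       # position += velocity
        self.H = np.eye(dim, 2 * dim)          # only position is measured
        self.Q = q * np.eye(2 * dim)
        self.R = r * np.eye(dim)

    def predict(self):
        """Propagate the state one frame ahead; return predicted position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[: self.H.shape[0]]

    def update(self, z):
        """Correct the state with a measured position z."""
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
```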
"---
Features
KalmanPredictor
1~1
[
Render
) ~ [Synthetic View, and Feature Positions
/~
Predicted Pose
[
Estimated Pose I ~--- I PoseEstimator 1
I KalmanPredictorI ~
[ Feature Position l
( TemplateMaker ]
[Predicted Feature Position Video Sequence
I
/
Fig. 1. Feature-Based Global Head Pose Tracker: the block diagram.
The system is initialized assuming a front view in the first frame and the rough location of the facial features; the visual pattern recognition technique in [14] provides these initial facial feature locations. A feature-based pose estimation algorithm relies on the precise detection of the feature positions: small errors in the feature locations can produce large errors in the pose estimation. Good spatial localization is achieved with weighted template matching; the match error is weighted with a Gaussian function centered at the position of the features. Since the 3D locations of the facial features are known from the head model, no error in the feature locations is accumulated over the sequence.
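To make the weighted matching concrete, here is a minimal numpy sketch; the search radius, template size and Gaussian width are assumed values, and the exact error norm used by the authors is not specified.

```python
import numpy as np

def gaussian_weighted_match(frame, template, center, search=8, sigma=3.0):
    """Locate a feature by template matching with a Gaussian-weighted
    absolute error centered on the predicted feature position."""
    th, tw = template.shape
    yy, xx = np.mgrid[:th, :tw]
    w = np.exp(-(((yy - th / 2.0) ** 2 + (xx - tw / 2.0) ** 2)
                 / (2.0 * sigma ** 2)))
    cy, cx = center
    best_err, best_pos = np.inf, center
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = cy + dy - th // 2, cx + dx - tw // 2
            if (y0 < 0 or x0 < 0 or
                    y0 + th > frame.shape[0] or x0 + tw > frame.shape[1]):
                continue                       # candidate window off-frame
            patch = frame[y0:y0 + th, x0:x0 + tw].astype(float)
            err = np.sum(w * np.abs(patch - template))
            if err < best_err:
                best_err, best_pos = err, (cy + dy, cx + dx)
    return best_pos
```

Weighting the error toward the window center penalizes mismatches near the feature itself more than mismatches at the template border, which is what gives the sharper spatial localization described above.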
3. Head Modeling
For the purpose of this research, the object modeling, or more specifically head modeling, was performed with a 3D Cyberware range scanner (Fig. 2). While these scans provide detailed geometric data for a specific subject, the procedure is somewhat intrusive and requires the subject to be available for the scan. Preliminary experiments, however, indicate that crude, low-resolution models may be sufficient and that it may be possible to use an "average" head scan or a generic 3D head model. With the use of texture mapping techniques, the high resolution input image can be mapped onto a low resolution geometric surface and still provide accurate renderings of the necessary templates for a wide range of 3D poses. As part of the initialization, a texture for the generic model is obtained from the first frame by aligning the facial features with their correspondences in the 3D model.
Fig. 2. The Cyberware head scan data. (a) Color data. (b) High-resolution range data. (c) Low-resolution wireframe used in tracking.
4. Pose Estimation
One of the main steps in the proposed system is the update of the head pose from the corresponding 2D-3D facial features. A large set of 3D features is extracted manually from the initial head scan and a smaller subset of 2D features is obtained from the feature tracking module. Using these correspondences, we can compute an alignment transform that maps the 3D features to their 2D locations [15-17]. One difficulty is that, since only 3 feature points are used, the pose estimation results in 2 mathematically equivalent transforms, only one of which is correct for our purposes. We limit ourselves to 3 features (2 eye corners and nose tip) because of rigidity constraints and also to reduce complexity. However, a 4th feature point (mouth center) can be roughly approximated and used to resolve the ambiguity between the transforms. Since we know the 3D location of this 4th feature, the two resulting transforms can be applied to this point and compared to the estimated 2D position. The correct transform will result in a significantly smaller mapping error. Our pose computation module is based on the work in [17], where 3 points in each of the model and image are used for object alignment. The main underlying assumption made in this method is that 2D images of an object obey a weak perspective imaging model. Using this assumption we can apply an affine approximation to this model and obtain a closed form solution to the pose estimation problem.
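The disambiguation test is straightforward to express in code. In the hedged sketch below, the candidate transforms are modelled as 3 x 4 affine (weak perspective) camera matrices; all names are illustrative rather than taken from the paper.

```python
import numpy as np

def weak_perspective(T, p):
    """Apply a 3x4 affine camera matrix T to a 3D point and keep the
    two image coordinates (weak perspective drops the depth dependence)."""
    q = T @ np.append(np.asarray(p, float), 1.0)
    return q[:2]

def pick_transform(T_a, T_b, mouth_3d, mouth_2d):
    """Resolve the two-fold alignment ambiguity: project the roughly
    known 4th feature (mouth center) with both candidate transforms and
    keep the one landing closer to its estimated 2D position."""
    err_a = np.linalg.norm(weak_perspective(T_a, mouth_3d) - mouth_2d)
    err_b = np.linalg.norm(weak_perspective(T_b, mouth_3d) - mouth_2d)
    return T_a if err_a < err_b else T_b
```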
5. Experimental Results
Several sequences were tested with the proposed system, two of which are presented here. The sequences were captured at 30 fps and 320x240 resolution. The head models for each subject were created with the range scanner and the necessary 3D features were extracted manually. In the sequence of Figure 3, three rigid points were used for the pose estimation: the two outer eye corners and the middle of the nostrils. Several non-rigid points, the tips of the eyebrows and mouth corners, were also tracked successfully. The images in Figure 4 show the results of wireframe tracking. The 3D wireframe at the computed pose is overlayed over the original image. We also tested our system with synthetic video sequences to determine its accuracy. Figures 5-7 show a comparison between the measured pose parameters and the ground truth for the angles about the X, Y, and Z axes. Each figure shows plots for the optimal, Kalman filtered, and ground truth values. Many more results can be obtained from the authors as MPEG video clips, or check ftp://ftp.ifp.uiuc.edu/rlopez/00INDEX.
6. Conclusions and Future Work
In this paper we have presented a robust and novel approach to real-time feature tracking using a 3D model-based framework. A small set of facial features were tracked successfully over a large range of head motion. Global rigid motion was recovered in the form of three angles and a scale factor. Nonrigid points were also tracked successfully after compensating for the computed global motion. The combination of the 3D model, head pose estimation, and texture mapping avoided the error accumulation problem and allowed better localization of the features. Future work includes more detailed modeling of the head to aid in the analysis of local facial motion such as expressions, gestures, and deformations (wrinkles, etc.). Also, constraints on the movement of nonrigid model points need to be developed to allow for the estimation of actual 3D motion vectors from 2D trajectories.
References
[1] I. K. Sethi and R. Jain, Finding Trajectories of Feature Points in a Monocular Image Sequence, PAMI, Jan 1987.
Fig. 3. Feature Tracking Results: 10 points tracked (3 rigid points used for the pose estimation + 7 non-rigid points). Selected frames are shown.
[2] D. Huttenlocher, J. Noh, W. Rucklidge, Tracking Non-rigid Objects in Complex Scenes, ICCV 1993.
[3] A Framework for Real-Time Window-Based Tracking Using Off-The-Shelf Hardware.
[4] Y. Yao and R. Chellappa, Dynamic Feature Point Tracking in an Image Sequence, IEEE Int. Conf. Pattern Recognition, Oct 1994.
[5] F. Leymarie and M. D. Levine, Tracking Deformable Objects in the Plane Using an Active Contour Model, PAMI, Jun 1993.
[6] C. Kervrann and F. Heitz, Robust Tracking of Stochastic Deformable Models in Long Image Sequences, IEEE Int. Conf. Mach. Intel., Jun 1993.
Fig. 4. Wireframe Tracking Results
Fig. 5. Synthetic Sequence: Optimal, Filtered and Actual Angles (X axis)
Fig. 6. Synthetic Sequence: Optimal, Filtered and Actual Angles (Y axis)
[7] F. G. Meyer and P. Bouthemy, Region-Based Tracking Using Affine Motion Models in Long Image Sequences, CVGIP: Image Understanding, Sep 1994.
[8] K. Aizawa and T. S. Huang, Model-Based Image Coding: Advanced Video Coding Techniques for Very Low Bit-Rate Applications, Proceedings of the IEEE, Vol. 83, Feb. 1995.
Fig. 7. Synthetic Sequence: Optimal, Filtered and Actual Angles (Z axis)
[9] Y. Altunbasak, A. M. Tekalp, and G. Bozdagi, Two-Dimensional Object-Based Coding Using a Content-Based Mesh and Affine Motion Parameterization.
[10] C. Toklu, A. Erdem, M. Sezan, and A. Tekalp, 2-D Mesh Tracking for Synthetic Transfiguration.
[11] Y. Wang and O. Lee, Active Mesh: A Feature Seeking and Tracking Image Sequence Representation Scheme, IEEE Trans. Image Processing, Sep 1994.
[12] I. A. Essa and A. Pentland, A Vision System for Observing and Extracting Facial Action Parameters, CVPR 1994.
[13] D. Stork and M. Hennecke, Speechreading: An Overview of Image Processing, Feature Extraction, Sensory Integration and Pattern Recognition Techniques, Int. Conf. Automatic Face and Gesture Recognition 1996.
[14] A. Colmenarez and T. S. Huang, Maximum Likelihood Face Detection, Int. Conf. Automatic Face and Gesture Recognition 1996.
[15] Ricardo Lopez and Thomas Huang. Head pose computation for very low bitrate video coding. In Vaclav Hlavac and Radim Sara, editors, Computer Analysis of Images and Patterns, pages 440-447, Prague, Czech Republic, September 1995. Springer.
[16] Ricardo Lopez and Thomas Huang. 3D head pose computation from 2D images: templates versus features. In IEEE International Conference on Image Processing, pages 220-224, Washington DC, USA, October 1995. IEEE Press.
[17] Shimon Ullman and D. P. Huttenlocher. Recognizing solid objects by alignment with an image. International Journal of Computer Vision, 5(2):195-212, 1990.
A New Arbitrary Shape DCT for Object-Based Image Coding
Masayuki Tanimoto and Mario Sato
Department of Information Electronics, Faculty of Engineering, Nagoya University, Furo-cho, Chikusa-ku, Nagoya, 464-01, Japan
A new DCT-based scheme is proposed for object-based image coding. It expresses the pixels in arbitrarily shaped regions as a linear combination of a subset of the DCT basis functions. The number of basis functions used is the same as the number of pixels to be encoded. Computer simulation shows that the proposed scheme is superior to Region Filling DCT and to conventional block-based DCT.
1. INTRODUCTION
Transform coding of arbitrarily shaped regions is one of the key issues of object-based image coding. Such schemes are divided into two categories. The first category uses basis functions of an orthogonal transform that depend on the shape of the region; Arbitrary Shape KLT (AS-KLT) [1] and Arbitrary Shape DCT (AS-DCT) [2][3] fall into this category. The second category uses the basis functions of a conventional block-based orthogonal transform such as the DCT; Region Filling DCT (RF-DCT) [4], Shape Adaptive DCT (SA-DCT) [5] and Region Support DCT (RS-DCT) [6] fall into this category. The schemes of the second category are discussed here. RF-DCT makes rectangular blocks by filling the outside of the region and applies the conventional two-dimensional DCT to the blocks, so the number of samples to be coded is increased. In SA-DCT, a one-dimensional DCT is applied to each horizontal line, the coefficients are shifted to the low-frequency side, and a one-dimensional DCT is then applied to the vertical lines; in this scheme, the shift of the coefficients decreases the vertical correlation. In RS-DCT, the pixels in the region are expressed approximately using a limited number of DCT basis functions. This paper proposes a new arbitrary shape transform coding scheme of the second category. Pixels in the arbitrarily shaped region are expressed exactly using the same number of basis functions as there are pixels. The coding performance of the proposed scheme is examined.
2. PROPOSED SCHEME
2.1. Outline
Figure 1 shows the outline of the proposed scheme. b1(i,j) denotes the values of the pixels in the region to be coded. First, the outside of the region is filled with 0 to make a rectangular block. Then, a block-based orthogonal transform is applied to the block. Components x1(k,l), whose number is the same as the number of pixels in the region, are selected. The selected components x1(k,l) are converted to y1(k,l), and the components x2(k,l) which are not selected are substituted by 0. Thus, a new rectangular block is made in the transformed domain. The conversion of x1(k,l) to y1(k,l) is determined so that the inverse orthogonal transform of this new rectangular block reconstructs the
Figure 1. Outline of the proposed scheme: orthogonal transform (OT) of the zero-filled block, selection of the components x1(k,l) and their conversion to y1(k,l) (the redundant components are set to 0), and inverse orthogonal transform (IOT) yielding the decoded pixels b1(i,j).
values of the pixels b1(i,j). b2(i,j), the values of the pixels outside the region, are not used. In the proposed scheme, the number of samples is not increased. The pixels b1(i,j) in the region are reconstructed perfectly if the coefficients y1(k,l) are not quantized.
2.2. Derivation of the conversion equation
The equation for the conversion of x1(k,l) to y1(k,l) is derived. As shown in Figure 2, S1 denotes the region to be coded in the pixel domain and S'1 denotes the region of components to be coded in the transformed domain. s1(i,j) is a sampling function taking the region S1, expressed as

s1(i,j) = 1 if (i,j) ∈ S1, 0 else.    (1)
Figure 2. Sampled region in each domain: (a) pixel domain, region S1 of pixels to be coded; (b) transformed domain, region S'1 of components to be coded.
Defining φ_kl(i,j) as the basis functions of the orthogonal transform,

(s1(i,j) b1(i,j), φ_kl(i,j)) = x1(k,l)    (2)

holds for (k,l) ∈ S'1. From the inverse transform,

s1(i,j) b̂1(i,j) = s1(i,j) Σ_{(k',l') ∈ S'1} y1(k',l') φ_k'l'(i,j) = Σ_{(k',l') ∈ S'1} y1(k',l') {s1(i,j) φ_k'l'(i,j)}    (3)

holds. Substituting (3) into (2), a set of equations

Σ_{(k',l') ∈ S'1} (s1(i,j) φ_kl(i,j), s1(i,j) φ_k'l'(i,j)) · y1(k',l') = x1(k,l)    (4)

is obtained. Solving (4), y1(k,l) is obtained from x1(k,l) in case the determinant given by

det = |(s1(i,j) φ_kl(i,j), s1(i,j) φ_k'l'(i,j))|    (5)

is not equal to 0.
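To make the conversion concrete, the following numpy sketch builds the masked 8 x 8 DCT basis functions, forms the Gram system of eq. (4) and solves it for y1; the region mask and the list of selected components are taken as given, and all names are illustrative.

```python
import numpy as np

N = 8  # block size used in the simulations

def dct_basis(k, l):
    """2-D DCT-II basis function phi_kl(i,j) on an N x N block."""
    def a(u):
        return np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
    i = np.arange(N)
    ck = np.cos((2 * i + 1) * k * np.pi / (2 * N))
    cl = np.cos((2 * i + 1) * l * np.pi / (2 * N))
    return a(k) * a(l) * np.outer(ck, cl)

def convert_coefficients(block, s1, selected):
    """Given the pixel block, a boolean region mask s1 and a list
    `selected` of (k,l) components with len(selected) == s1.sum(),
    solve eq. (4) for y1. The inverse transform of y1 then reproduces
    b1 exactly on S1, provided the determinant (5) is nonzero."""
    masked = [dct_basis(k, l) * s1 for (k, l) in selected]   # s1 * phi_kl
    b = block.astype(float) * s1
    x1 = np.array([np.sum(b * m) for m in masked])           # eq. (2)
    G = np.array([[np.sum(p * q) for q in masked] for p in masked])
    y1 = np.linalg.solve(G, x1)                              # eq. (4)
    recon = sum(c * dct_basis(k, l) for c, (k, l) in zip(y1, selected))
    assert np.allclose(recon[s1], block[s1])                 # exact on S1
    return y1
```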
3. CODING SIMULATIONS
3.1. Conditions
Computer simulation of the proposed coding scheme is performed. Figure 3 shows the encoding and decoding process used here. Before the orthogonal transform is applied, the DC component in the region S1 is removed to decrease the discontinuity of pixel values at the boundary. Blocks including the boundary are chosen, and both sides of the boundary are coded in each block. The 8 x 8 DCT is used as the orthogonal transform. The bit allocation for the quantization of each coefficient in the transformed domain is determined by considering the power of the coefficients. Figure 4 shows the algorithm for selecting basis functions. The spatially localized basis functions s1(i,j)φ_kl(i,j) are not orthogonal, though orthogonal basis functions are desirable for high coding performance. This algorithm selects basis functions which are as nearly orthogonal as possible.
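A hedged sketch of such a selection loop, reusing dct_basis from the previous sketch, is given below; the normalized-correlation threshold and the precise adopt/abandon test are assumptions in the spirit of Figure 4, not the paper's exact criterion.

```python
import numpy as np

def select_basis(block, s1, n_needed, thresh=0.5):
    """Greedy selection: take the candidate with the largest transform
    coefficient, keep it only if the masked basis function is nearly
    orthogonal to those already chosen, and subtract the adopted
    component in the pixel domain."""
    residual = block.astype(float) * s1
    chosen, masked = [], []
    candidates = [(k, l) for k in range(8) for l in range(8)]
    while len(chosen) < n_needed and candidates:
        coeffs = [abs(np.sum(residual * dct_basis(k, l)))
                  for k, l in candidates]
        k, l = candidates.pop(int(np.argmax(coeffs)))
        p = dct_basis(k, l) * s1
        norm = np.linalg.norm(p)
        if norm == 0:
            continue
        ok = all(abs(np.sum(p * q)) / (norm * np.linalg.norm(q)) < thresh
                 for q in masked)
        if ok:                                   # adopt the basis function
            chosen.append((k, l))
            masked.append(p)
            residual -= np.sum(residual * p) / norm ** 2 * p
        # else: abandon the basis function and try the next candidate
    return chosen
```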
3.2. Results
Examples of selected basis functions are shown in Figure 5. Gray boxes in the upper blocks denote the pixels to be coded; gray boxes in the lower blocks denote the selected basis functions. The selected basis functions are distributed even though the pixels to be coded are localized. Figure 6 shows the dependence of the SN ratio of the reconstructed picture on entropy. The entropy of the proposed scheme and of RF-DCT includes the information on the selected basis functions, but does not include the shape information. As seen from this figure, the SN ratio of the proposed scheme is about 3 dB higher than that of RF-DCT at the same entropy.
Figure 3. Encoding and decoding process. Encoder (pixel domain to transformed domain): removal of the DC component, orthogonal transform, conversion and selection of components, quantization. Decoder: selection of the region, inverse orthogonal transform, addition of the DC component.
To achieve the same SN ratio, the proposed scheme needs about 1 bit/pixel less entropy than the conventional DCT. The shape information by chain coding is estimated to be about 0.3 bits/pixel. Therefore, the proposed scheme is superior to the conventional DCT even if the shape information is included.
4. CONCLUSION
We proposed an image coding scheme which expresses an arbitrarily shaped region using block-based orthogonal basis functions. High coding performance is obtained by proper selection of the basis functions, since the scheme uses the same number of basis functions as there are pixels to be coded and the selected basis functions are nearly orthogonal.
ACKNOWLEDGEMENT
The work reported in this paper was supported in part by the Grant-in-Aid for General Scientific Research (B), No. 07455159, 1995-1996, the Ministry of Education, Science and Culture.
Figure 4. Algorithm for selecting basis functions: the basis function with the largest coefficient is selected as a candidate and either adopted or abandoned; when adopted, its component is subtracted in the pixel domain.
Figure 5. Examples of pixels to be coded and selected basis functions.
Figure 6. Dependence of SN ratio on entropy in the various schemes (SN ratio [dB] versus entropy [bits/pixel]; curves for the proposed scheme and the conventional DCT).
REFERENCES
[1] I. Matsuda, S. Itoh, T. Utsunomiya, "Adaptive KL-Transform Image Coding Based on Variable Block Shapes", The Transactions of the Institute of Electronics, Information and Communication Engineers, B-I, Vol. J76-B-I, No. 5, pp. 399-408, May 1993.
[2] M. Gilge, T. Engelhardt, and R. Mehlan, "Coding of Arbitrarily Shaped Image Segments Based on a Generalized Orthogonal Transform", Signal Processing: Image Communication, 1, pp. 153-180, Oct. 1989.
[3] Y. Kato, "A Study of Orthogonal Transform for Arbitrary Shape Image Segment", 1990 Autumn National Convention Record, The Institute of Electronics, Information and Communication Engineers, D-326, p. 328, Oct. 1990.
[4] N. Ito, H. Katata, H. Kusao, "A Method of DCT for Arbitrary Shape", Proceedings of the 10th Picture Coding Symposium of Japan, 5-2, pp. 77-78, Oct. 1995.
[5] T. Sikora and B. Makai, "Shape-Adaptive DCT for Generic Coding of Video", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, No. 1, pp. 59-62, Feb. 1995.
[6] Y. Shishikui, "A Study on Arbitrary Shape Coding using Region Support DCT", Technical Report of IEICE, CS95-157, pp. 61-66, Dec. 1995.
Picture coding using splines
M. Buscemi, R. Fenu, D. D. Giusto, and G. Liggi
Dept. of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, Cagliari 09123, Italy
email: [email protected]
A new interpolation algorithm for 2D data is presented that is based on least-squares minimization and splines. This interpolation technique is then integrated into a two-source decomposition scheme for image data compression. The major advantages of this new method over traditional block-coding techniques are the absence of the tiling effect and a more effective exploitation of interblock correlation.
1. INTRODUCTION
Although 1-D spline functions have important applications in many areas of scientific and engineering research, they are frequently insufficient to meet the needs of recent technological developments. For instance, to represent a mathematical model described by several parameters and to interpret higher-dimensional data, functions of two or more variables are often required. Among these, the family of piecewise polynomial functions in more than one variable, usually called multivariate splines, is the most useful in many applications. Moreover, the subject of multivariate splines [1] has become a rapidly growing field of mathematical research. Using spline methods it is possible to interpolate discrete points: the major advantages of these algorithms for image data compression lie in the possibility of reducing the tiling effect typical of simple block coders and in the gradualness of changes between adjacent pixels. Before expounding the content of this paper, it is important to underline that very often, for natural images, the hypothesis of sufficiently high correlation and a 1st-order Markov process does not hold; a higher approximation fidelity can therefore be achieved by optimising the interpolation algorithm using parameters such as variance, covariance, standard deviation, and so forth.
2. THE CODING STRATEGY
In order to interpolate images, a new strategy has been adopted, based on two-source decomposition [2]. The first source is related to the low spatial frequencies, while the second one extracts the residual information, which refers to the high spatial frequencies. Two different techniques have been developed, one for each component: spline
interpolation and entropic analysis, respectively. The two sources are processed independently, but in the decoding phase the reconstructed image is obtained by their recombination through a specific algorithm.
2.1. Image data interpolation
The interpolation of a source requires the definition of a model, which can be either deterministic or stochastic, depending on whether it uses random variables or not. It is reasonable to interpret several neighbouring points of an image as belonging to a polynomial function. Using 3rd-degree functions of two variables, the generic pixel value z = f(x,y) can be interpolated by
(1)
When the number of points is greater than the number of polynomial coefficients, one strategy of minimization of error is unavoidable: if the MSE is chosen as indicator-parameter, the resulting function could not pass exactly in the interpolation points, but minimizes the MSE [3]. This objective is achieved by solving a linear system expressible in the matrix form, where [A]is a 10x 10 symmetrical matrix; c is the vector of the polynomial coefficients (the unknows); t is the vector of the known terms and depends on the image pixels. This tecnique can be successfully adopted until pixel values are not abruptly varying and pixel number is not exaggeratedly major than the degree-freedom of the polynomial functions. These problems define the application limits.
2.2. Least square 2D spline interpolation Spline function and least-square block interpolation present, both, a contrastanting behavour: the first one links excellently adjacent blocks but induces an evident effect of smoothing with loss of settlement; the last one approaches in a correct way but only locally, giving rise to tiling effects. Appreciable results can be attained by combining these two methods for the whole image or, better, for the low frequency component. The initial strategy (in a second moment it has been abandoned) was based on total links about adjacent blocks for continuity, 1st and 2nd derivatives for each directions (x,y). Calling c 1 ' c0.,2 c 3. the coefficients of three adjacent blocks and x f and y f their extreme position
geometrical values, a set of 18 equations can be written:
' f3 + c6x ' f + 4xi clOX
+cl .
0;.
.
4x} +clxl +cl 4 o .
...
1 f3 +c5y 1 f2+ c l y f + c ~ _ c 3 1 =0 ; c9Y
(2)
c7Y 1 f2+ c l y f + c { - c 3 = 0
Applying totally the MSE technique, a linear system results but there are superabundant equations, so a QR factorization (q orthogonal, R triangular) must be obtained for getting the solution (among the infinite possible solutions) which minimizes
II[A]c-tl12,where [A] is a
In x n] matrix, and c and t are [n x 1] vectors. Although this method could be highly correct, the results are not excessively good because the innumerable constrins on the edges produce insufficient approximation in the central part of each block. So a second strategy has been
developed, inverting the order of application of the spline approximation and the SE minimization. A local SE minimization is applied to the first block of the image and, only after this operation, the new pixel values (derived from the polynomial function) come into play for the following block. The process is iterative, and the links are guaranteed by the SE minimization method, which examines both the pixels of the new block and (this is the real innovation) the last adjacent pixels previously calculated. This is equivalent to writing a linear system [4] for each block, [A]c = t, where [A] is a symmetric matrix of the form

[A] = | (n+1)^2   ΣΣ y      ΣΣ x      ... |
      | ΣΣ y      ΣΣ y^2    ΣΣ xy     ... |
      | ΣΣ x      ΣΣ xy     ΣΣ x^2    ... |
      | ...       ...       ...       ... |

and [t] is a vector (I is the pixel value) of the form

t = [ ΣΣ I   ΣΣ I y   ΣΣ I x   ... ]^T,

but now both [A] and [t] have new limits for their summations. They are extended to contain a convenient number of external pixels,

Σ_{0}^{N+α} Σ_{0}^{N+β},

with α, β estimated according to the position held in the image. The sequence of analysis of the blocks of the whole image is illustrated in Figure 1, which points out the overlap between the marginal parts of adjacent blocks.
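The sketch below, reusing fit_bicubic from the previous sketch, illustrates the sequential strategy with the summations extended over previously reconstructed overlap pixels; the block size and overlap width (the α, β of the text) are assumed values.

```python
import numpy as np

def encode_blocks(img, n=8, overlap=2):
    """Fit each n x n block by least squares, extending the fit region
    by `overlap` previously reconstructed pixels on the left/top so
    adjacent blocks stay linked."""
    h, w = img.shape
    recon = np.zeros_like(img, dtype=float)
    for by in range(0, h - n + 1, n):
        for bx in range(0, w - n + 1, n):
            y0, x0 = max(by - overlap, 0), max(bx - overlap, 0)
            ys, xs = np.mgrid[y0:by + n, x0:bx + n]
            # reconstructed values in the overlap, originals inside
            z = np.where((ys < by) | (xs < bx),
                         recon[y0:by + n, x0:bx + n],
                         img[y0:by + n, x0:bx + n].astype(float))
            c = fit_bicubic(xs.ravel(), ys.ravel(), z.ravel())
            yy, xx = np.mgrid[by:by + n, bx:bx + n]
            recon[by:by + n, bx:bx + n] = (
                c[0] + c[1] * yy + c[2] * xx + c[3] * xx * yy
                + c[4] * yy ** 2 + c[5] * xx ** 2 + c[6] * yy ** 2 * xx
                + c[7] * yy * xx ** 2 + c[8] * yy ** 3 + c[9] * xx ** 3)
    return recon
```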
Figure 1. The sequence of analysis of the blocks in the whole image; the overlap between the marginal parts of adjacent blocks is shown.
This overlap allows the extraction of a partial reconstructed image with very little tiling effect, although the details are not perfect. The objective of the low-frequency image is the elaboration of a prototype, with good connections, which will then be improved by the residual image.
2.3. Residual coding
Through the analysis of the pixel entropy, two similar methods have been created for coding the residual component. The difference can be summarized as follows: the first one uses a specific quantization matrix, while the second one is based on a tree code (arithmetic coding [5]). The entropy-based procedures are then applied to the residual information. Since the correlation is high, the final image can reach a compression ratio higher than 40:1 (with 8 bpp for the original image).
3. RESULTS
The analysis of the reconstructed images shows that the Least-Squares Spline Interpolation method achieves PSNR values over 30 dB at a bitrate of about 0.20 bpp. This confirms the high fidelity between the original image and the reconstructed one; in fact, the residual is sufficiently small that a good visual quality is observed. The most problematic zones, such as textures and edges, do not give rise to unpleasant effects at any value of bpp. The images in Figure 2 are, respectively: the original, the low-pass component, the correct residual, the coded residual and the image reconstructed at 0.20 bpp. In order to show the performance at different compression rates and observe the gradual improvements, a detail of the whole image is provided in Figure 3 at the following bitrates: 8 (the original), 0.20, 0.25, 0.30, 0.40, 0.50 bpp. After this graphic presentation of the results, it is interesting to compare the techniques we have developed with two other important methods: the JPEG standard [7] and interpolation with B-splines [8]. These last two techniques cover a narrower range of bitrates, because they are not able to achieve considerable compression rates owing to their operational limits. Our methods give the best results down to a bitrate of 0.25 bpp. In Figures 4 and 5, PSNRs for the various methods are reported at different values of bitrate.
REFERENCES
1. C. K. Chui. Multivariate Splines. CBMS 54, SIAM, 1988.
2. J. K. Yan and D. J. Sakrison. Encoding of images based on a two-component source model. IEEE Trans. Comm., COM-25(11), 1977, pp. 1315-1322.
3. S. Van Huffel and J. Vandewalle. The Total Least Squares Problem: Computational Aspects and Analysis. SIAM, Philadelphia, 1991.
4. D. Bini, M. Capovani, G. Lotti, and F. Romani. Complessità numerica. Boringhieri, Torino, 1981.
5. G. C. Langdon. An Introduction to Arithmetic Coding. IBM J. Res. Dev., vol. 28, March 1984, pp. 135-149.
6. A. M. Mood, F. Graybill, and D. C. Boes. Introduzione alla statistica. McGraw-Hill, 1988, pp. 481-500.
7. R. J. Clarke. Transform Coding of Images. Academic Press, London, 1985.
8. Carl de Boor. A Practical Guide to Splines. Applied Mathematical Sciences vol. 27, Springer-Verlag, 1978.
Figure 2. Results obtained by the proposed approach.
Figure 3. Details at different bitrates.
Figure 4. PSNR at different bitrates for JPEG, the proposed l.s. spline methods (entropy and arithmetic coding) and B-spline interpolation.
Figure 5. PSNR at different bitrates (continuation of the comparison).
A 10 kb/s Video Coding Technique Based on Spatial Transformation *
Stefano Bonifacio, Stefano Marsi and Giovanni L. Sicuranza
DEEI, University of Trieste, via A. Valerio, 10 - 34127 Trieste - Italy
This paper describes a method for coding QCIF image sequences at 10 kb/s and 10 frames/s. The proposed algorithm performs motion compensation between consecutive frames using a spatial transformation. It is divided into two steps: the first consists of motion estimation and compensation, the second includes the prediction-error processing and transmission. This technique helps to overcome the limits shown at very low bit-rates by traditional hybrid block-based techniques.
1. INTRODUCTION
Recent developments in video coding, aiming at video transmission over wireless communication and telephone channels, point out the inadequacy at very low bit rates (< 16 kb/s) of coding techniques based on block matching and the two-dimensional discrete cosine transform. In fact, the coarse quantisation of the DCT coefficients and the limits on the number of blocks which can be correctly updated cause serious degradations in the decoded images, introducing the so-called mosquito and blocking artifacts. One solution to these problems is to use a spatial transformation to realize motion compensation in the sequence. Considering the high temporal correlation between two consecutive frames in the sequence, the current frame can be predicted from the previous one using a spatial transformation and the knowledge of the motion in the scene. Of course, not all the possible movements in the sequence can be reconstructed with this technique, so this methodology must be integrated with a second step in which the reconstruction error is evaluated, compressed and transmitted to the decoder.
2. THE PROPOSED ALGORITHM
The main goal of this project has been the development of an algorithm for video sequence coding working at 10 kb/s for very low bit-rate applications like video-phone, tele-conference or tele-surveillance. Focusing on these applications, some hypotheses must be assumed; in particular, the sequence to be coded should contain only slow motion, no scene changes and no panning or zooming of the camera. The proposed algorithm is composed of two fundamental steps: in the first step a specific motion estimation and compensation technique is used to predict the current frame from the previous one, while in the second the prediction error is processed and transmitted.
* This work was partially supported by "MURST" and "ESPRIT-LTR 20229 Noblesse" projects.
2.1. Motion evaluation and compensation
The spatial transformation-based approach [1-3] achieves a compensation of the movements in the scene by warping the previous frame, i.e. the nth predicted frame Î_n(x,y) is synthesised from the decoded previous one I_{n-1}(x,y). The process can be written as

Î_n(x,y) = I_{n-1}(x', y')    (1)

x' = a_i1 x + a_i2 y + a_i3 ,    y' = a_i4 x + a_i5 y + a_i6    (2)
101
Figure 1. Evaluation of the no-motion areas: absolute difference between two consecutive frames (LEFT), threshold application (CENTER), median filter and evaluation of the no-motion vectors (RIGHT)
successive MVs is coded using a variable-length code, while sequences of null vectors are run-length coded. Since an irregular MVF involves high information rate, a particular method is studied to prune badly estimated MVs, mainly in still background zones of the image, where MVs should be set to zero. A binary image is composed by thresholding the absolute difference between two consecutive original frames, then a 3 x 3 median filtering eliminates the isolated points corresponding to local noise difference. A vector is set to zero if there are no associated filtered pixels in a region around the associated GPs. An example of this process is illustrated in fig 1. The resulting MVF is further smoothed by means of a vector median filter [5], which operates on each vector and the six adjacent ones. This filter works jointly on the vector components, so that the filter output is one of the original input vector (the quality of the decoded images is improved by the fact that we do not introduce new vectors). With this approach,the MVF becomes more uniform according to the natural motion of the sequence, and also the information rate, related to the transmission of the MVF, can be significantly reduced. The average amount of bits, required to code MVF after this processing, is about 400 bits per frame for the sequence "Miss America". An example of this process is shown in fig.2. 2.2. E r r o r p r o c e s s i n g a n d t r a n s m i s s i o n Considering that the main goal of this project is the development of a system able to code a video sequence at 10 kb/s and 10 frames/s, it is quite evident that the average number of bits that we can spend to code every frame is 1000. Looking at the conclusions of the previous paragraph we can consider that while about 400 bits must be used to transmit the MVF, a bit margin of about 600 bits can be used to transmit additional informations. The last important step of the method involves prediction error processing and transmission. Because of the inexact estimation of some vectors, and the difficulty to model particular movements and effects of the scene, like rotations of objects and uncovering of new areas in the picture, warping techniques does not always provide an adequate ira-
102
Figure 2. Motion Vector Field Processing: the MVF obtained after the Block-Matching (LEFT), after the evaluation of the no-motion areas (CENTER) and after the median vector filter (RIGHT)
age quality. An experimental comparison between several method for error elaboration (pixel-based, region based and block-based) points out that a block-based approach gives better results. Whatever method we choice, due to the low amount of bits remaining after MVF transmission, the task of coding prediction error is really hard. Using a block-based approach, the prediction error is evaluated inside 8 x 8 pixel blocks, placed in all the possible positions in the frame. The error related to the blocks with the highest mean absolute difference is DCT-transformed and quantised. The obtained coefficients, the quantisation step and the position of the blocks are transmitted. For each frame the number of coded blocks and transmitted coefficients depends on the remaining amount of bits after MVF transmission. On the receiver a post-processing is performed for all the blocks which have been improved by the transmission of the error. Typically the most important DCT coefficients of the error patches are located at low frequencies; in such a case, in the reconstruction phase while the low frequencies errors are recovered, the noise is concentrated in the high frequencies. Using a low-pass filter as post processing the advantage is twofold: the high frequencies noise is reduced and even if the reconstructed image becomes little smoother, a subjective evaluation of the image quality rewards this effort. 3. S I M U L A T I O N
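The block-ranking step described above can be sketched as follows; the quantiser and entropy coder are omitted, the kept low-frequency band is an assumption, and the 80 bits/block figure is only the paper's reported average.

```python
import numpy as np
from scipy.fft import dctn

def code_error_blocks(error, bit_budget, bits_per_block=80, bs=8):
    """Rank all 8x8 block positions by mean absolute prediction error
    and DCT-code as many of the worst ones as the leftover bit budget
    allows."""
    h, w = error.shape
    scores = [(np.abs(error[y:y + bs, x:x + bs]).mean(), y, x)
              for y in range(0, h - bs + 1) for x in range(0, w - bs + 1)]
    scores.sort(reverse=True)
    coded = []
    for _, y, x in scores[: bit_budget // bits_per_block]:
        coef = dctn(error[y:y + bs, x:x + bs], norm='ortho')
        coef[3:, :] = 0          # keep only low-frequency coefficients
        coef[:, 3:] = 0
        coded.append((y, x, coef))
    return coded
```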
3. SIMULATION RESULTS
The simulation was performed on three video-conference sequences in the QCIF format (picture size 176 x 144 pixels, luminance only), with a frame rate of 10 Hz: "Miss America", "Salesman" and "Akiyo". The grid is made up of 332 GPs, but only the 281 MVs of the internal GPs need to be coded and transmitted. In BM, the MVs are allowed to take integer values in the range ±7 pixels vertically and horizontally, and 7 x 7 pixel blocks were used in the matching search. In the error coding phase, thanks to the energy compaction property of the DCT, only low-frequency coefficients are scalar quantised and coded with a variable-length code. About 80 bits are required for error recovery in an 8 x 8 block. Moreover, the first frame is JPEG coded and transmitted.
Table 1. Minimum and mean PSNR obtained for the three test sequences

sequence name      min PSNR    mean PSNR
"Miss America"     30.97 dB    32.47 dB
"Akiyo"            29.50 dB    30.44 dB
"Salesman"         27.01 dB    30.04 dB
Figure 3. Last frame of the sequence: original (LEFT), reconstructed (RIGHT)
Despite the decreasing temporal correlation between the first and the current frame of the sequence, the warping and error recovery techniques achieve good results, particularly when the scene content does not produce sharp changes. In fact, the PSNR drops significantly only when the movements in the scene give rise to a complex or irregular MVF, so that the motion information increases and only a few error blocks can be transmitted. On the other hand, when the movements become less pronounced, the MVF can be coded using only a few bits and a larger bit margin can be applied to the transmission of the error. In this way, frame after frame, the quality of the decoded sequence significantly increases. In Fig. 3 the last frame of the decoded sequence is shown together with the original one; Fig. 4 depicts the plot of the PSNR for the sequence "Miss America" over 50 frames (5 seconds). Finally, Table 1 summarises the results obtained for the three test sequences.
4. CONCLUSIONS
In this paper an innovative method to code video sequences at very low bit-rates has been presented. The proposed algorithm uses a spatial transformation to predict the current frame from the previous one, plus the transmission of some error patches for the reconstruction of critical motion areas which are not well reconstructed through spatial transformations. For typical video-conference sequences, simulation results demonstrate that good quality can be reached for the reconstructed sequence over a 10 kb/s channel.
Figure 4. Diagram of the PSNR vs. frame number for the reconstructed sequence "Miss America"
REFERENCES
1. Y. Nakaya and H. Harashima, "Motion compensation based on spatial transformations", IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, no. 3, pp. 339-356, June 1994.
2. G. J. Sullivan and R. L. Baker, "Motion compensation for video compression using control grid interpolation", Proc. 1991 IEEE ICASSP, M9.1, pp. 2713-2716, Toronto, May 1991.
3. J. Nieweglowski, T. G. Campbell and P. Haavisto, "A novel video coding scheme based on temporal prediction using digital image warping", IEEE Transactions on Consumer Electronics, vol. 39, no. 3, pp. 141-150, Aug. 1993.
4. J. Jain and A. Jain, "Displacement measurement and its application in interframe image coding", IEEE Transactions on Communications, vol. COM-29, no. 12, pp. 1799-1808, Dec. 1981.
5. J. Astola, P. Haavisto and Y. Neuvo, "Vector median filters", Proceedings of the IEEE, vol. 78, no. 4, pp. 678-689, Apr. 1990.
Image Communications Projects in ACTS
Franco BIGI
Responsible for Service Engineering in Directorate B "Advanced Communications Technologies and Services" of the Directorate General XIII "Telecommunications, Information Market and Exploitation of Research" of the European Commission.
1. INTRODUCTION
On 27 July 1994, the European Council adopted, as part of the fourth framework programme of the European Union (1994-1998), a specific programme for research and technological development, including demonstrations, in the field of Advanced Communication Technologies and Services (ACTS). The ACTS programme is to be considered the third phase of the implementation of IBC (Integrated Broadband Communications). Considerable progress has been made since telecommunications operators and industry defined IBC as the common vision of the future communications infrastructure for Europe and as the objective for RTD collaboration in the RACE programme. But, right from the conceptual stage in 1984, three phases were distinguished on the route towards implementing IBC: Phase I, concentrating on system engineering, specifications and key technologies; Phase II, concentrating on integration and the prototyping of the new services and applications; and Phase III, beyond the RACE programme, consisting of user-oriented experimentation. With the ACTS programme, and with a considerable paradigm shift, the objectives, as stated in the Council Resolution, are "...to develop advanced communication systems and services for economic development and social cohesion in Europe, taking account of the rapid evolution in technologies, the changing regulatory situation and opportunities for development of advanced trans-European networks and services.
The aims are to support European policies for early deployment and effective use of advanced communications in consolidation of the internal market, and to enable European industry to compete effectively in global markets. The work will enable the re-balancing of public and private investments in communications, transport, energy use and environment protection, as well as experimentation in advanced service provision. In conjunction with the work in the specific programme on information technologies, it will provide a common technological basis for applications research and development in the specific programme on telematic systems and will prepare the ground for the development of a European market for information services..."
The ACTS programme has been implemented with two calls for proposals: the first call resulted in 115 projects, started in the second half of 1995 or at the beginning of 1996, for a total contribution of 440 million ECU; the second call resulted in the approval of around 45 new projects and in the reinforcement of some of the existing projects, for a total contribution of 110 million ECU. The new projects started, or are due to start, in the second half of 1996.
2. AREAS IN TECHNOLOGY DEVELOPMENT IN ACTS
The workplan of ACTS has been developed through an intensive preliminary consultation of all the sector actors, coming from all the main European universities, the manufacturing industry, the network operators and their research centres, the broadcasters, and the service and content providers. The projects have responded to the tasks described in the workplan, classified into the following specific domains:
Interactive digital multimedia services
Photonic technologies
High-speed networking
Mobility and personal communications
Intelligence in networks and service engineering
Quality, security and safety of communications systems
Horizontal actions
Image communication is dominant in the first area, but it is also present in all the others, because it is very difficult to imagine a high-speed network or an advanced mobile system not open to video signals. In the same way, a complex interactive digital multimedia service cannot operate in an unsafe and insecure system, without the appropriate quality, and without a good identification and an accurate description of the service components in close cooperation with equally complex network functions.
3. INTERACTIVE DIGITAL MULTIMEDIA SERVICES
In the next ten to fifteen years emerging interactive multimedia services will have a strong impact on social and cultural life. European standards and multilingual services are essential for social cohesion in Europe, and a strong multimedia sector will create new employment opportunities. By the year 2010 there will be many more TV programmes available to the customer, and many of these will be interactive in nature. The boundaries between computing, communications and broadcasting will have largely been eliminated, and user-friendly terminals with flat panel displays will provide access to a wide range of entertainment, communication and education services. Broadband connections to both workplace and home will eliminate the local-loop bottleneck. Service designers will be able to trade bandwidth against coding complexity to achieve the desired service features and usability. If residential and small-business information highways are to flourish, an open competitive market is vital. Vendor-independent standards are needed to enable the commercial separation of service provision from network provision. European and international standardisation bodies have shown the technical feasibility of, and the time-scales for, the introduction of the required standards. The MPEG standard defines a range of state-of-the-art image and sound coding and multiplexing techniques. The standard has been accepted by international standardisation bodies like ETSI as the basis for new television broadcasting services, and by DAVIC as the basis for multimedia services via telecom, CATV and satellite networks. The provision of multimedia services over advanced communication networks involves the development and integration of a number of key technologies. The projects addressing these technologies cover the following service elements and background knowledge:
1. Multimedia content manipulation and management
2. Presentation, interaction and storage
3. Transmission media
4. Interworking across different networks
5. Enabling commercial services
6. Trials
7. Transition scenarios
This structure has been chosen to express the general intention of reaching the needed convergence, rather than to reflect the subdomain structure of the 58 projects. For the majority of these projects (44) the work started in the second half of 1995; for 7 the commencement date is July 1996, and for the remaining 7 the start is planned for the second half of this year, for some of them subject to the completion of negotiations. The following list is in purely alphabetical order of the chosen acronyms:

AMPA - Advanced Multimedia Parallel Accelerator
AMUSE - Advanced MUltimedia SErvices for residential users
ATLANTIC - Advanced Television at Low bit rates And Networked Transmission over Integrated Communication system
ATMAN - Digital Audio Visual Work Trading by ATM
AURORA - Automatic Restoration of ORiginal film and video Archives
BIDS - Broadband Infrastructure for Digital television and multimedia Services
BURBON - Broadband Urban Rural Based Open Networking
CABSINET - Cellular Access to Broadband Services and Interactive Television
CATVDC - Concerted Actions in Support of the CATV Delivery Chain
CICC - Collaborative Integrated Communications for Construction
CINENET - CINEma, films and live events via satellite, cable and Atm NETworks
CODIS - Clip On Demand Interactive Services
COVEN - Collaborative Virtual ENvironments
CRABS - Cellular Radio Access for Broadband Services
DAM - Davic Accompanying Measures
DIANE - Design Implementation and operation of a distributed Annotation Environment
DIGISAT - Advanced Digital Satellite Broadcasting and Interactive Services
DVBIRD - Digital Video Broadcasting Integrated Receiver Decoder
DVP - Distributed Video Production
EMERALD - European MM Services for Medical Imaging
EMPHASIS - Architectures Software and Hardware for MPEG4 Systems
EURORIM - EUROpean Research and consensus on Interactive Multimedia
GAIA - Generic Architecture for Information Availability
IBCoBN - Integrated Broadband Communications on Broadcast Networks
IMMP - Integrated MultiMedia Project
INTERACT - INTERACTive Television and multimedia return channel service trials
ISIS - Interactive Satellite Information Systems
KIMSAC - Kiosk based Integrated Multimedia Service Access for Citizens
LEVERAGE - Learn from Video Extensive Real ATM Gigabit Experiments
M2VTS - Multimodal Verification for Teleservices and Security Applications
MAESTRO - MAintEnance System based on Telepresence for Remote Operators
MEMO - Multimedia Environment for MObiles
MIDSTEP - Multimedia Interactive DemonStrator TElePresence
MIRAGE - Manipulation of Images in Real time for the creation of Artificially Generated Environments
MOMUSYS - Mobile Multimedia SYStems
MUSICIAN - MUltimedia Services Integration Chain In Advanced Networks
MUSIST - Multimedia User Interfaces for Interactive Services and TV
OCTALIS - Offer of Contents through Trusted Access LinkS
OKAPI - Open Kernel for Access to Protected Interoperable interactive services
OPARISOD - OPen ARchitecture for Interactive Services On Demand
PANORAMA - PAckage for New OpeRational Autostereoscopic Multiview systems and Applications
QUO VADIS - Quality Of Video and Audio for Digital television Services
RESOLV - REconstruction using Scanned Laser and Video
SCALAR - Scalable Architectures with Hardware Extensions for Low Bitrate Variable Bandwidth Real-time Videocommunication
SETBIS - Set Top Box for Interactive Services on Demand
SICMA - Scalable Interactive Continuous Media Servers Design and Applications
SMASH - Storage for Multimedia Application Systems in the Home
SOMMIT - Software Open MultiMedia Interactive Terminal
SPECIAL - Service Provisioning Environment for Consumers Interactive Applications
TALISMAN - Tracing Authors rights by Labelling Image Services and Monitoring Access Network
TAPESTRIES - The Application of Psychological Evaluation to Systems and Technologies in Remote Imaging and Entertainment Services
TEAM - Team based European Automotive Manufacture
TELEBORG - Communications and Hardware Requirements for Telepresence Supporting Human Like Physical Presence in Real Remote Environments
TELESHOPPE - Telescoping services using virtual reality and interactive multimedia
VALIDATE - Verification And Launch of Integrated Digital Advanced Television in Europe
VANGUARD - Visualisation Across Networks based on Graphics and the Uncalibrated Acquisition of Real Data
VIDAS - Video assisted with audio coding and representation
VISEUM - Virtual Museum International
4.1. MULTIMEDIA CONTENT MANIPULATION AND MANAGEMENT

The scope of this area is to develop audio-visual signal representation, manipulation and management tools able to support the creation of advanced digital multimedia services. The activities concern software tools and hardware components for generic applications in 3D telepresence, mobile multimedia communications, programme production and distributed virtual environments. New image-related standardisation activities in MPEG4 herald a revolution in the representation and processing of audio-visual signals. The scope goes beyond compression to include new functionality such as object identification and manipulation, hybrid scenes (computer graphics and real images) and 3D image representation. ACTS projects aim at different services and applications but are likely to share a large number of tools and algorithms. State-of-the-art audio-visual coding (e.g. H.324 for very low bit rate coding over the telephone network) will be enhanced thanks to the use of audio information combined with advanced image synthesis techniques (VIDAS). Technology providing new audio-visual functionality (tools, algorithms and a syntactic description language) for mobile multimedia systems (MOMUSYS) and videotelephony (SCALAR) will be developed and contributed to MPEG4, and the implementation aspects of this new standard will be pursued (EMPHASIS). Telepresence extends the video conferencing concept so that participants can use non-verbal aspects of communication (eye contact, spatial perception, body movement, gesture, facial expressions) in the same way as they would in a face-to-face meeting. Within ACTS, telepresence services and the underlying technology will be demonstrated in field trials by a number of projects in a variety of application scenarios. The technology involved is often a combination of relatively stable networking techniques and leading-edge applied research results such as 3D image acquisition, modelling of virtual reality, and high performance displays (RESOLV, VANGUARD, PANORAMA). A computational service for virtual presence in support of future co-operative teleworking systems will be developed and trialled (COVEN). The main focus will be on the added value of networked Virtual Reality involving varying degrees of multi-sensory presence. The benefits of these advanced technologies in the construction of buildings will also be shown (CICC). Following the ACTS second call, the MCM group will also integrate new projects on telepresence, including telemanipulation and augmented reality (TELEBORG, MAESTRO, MIDSTEP). Compared to existing multimedia services projects, these initiatives will go one step further towards telepresence: input and output devices will not only be audio-visual based but will also cover touch and physical feedback. The multidisciplinary nature of the work is in line with current global high-potential activities (DAVIC, MPEG4, ...). New telemanipulation projects are likely to impact on virtual reality, augmented reality and real-time registration techniques (CAD model on video, ...), video-based telepresence (higher flexibility, frame rate, view points, ...), hypermedia data management, broadband networking (QoS, traffic analysis for telepresence applications), teleoperation techniques (robotic arms, data glove, ...), advanced display techniques (3D, VR helmets, force-reflecting glove/arm display, ...), sensors, actuators, human/computer interaction tools and artificial intelligence.
Advanced image manipulation, telepresence and distributed operations also have the potential to transform the way TV programmes are produced and to cut costs. A testbed for
distributed production, post-production, rehearsal, archiving, indexing and retrieval, forming a distributed virtual studio over ATM, will be developed by DVP. Television virtual production, involving various aspects of the creation and display of Virtual Reality and the manipulation of virtual environments, will be the focus of MIRAGE. A parallel architecture for computation-intensive multimedia operations, such as those needed in advanced programme production, will be developed in AMPA, and content will be deployed over broadband transatlantic links (VISEUM). Advances in signal processing will also be applied to automatic artefact removal for the restoration of old films (AURORA) and to the analysis of video sequences for access control (M2VTS). The next step in MPEG2 coding will be the implementation of a full MPEG2 chain starting in the studio and involving the switching of compressed streams (ATLANTIC). Finally, understanding the key psychological factors contributing to customer acceptance of new entertainment and information services will be the aim of TAPESTRIES, and will include subjective evaluation of virtual reality, MPEG4 and 3D images (this will also be offered as a service to other projects). MPEG4 is, and will continue to be, the glue between most of the MCM projects. It is the forum bringing together major actors and technologies from the computer, television and telecom areas. ACTS projects have already achieved an outstanding degree of participation, contribution and leadership within this group. This is especially the case in the Video-VM, MSDL, Implementation and Future workgroups, all of which are led by ACTS project members. This is also true for the MPEG4 audio group, in which ACTS is equally well represented. As part of the Integration group, the relatively new SNHC ad hoc group intends to standardise tools for the representation and coding of real and synthetic audio-visual scenes in support of the creation, manipulation and efficient management of increasingly hybrid content. ACTS projects are stepping up their involvement in this group as well. VIDAS is already an active member, and contributions will be reinforced through new project extensions such as COVEN bringing in computer graphics expertise and MOMUSYS extending its activities towards higher bitrates in a hybrid framework. A deeper involvement in SNHC is also expected from the projects DVP, MIRAGE (distributed virtual studio, VR), VANGUARD, RESOLV (3D analysis and modelling), CICC, MAESTRO (augmented reality) and PANORAMA (3D handling and display), ensuring a significant impact for European technology in this key area.

4.2. PRESENTATION, INTERACTION AND STORAGE

The advent of interactive multimedia services points to the need for intelligent service interfaces capable of assisting the user to understand the extent of the many services available, to locate relevant material, access services, contact other parties and retrieve, store and manipulate information. This entails the development of high capacity storage units and terminals equipped with advanced navigation tools. R&D activities in this area concern set top units, servers and storage units. Innovative navigation and dialogue tools for interactive multimedia services, integrating agent technology and exploiting new standards such as MHEG or MPEG4, will be demonstrated. Brokerage services will be developed. The aim of the SICMA project is to design a cost-effective, scalable server for the delivery of images, data and continuous multimedia information.
Its efficiency will be demonstrated by a
"virtual museum" application, including links to an art museum in Russia. The server will fully implement the DAVIC standard. Set top units are the focus of both the SETBIS and SOMMIT projects. SETBIS is constructing a set top prototype, with special emphasis on low cost hardware, for acceptance testing of interactive services. The set top box will be evaluated in a mini-trial with real users, using the infrastructure of the CATV German National Host in Baden-Württemberg. The overall goal of SOMMIT is to define an open interactive multimedia terminal architecture, independent of application and delivery media, and to achieve a full software implementation. SOMMIT builds on work carried out in the RACE project MARS, and will develop a toolkit of modules which can be reused in different terminal designs depending on the actual time, manufacturer and product. The project aims at compliance with DAVIC and ATM Forum standards. DVBIRD is developing an optimised and integrated chip set for digital terrestrial TV receivers, starting from the DVB-T (Digital Video Broadcasting - Terrestrial) specification and from work already carried out within the RACE project dTTb. DVBIRD will deliver a demonstrator to other ACTS projects to carry out field trials for validation purposes. It will also take into account the exploitation of commonalities with satellite and cable receiving chains, leading to the definition, design and fabrication of an optimised chip set to be used in a common receiver for satellite, cable and terrestrial broadcasting, compliant with DVB. SMASH is developing a storage system to satisfy both the demand for large storage and the need to access the stored information in a fast and interactive way. Studies will be done on how to interconnect magnetic tape and disk systems in an optimal way, so that they behave as one integrated system for the user. An adverse consequence of the information highway is the possibility of getting lost and being unable to locate the desired information. Electronic programme guides, brokerage services, mobile intelligent agents and user interfaces are the subject of study for the projects MUSIST, KIMSAC and DIANE.

4.3. TRANSMISSION MEDIA
Instead of being integrated on one single medium, the information society will utilise all available transport media in order to create user-friendly and economically viable services. This is reflected in the diversity of the projects: fibre, coax, twisted pair, satellite and terrestrial transmission are all covered. A major effort is to create a return channel for interactive services on the traditional media for distribution services, or to create broadband capability on the traditionally narrowband switched services. Many direct-to-home satellite and terrestrial broadcasters, CATV operators and telecoms companies are looking to migrate their residential networks towards bi-directional broadband capability, in a race to pioneer the new interactive multimedia markets. VALIDATE will co-ordinate the tests needed to achieve sufficient confidence in the DVB-T specification for it to be adopted by ETSI. The laboratory tests and field trials carried out by VALIDATE will also produce information needed for the planning of Digital Terrestrial
Television Broadcasting (DTTB) services. INTERACT is studying and implementing return channels for use in terrestrial and cable DVB systems, and will contribute to the debate within Europe (via the DVB and EBU) and in worldwide fora (via the ITU and DAVIC) regarding the return channel specification. MEMO will trial the data transfer capability of DAB (up to 1.7 Mbit/s, with a GSM return channel). The DTH (Direct To Home) satellite project ISIS aims to provide tele-education, on-line catalogues, newspaper distribution, and access to the Internet and to LANs in major offices for those working from home. It aims to incorporate a low speed return channel at an increase in cost of around 20% compared to the DTH receive-only terminal. Another satellite project (DIGISAT) will develop the technology for interactive services via SMATV (Satellite Master Antenna TeleVision) networks, based on the studies started in the RACE project DIGISMATV. A key target of the project is to develop a return channel via satellite using VSAT technology. In the downstream-only direction, CINENET aims to distribute cinema-quality digital movies, sports, live events and business presentations to conference halls, cultural centres, hotels, etc. over satellite, cable and ATM networks. It will build on the experience gained in the RACE project HDSAT. IBCoBN will concentrate on upgrading CATV networks to support symmetrical broadband applications such as videotelephony and videoconferencing, and on publicising the potential of these services. IBCoBN also plans to lay the groundwork for establishing a common research facility for independent cable operators in Europe. CATV operators can carry two-way broadband communications over existing coaxial cable networks, using Hybrid Fibre Coax (HFC) topologies upgraded with a return channel. In HFC, optical fibre runs between the head end and a street cabinet, with the existing coaxial cable covering the last kilometre or two to the home. The return channel uses bus techniques to share bandwidth on the coaxial cable, which is laid in a tree topology. Many ACTS projects see a switched ATM overlay on an HFC network as necessary, for efficient use of bandwidth and so that the subscriber can establish connections on demand to servers in different locations. Telecom operators can also carry two-way broadband communications over the existing telephony twisted pair local loop, by deploying ADSL (Asymmetrical Digital Subscriber Loop) electronics in the home and at the local exchange. Alternatively, they can use an FTTC (Fibre To The Curb) network with a passive fibre extending from the local exchange in a tree topology to termination points in street cabinets, from which individual twisted pair or coax drops are provided in a star topology to the home. Typical bandwidths are 6 Mbit/s downstream and several hundred kbit/s upstream. The trend is to bring fibre nearer and nearer to the home, but FTTH (Fibre To The Home) is generally too expensive today, because of the cost of the optical-to-electrical conversion in the home. AMUSE will demonstrate the viability of using ATM end-to-end from server to TV set top units, and also to PCs, over ADSL, FTTC and FTTH networks. MPEG2 will be transported over ATM variable bit rate, and possibly available bit rate, services for maximum efficiency. The IMMP project is also using end-to-end ATM. IMMP foresees that in the longer term business services to the home and very small enterprises will be of major importance; it is therefore
developing a platform that will support business applications as well as entertainment. The focus of OPARISOD is the development of an interactive-services-on-demand Media Switching Centre which receives, via ATM-SDH interfaces, a number of information streams from the Media Servers. These streams are switched, then mapped into DAVIC transport frames for transmission through the access network to the users. A successful introduction of advanced digital services relies upon the management of the quality of service; the QUO VADIS project is looking at mechanisms for controlling and supervising the performance of digital TV distribution across gathering, transport and broadcast networks.

4.4. INTERWORKING ACROSS DIFFERENT NETWORKS

Access to servers in many different locations, and peer-to-peer broadband communication, requires that the various types of access networks be interconnected via core telecom networks. The existing core network infrastructure is mainly optical fibre, and over this many operators are now deploying a layer of ATM switches. This ATM switching layer provides instant switched bandwidth on demand, carrying any mix of voice, data and video traffic, with the economy of sharing physical links through statistical multiplexing.
Interconnection between participating countries, and between projects, is a key objective of ACTS. It is planned that this interconnection will be via a pan-European research network. Access in a given country will be provided by the National Host(s) in that country. NHs offer access protocols such as ATM and SMDS (Switched Multimegabit Data Service), and were set up to support the national communication needs of ACTS and other EU R&D programmes (a country may have more than one NH, based on different types of network). Several ACTS projects aim to demonstrate open DAVIC- and/or DVB-compliant interfaces. The ACTS project DAM will test DAVIC profiles over wired access networks, linking servers with access networks between Finland, France, Germany and Italy via an ATM network. DAM is also disseminating information about the DAVIC specifications to other ACTS projects and to European companies, and will co-ordinate input towards DAVIC.

4.5. ENABLING COMMERCIAL SERVICES

The development of technologies for the digital transmission of multimedia information may be impressive, but these technologies will not be deployed unless they can be exploited in commercially viable services. Besides the higher bandwidth efficiency, the fact that digital information streams can readily be encrypted is an important driver towards the digitalisation of existing services. The rapid multiplication of the number of television programmes and the limited financial resources make it necessary to provide the content only to customers who have paid for the access right, and to protect the multimedia material against illegal copying. The ACTS projects that study these systems are, respectively, OKAPI and TALISMAN. Customer care, tariffing and billing concepts for interactive multimedia services will be evaluated by SPECIAL. The introduction of a multimedia broadband service network will give the opportunity to provide a much more efficient and cost-effective mechanism to connect consumers with providers, with the introduction of a suitable set of standard brokerage interfaces. Electronic
brokering services range from meta-catalogues compiling information from a large number of different providers to intelligent agents that visit the provider on behalf of the user. Brokerage for broadband services is part of the work within DAVIC, and the appropriate references to standards will be part of future releases of the DAVIC specifications. GAIA is addressing the development of open architectures for brokerage systems. It will develop a generic scalable architecture for information availability, using already established heterogeneous media, networks and protocols as its infrastructure platform. ATMAN, CODIS and MUSICIAN will develop, integrate and test the full supply-handling-delivery chain for electronic brokerage systems trading digital audio-video material of different kinds: data from the Internet (WWW), AV material through the ATM network, AV material through SNG (Satellite News Gathering), and video clips and AV material through broadcasting.

4.6. TRIALS

Several projects aim to carry out trials of the whole multimedia service chain, from server to television set and PC. AMUSE will carry out trials on ADSL, FTTC and FTTH networks in Italy, and on HFC CATV networks in Belgium and Germany. IMMP will use the UK, Swedish and Finnish National Hosts to provide access to the pan-European ATM Pilot. OPARISOD will carry out a mini-trial, in which the users will be chosen from the 4000 subscribers connected to the Baden-Württemberg pilot system. SPECIAL plans to develop and implement a customer care concept for consumer services in the InfoCity trial in North Rhine-Westphalia (the InfoCity trial is a major initiative to provide an alternative infrastructure to that of the dominant operator, involving 10,000 users). BURBON will run a trial on cost-effective access to ATM-based advanced services for SMEs in 6 member states. TELESHOPPE will apply advanced virtual reality technologies in the creation and trialling of multilingual teleshopping services. TEAM will provide a framework for virtual integration of the entire automotive supply chain. LEVERAGE is focusing on the support of language learning using broadband video services; EMERALD will do the same in the medical environment.

4.7. TRANSITION SCENARIOS

The project BIDS will address the main options for a progressive and economically sound introduction of digital TV, HDTV and interactive TV services across Europe, including user requirements and regulatory considerations.
5. THE FUTURE

The ACTS programme is due to run until the end of 1998. Already some thought is being given to the direction that a future programme should take. Advanced telecommunications will remain important, both for the direct impact of telecommunications as a major industry in itself and, more importantly, for the services it can offer to the business community and to the individual. Multimedia, with special emphasis on the image, will remain the major building block of European R&D activity in telecommunications.
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.) © 1997 Elsevier Science B.V. All rights reserved.
Conveying multimedia services within the MPEG-2 Transport Stream

Luigi Atzori, Massimiliano Di Gregorio and Daniele Giusto

Department of Electrical and Electronic Engineering, University of Cagliari, piazza D'Armi, 09123 Cagliari, Italy

In this paper, we present a new system for the transmission of multimedia information within the digital TV channel. The proposed scheme is based on some DSM-CC functions, while the transport structure used is the MPEG-2 Transport Stream, the most widespread platform for new digital television systems. The multimedia information conveyed in such an application is structured as in the WWW system, where text is integrated with images, sounds and animation by means of the HTML language.

1. INTRODUCTION

It is a matter of fact that the MPEG-2 standard, defined by ISO/IEC 13818, is the most widespread platform for video and audio coding. The standard also provides the transport protocol for multiplexing different bit streams and conveying the audio-video synchronization signals [1]. The transport structure defined in the standard for digital television broadcasting is the Transport Stream. This structure allows different MPEG audio and video signals, as well as any other type of digital stream, referred to as private data, to be multiplexed into one bit stream. In point of fact, new elementary streams can be handled at the transport layer without hardware modifications, giving the opportunity to implement additional services inside the digital TV channel. A digital teletext version has already been standardized by ETSI, which gives the syntax for inserting the data inside the Transport Stream [2]. Furthermore, part six of ISO/IEC 13818 [3] describes a global server-network-client system, and the related bit stream syntax, for the implementation of additional services related to broadcast and interactive multimedia applications. This is the Digital Storage Media Command and Control (DSM-CC) specification. Our idea has been to carry multimedia information inside the MPEG-2 Transport Stream, making use of the DSM-CC features and introducing a service that is somewhat similar to the WWW information available on the Internet. In the last section of this paper, a simulation of the coding, transmission and decoding of this kind of data is described, based on two UNIX processes connected by a socket structure.
2. MULTIMEDIA INFORMATION SYSTEM BROADCASTING

A hypertextual system consists of a group of different documents linked to each other, forming a database in which the user can "jump" from one region to another. Most hypertextual systems allow the integration of documents with other types of data, such as graphs, images and animation, making up multimedia information [4].
For the purpose of our work, we refer to the WWW database structure, in which every web is made up of HTML files, images of different types and audio-video sequences. The main characteristic of the WWW database is that it branches out into several local hypertexts, which are separate open systems. This feature cannot be retained in a broadcast application, where only "closed webs" can be sent, that is, every link must point inside the web itself. The implementation of such a service requires sending the whole web file system, taking care to rebuild, on the client side, the same path for every file and to keep the same file names, so as to preserve the links. The final service offered is a periodic transmission of the file systems (data carousel), from which the user can download the webs he is interested in. In order to provide the user with sufficient information to make this selection, the data carousel has to include, inside the multimedia data, a directory mechanism that describes the information in the carousel and provides a procedure to group the information.

3. CODING LAYER STRUCTURE

The coding and transmission processes of the hypertextual system are made up of three layers. The first one has been introduced to convey specific information about every file, such as its name and path. This information is necessary for the correct rebuilding of the file system, so that every link is preserved. The second layer provides the functions necessary to set up the transmission and to carry the files to the client. For this purpose the DSM-CC download messages have been adopted, in order to obtain a stream that can be parsed by a generic DSM-CC decoder. This choice has introduced some unnecessary fields in the syntax but makes the application more portable. The third layer is the transport layer, that is, the well-known MPEG-2 Transport Stream.
3.1. File section

The file_section is the structure in which each file is encapsulated; it has been designed to associate basic file information with the file data. As Table 1 shows, a name_length field is conveyed to give the number of bytes used to code the file name in the following field. Then a character string is carried, coding the name in ASCII fashion. To give the right path to the file, a byte code is associated with every path inside the file system. This code is carried inside the path_code field, which shall be decoded through a table previously sent to the client in a DSM-CC message.
    Syntax                       N. of Bytes
    file_section() {
        name_length              1
        file_name                name_length
        path_code                1
        for(i=0; i<...; i++) {
            data_byte            1
        }
    }

Table 1. File_section syntax.
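As an illustration, a possible C routine for parsing a received file_section is sketched below; the structure type, ownership conventions and error handling are assumptions, not part of the paper.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char    *file_name;   /* NUL-terminated copy of the ASCII name     */
        uint8_t  path_code;   /* index into the previously sent path table */
        uint8_t *data;        /* file payload                              */
        size_t   data_len;
    } file_section_t;

    /* buf points to one complete file_section of len bytes, laid out as in
       Table 1: name_length (1), file_name (name_length), path_code (1),
       then the file data. Returns 0 on success, -1 on malformed input. */
    int parse_file_section(const uint8_t *buf, size_t len, file_section_t *out)
    {
        if (len < 2) return -1;
        uint8_t name_length = buf[0];
        if (len < (size_t)2 + name_length) return -1;

        out->file_name = malloc((size_t)name_length + 1);
        if (!out->file_name) return -1;
        memcpy(out->file_name, buf + 1, name_length);
        out->file_name[name_length] = '\0';

        out->path_code = buf[1 + name_length];

        out->data_len = len - 2 - name_length;
        out->data = malloc(out->data_len ? out->data_len : 1);
        if (!out->data) { free(out->file_name); return -1; }
        memcpy(out->data, buf + 2 + name_length, out->data_len);
        return 0;
    }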
3.2. The download layer

The Digital Storage Media Command and Control specification provides a set of protocols for realising different applications such as movie on demand, tele-shopping, news on demand, remote database access, etc. The main characteristic of this syntax is that it does not specify the underlying physical, data link and transport layers of the overall protocol stack. It is intended to provide a unified signalling layer over a wide variety of underlying network topologies. A particular feature of DSM-CC is the download function, which is intended to be a very lightweight and fast data or software download from the server to the client, or from the network to the client. Flow-controlled operation, as well as a broadcast download option, are supported and are based on the same message set. A complete download operation transfers an "image" of data, which is made up of logically separate sections called modules. In order to meet the transmission constraints, such as bit error rates and maximum transport packet size, each module is then divided into blocks, all of the same size. During a download session the whole web file system is sent to the client, managing each file as a module. In order to realise a web download, two DSM-CC message structures have been used: the DownloadInfoResponse and the DownloadDataBlock messages. Table 2 shows the syntax of the DownloadInfoResponse message, in which some additional fields have been introduced, taking care to respect the syntax constraints of the message. Furthermore, we have already assigned values to some fields that have constant values or no meaning in our application; these fields could not be removed from the message if it is to be parsed correctly by a DSM-CC decoder. The DownloadInfoResponse starts a download session, conveying some general information about the session and specific information related to each module. The downloadId is a four-byte field that identifies the download session.
    Syntax                                  N. of Bytes    Value
    DownloadInfoResponse() {
        downloadId                          4              -
        blockSize                           2              4018
        windowSize                          1              0
        ackPeriod                           1              0
        tCDownloadWindow                    4              0
        tCDownloadScenario                  4              0
        compatibilityLength                 2              0
        numberOfModules                     2              -
        for(i=0;i<numberOfModules;i++) {
            moduleId                        2              -
            moduleSize                      4              -
            moduleVersion                   1              -
            moduleInfoLength                1              1
            compression_flag                1 bit          -
            reserved                        1 bit          -
            type_code                       6 bit          -
        }
        privateDataLength                   2              -
        for(i=0;i<privateDataLength;i++) {
            path_description_byte           1              -
        }
    }

Table 2. DownloadInfoResponse syntax.

    File type         Type_code
    html              000000
    jpeg              000001
    mpeg-2 audio      000010
    mpeg-2 video      000011
    gif               000100
    text              000101
    avi               000110
    ps                000111
    au                001000
    wav               001001
    quicktime         001010
    AIFF-C            001011
    user private      001100-111111

Table 3. Type_code assignments.
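As a small illustration, the type codes of Table 3 and the one-byte moduleInfo could be represented as follows; the bit ordering (compression flag in the most significant bit) is an assumption, since the paper does not state it.

    #include <stdint.h>

    /* Type codes from Table 3; 0x0C-0x3F are user private. */
    enum type_code {
        TYPE_HTML = 0x00, TYPE_JPEG, TYPE_MPEG2_AUDIO, TYPE_MPEG2_VIDEO,
        TYPE_GIF, TYPE_TEXT, TYPE_AVI, TYPE_PS, TYPE_AU, TYPE_WAV,
        TYPE_QUICKTIME, TYPE_AIFF_C
    };

    /* moduleInfo packs compression_flag (1 bit), reserved (1 bit) and
       type_code (6 bits) into one byte; the bit order is assumed. */
    static inline uint8_t pack_module_info(int compressed, uint8_t type)
    {
        return (uint8_t)(((compressed & 1) << 7) | (type & 0x3F));
    }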
The following field conveys the length of each block for this download session. For our application we have adopted a size of 4018 bytes; this is the maximum size that the transport layer allows, because each download message must be encapsulated in packets of 4 KB maximum length (see the transport layer section). For each module, specific information is conveyed: the module identity, the size, the version and the moduleInfo, whose length must be equal to the value of the moduleInfoLength field. This last field gives the opportunity to introduce some application-dependent information; we have used it for the compression state and the file type. The first is a one-bit field that shall be set to 1 when the file has been compressed. The second is a six-bit field that conveys the file type, coded by means of Table 3. Finally, the message conveys a privateData field that shall contain private information about the image. We have used this field to introduce the path_description, which gives the tree structure of the file system using the preorder traversal method. In a preorder traversal, the work at a node is performed before its children are processed: first the root is described, then each child is visited in preorder fashion too. The description given for each node consists of the directory name length (coded in one byte), the directory name and the path code (coded in one byte). After a node description is conveyed, another directory_name_length is expected if the last node is not a leaf; otherwise a back code of value zero is introduced. The back code indicates that the next node is located one step back in the path. The end of the description is given by the privateDataLength, as in the sketch below.
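One consistent reading of this encoding is the following C sketch; the directory-tree type and its fields are hypothetical, and the output buffer is assumed large enough.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical directory node: name, one-byte path code, and a
       NULL-terminated array of children, visited in preorder. */
    typedef struct dir_node {
        const char      *name;
        uint8_t          path_code;
        struct dir_node *child[8];
    } dir_node;

    /* Emit: name length (1 byte), name, path code (1 byte); then the
       children in preorder; then a back code of 0, meaning the next
       node is one step back in the path. Returns bytes written. */
    static size_t emit_node(const dir_node *nd, uint8_t *out)
    {
        size_t p = 0, len = strlen(nd->name);
        out[p++] = (uint8_t)len;                    /* directory_name_length */
        memcpy(out + p, nd->name, len); p += len;   /* directory_name        */
        out[p++] = nd->path_code;                   /* path_code             */
        for (int i = 0; i < 8 && nd->child[i]; i++)
            p += emit_node(nd->child[i], out + p);
        out[p++] = 0;                               /* back code             */
        return p;
    }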
    Syntax                       N. of Bytes
    DownloadDataBlock() {
        moduleId                 2
        moduleVersion            1
        reserved                 1
        blockNumber              2
        for(i=0; i<...; i++) {
            blockDataByte        1
        }
    }

Table 4. DownloadDataBlock syntax.

Once the client has received the whole DownloadInfoResponse message, it is able to parse the data messages and to build the file system through the DownloadDataBlock messages, whose syntax is shown in Table 4. In order to identify the module to which a block belongs, the module identity is conveyed; it shall match the moduleId sent in the control message, otherwise an error has occurred. The blockNumber field numbers the blocks of the module. The blockDataBytes field carries the block bytes, blockSize in number. DSM-CC download furthermore requires the encapsulation of each message in a message header. This header contains information about the type of message being passed, as well as any adaptation data needed by the transport mechanism, including conditional access information needed to decode the data.
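The segmentation of a module into fixed-size blocks is straightforward; in the following C sketch the callback interface is hypothetical.

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4018   /* block size adopted in this application */

    typedef void (*send_block_fn)(uint16_t module_id, uint8_t module_version,
                                  uint16_t block_number,
                                  const uint8_t *data, size_t len);

    /* Split one module (file) into numbered DownloadDataBlock payloads;
       all blocks have BLOCK_SIZE bytes except possibly the last one. */
    void send_module(uint16_t module_id, uint8_t module_version,
                     const uint8_t *module, size_t module_size,
                     send_block_fn send)
    {
        uint16_t block_number = 0;
        for (size_t off = 0; off < module_size; off += BLOCK_SIZE) {
            size_t len = module_size - off;
            if (len > BLOCK_SIZE) len = BLOCK_SIZE;
            send(module_id, module_version, block_number++, module + off, len);
        }
    }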
3.3. The Transport layer

As already mentioned, the DSM-CC functions are designed to be transport independent, so that they are not affected by the transport layer. Only the main constraints concerning the reliability of the data are defined: error detection shall be provided and corrupted messages should be discarded. The MPEG-2 Transport Stream (TS) is a transport layer suitable for this application, providing the means for multiplexing the download message data with MPEG-compressed audio and video streams. Although the messages are not required to be carried within an MPEG-2 Transport Stream, if it is used the messages shall be encapsulated inside a dsmcc_section, which inherits all of the MPEG-2 private section syntax. Each section carries only one download message, with a maximum size of 4 KB. Each video or audio sequence, as well as each sequence of DSM-CC messages, makes up a separate stream (elementary stream), which is divided into Transport Packets so that the data streams can be multiplexed into the Transport Stream. Every elementary stream is identified by a PID, a 13-bit number. Elementary streams can be grouped into Programs, e.g. the audio and the video elementary streams of a movie. For this reason the decoder needs additional information to recognise the streams included in a program, and to know the list of programs conveyed in the TS. This information is called Program Specific Information (PSI). The most important tables are the Program Association Table (PAT) and the Program Map Table (PMT). The PAT lists all programs in a transport stream; it is easy for the decoder to extract the TS packets containing the PAT because they have PID = 0. For each program the PAT gives the PID of the TS packets containing the PMT (each program has a specific PMT); each PMT then gives specific information about the related program and each of its elementary streams.

4. SIMULATION OF THE HYPERDOCUMENTS TRANSMISSION

In order to simulate the transmission of hypertextual data according to ISO 13818, we have devised a communication scheme between UNIX processes. Both a software encoder and a prototype of a proper decoder have been designed. The encoder multiplexes and codes web data employing the DSM-CC download protocol and sends out a Transport Stream; the decoding process demultiplexes the stream and builds the file system of the transmitted hyperdocuments. These processes, running on two different workstations connected by a LAN, communicate by means of a stream socket in the Internet domain. It is a distributed application based on a client/server model, in which the server implements the encoder, delivering transport packets when a decoding client calls for a connection. The algorithm created for the simulation of a Transport Stream encoder essentially captures the hypertextual files from several directories and performs the multiplexing, inserting all the necessary information. The file systems belonging to the webs to be multiplexed are placed in different root directories. In order to accomplish the multiplexing of data from the several webs, the encoder takes the data coming from a different file system each time, in a cyclic fashion, assigning equal priority to each web. The overall operation ends when all the files have been sent; then the broadcast starts again, in such a way as to simulate a real application in which all the multimedia information is transmitted by means of continuous cycles of data refresh, which allow the files to be updated in real time.
This can be rather useful for particular kinds of information, such as stock exchange data, or for updating a database that a firm needs to deliver to its own clients and employees all day long. Furthermore, it allows the user to receive the data at any moment by
decoding one of the refreshed streams. A general layout of the implementation of the encoding process is presented in figure 1; a minimal sketch of the corresponding client side follows.
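The following C sketch (an illustration under stated assumptions, not the authors' code) connects a stream socket to the encoder process and filters the received 188-byte transport packets by PID, as described in the transport layer section; the address and port are placeholders.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define TS_PACKET 188   /* MPEG-2 transport packet size */

    int main(void)
    {
        /* Placeholder address of the encoder (server) process. */
        struct sockaddr_in srv = { 0 };
        srv.sin_family = AF_INET;
        srv.sin_port = htons(5000);
        inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0 || connect(s, (struct sockaddr *)&srv, sizeof srv) < 0)
            return 1;

        uint8_t pkt[TS_PACKET];
        size_t got = 0;
        for (;;) {                      /* reassemble full 188-byte packets */
            ssize_t n = read(s, pkt + got, TS_PACKET - got);
            if (n <= 0) break;
            got += (size_t)n;
            if (got < TS_PACKET) continue;
            got = 0;
            if (pkt[0] != 0x47) continue;   /* sync byte check               */
            uint16_t pid = ((pkt[1] & 0x1F) << 8) | pkt[2];  /* 13-bit PID   */
            if (pid == 0x0000)
                puts("PAT packet");     /* lists the programs' PMT PIDs      */
            /* other PIDs: PMTs and the DSM-CC elementary streams            */
        }
        close(s);
        return 0;
    }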
[Figure 1: for each of the N elementary streams, a web file sequencer feeds a block generator and the DownloadDataBlock/DownloadInfoResponse message generators; DSM-CC section generators and transport packet generators follow, and the resulting packets, together with the PSI tables, are multiplexed into the Transport Stream.]
Figure 1. Coding and multiplexing system layout.

5. CONCLUSION

In this paper we have proposed a new system for coding hypertextual data within the MPEG-2 Transport Stream. In particular, we have referred to the features of HTML webs and developed a novel method to multiplex different hyperdocuments in the same digital stream. To achieve this goal we have made use of the DSM-CC download protocol, defined by ISO/IEC 13818-6, which has been adapted both to this application and to the transport layer. In summary, this work deals with the transfer of several file systems (webs), embedded and coded in such a way as to respect the constraints of the protocols mentioned above. This model can be the basis for providing more advanced additional multimedia services in the field of digital TV broadcasting.

REFERENCES
1. ISO/IEC JTC1/SC29/WG11, ISO/IEC 13818-1: Information technology - Generic coding of moving pictures and associated audio - Systems, November 1994.
2. ETSI, Digital broadcasting systems for television; specification for conveying ITU-R System B Teletext in Digital Video Broadcasting (DVB) bitstreams, prETR 300472, November 1994.
3. ISO/IEC JTC1/SC29/WG11, ISO/IEC 13818-6: Information technology - Generic coding of moving pictures and associated audio information - Extension for Digital Storage Media Command and Control, Dallas, November 1995.
4. N. Woodhead, Hypertext and Hypermedia: Theory and Applications, Addison-Wesley, Englewood Cliffs, NJ, 1991.
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.) © 1997 Elsevier Science B.V. All rights reserved.
A Subband Video Transmission Coding System for ATM Networks

M. Eyvazkhani

Ecole Nationale Supérieure des Télécommunications, Département Images, 46 Rue Barrault, 75013 Paris, France
Tel.: (33-1) 45.81.80.91  Fax: (33-1) 45.81.37.94  e-mail: [email protected]

Abstract

This paper presents a three-dimensional video coding scheme which allows us to protect the main information against packet loss in the transmission of a video signal over an ATM network. This kind of information loss occurs frequently in the network; it can seriously degrade the perceptual quality of the image and even desynchronize the coding system, causing loss of the picture. Three-dimensional subband video coding is a good alternative for this purpose, because it decomposes the video signal into subbands in such a manner that all of the important information, which is concentrated in the lowest frequency level, can be transmitted over a protected channel with high priority so as to guarantee an acceptable quality. Compared to a classical approach such as MPEG, this coding scheme offers more efficiency and robustness, especially when the bit error rate in the network is very high.
1. Introduction

During the last few years, the ATM (Asynchronous Transfer Mode) network has been considered a good candidate for the transmission of video signals in the B-ISDN (Broadband Integrated Services Digital Network). In contrast to the advantages the ATM network offers for video transmission, such as the selection of the bit-rate or the possibility of variable bit-rate coding, it may also introduce some time delay, and packet loss can occur with a probability varying from about 10^-5 to, in the worst case, about 10^-2, corresponding to a bit error rate in the range of 10^-9 to 5 × 10^-3. Three factors produce packet loss in an ATM network: transmission bit errors, packet buffer overflows in the switches, and excessive time delay [3]. In video transmission, because of the high bit-rate of the video signal, a compression technique must be applied, without degrading the image quality, in order to limit the bit-rate to the channel capacity of the network. On the other hand, in the presence of packet loss, compression techniques make the video signal more vulnerable, and the classical approaches, which use motion compensation and motion vector techniques, are generally not recommended for this kind of application. General error correction techniques such as FEC (Forward Error Correction) can improve the image quality, but when the bit error rate increases they become ineffectual. For this purpose, 3-D subband video coding can solve the problem by using different priority levels and protecting the part of the information judged more important.
In this paper, the transmission of a video signal over an ATM network using three-dimensional subband coding to compensate for the effect of packet loss is presented. First of all, the global subband system, in the form of a three-dimensional filter bank, is treated. Thereafter, the coding scheme that follows the filter bank is discussed. Finally, some simulation results obtained with the codec are reported and compared with those from an MPEG codec.
2. Subband codec
Subband coding for packet video systems refers to compression methods that divide the signal into multiple bands, to take advantage of the bias in the frequency spectrum of the video signal. The subband codec consists of two major parts: the filter bank and the appropriate coding of each subband. In the first stage, the video signal passes through the filter bank in order to be partitioned into subbands. The filter bank itself is composed of analysis filters, which produce subbands containing information in different frequency ranges, and a downsampling section for reducing the bit-rate. Thereafter, for the purpose of minimizing the signal bit-rate as much as possible, each subband is coded, quantized and sent separately over the ATM transmission channels. At the receiver, the coded subbands are decoded and then pass through the upsampling section and the synthesis filters so as to reconstruct the original video signal (figure 1).
Figure 1: The global scheme of the 3-D subband coding system (analysis filters H(z1,z2), (D)PCM/VQ coding and VLC on the encoding side; decoding and synthesis filters F(z1,z2) on the receiving side).

The video is considered as a three-dimensional signal, with one dimension in the time domain and two dimensions in the spatial domain. We have therefore implemented a 3-D filter bank covering the information along the temporal axis and the two spatial axes. Filtering in the temporal direction is achieved by filtering the frames along the temporal axis; it reduces the temporal redundancy. The best candidate for this purpose is the Haar filter, because it gives ideal filters in the low frequency and high frequency branches of the filter bank [2]. The Haar family produces short filters of low computational complexity, which allows us to reduce the number of successive frames needed and thus to minimize the buffer size. The Haar filters used in the codec are
\[ H_0(z) = \frac{1 + z^{-1}}{\sqrt{2}} , \qquad H_1(z) = \frac{1 - z^{-1}}{\sqrt{2}} , \]
where H_0(z) and H_1(z) are respectively the low pass and high pass filters. The synthesis filters are designed as F_0(z) = H_1(-z) and F_1(z) = -H_0(-z), in order to remove the aliasing in the temporal filter bank. As for the frequency decomposition of the video signal in the spatial dimensions, a non-separable filter bank has been designed to process the signal along the horizontal and vertical axes. In contrast to separable 2-D filter banks, which are built from two 1-D filter banks, a non-separable filter bank offers more flexibility and better performance. It also gives regularity in the frequency separation of the subband images and avoids the mixed orientations present in the separable case [8]. This orthogonal filter bank requires linear phase and cancels the aliasing terms coming from the subsampling section. As for the downsampling section, hexagonal sampling is used, because it is the optimal sampling scheme for signals that are band-limited over a circular region of the Fourier plane, in the sense that exact reconstruction requires a lower sampling density than the alternative schemes [4, 5]. Moreover, hexagonal sampling lattices provide the tightest packing of all regular 2-D lattices and can eliminate the mixed orientation problem through the use of hexagonally symmetric filters. The downsampler samples its input s(n) by mapping points of the sublattice Λ_S, generated by the sampling matrix S (here with |det(S)| = 4), onto the lattice Λ according to

\[ v(n) = s(Sn) , \]

discarding the samples of s(n) not on Λ_S. Inversely, the upsampler maps a signal on the lattice Λ to one that is nonzero only at points of the sampling sublattice Λ_S; its output is defined as
\[ w(n) = \begin{cases} y(S^{-1} n) , & \text{if } S^{-1} n \in \Lambda , \\ 0 , & \text{otherwise.} \end{cases} \]
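As an illustration of the temporal stage, the following minimal C sketch (not the paper's implementation) applies the Haar analysis and synthesis to a pair of successive frames; with H_0 and H_1 as above, filtering followed by downsampling by two reduces to the pairwise butterfly below, and exact reconstruction is easy to verify.

    #include <math.h>
    #include <stddef.h>

    /* Temporal Haar analysis on two successive frames:
       low  = (frame_n + frame_n1) / sqrt(2)   (output of H0)
       high = (frame_n - frame_n1) / sqrt(2)   (output of H1) */
    void haar_analysis(const float *frame_n, const float *frame_n1,
                       float *low, float *high, size_t npix)
    {
        const float s = (float)(1.0 / sqrt(2.0));
        for (size_t i = 0; i < npix; i++) {
            low[i]  = s * (frame_n[i] + frame_n1[i]);
            high[i] = s * (frame_n[i] - frame_n1[i]);
        }
    }

    /* Synthesis: recovers both frames exactly (orthogonal pair). */
    void haar_synthesis(const float *low, const float *high,
                        float *frame_n, float *frame_n1, size_t npix)
    {
        const float s = (float)(1.0 / sqrt(2.0));
        for (size_t i = 0; i < npix; i++) {
            frame_n[i]  = s * (low[i] + high[i]);
            frame_n1[i] = s * (low[i] - high[i]);
        }
    }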
3. Appropriate coding of subbands
Before sending the subband images generated by the analysis filters and decimated by the hexagonal downsampler, we should encode them so as to reduce the data rate. As mentioned in the last section, the video signal after the temporal filtering is divided into two subband images. Each of them then passes separately through the 2-D spatial filter bank. According to the sampling matrix, there are exactly four subbands (|det(S)| = 4) in each of the two low and high temporal frequency branches. For the purpose of fitting a model to the energy contained in each subband image, the normalized histograms of intensity values are used. The probability density function (pdf) of the subbands of an image belongs to a class of functions known as generalized density functions, and among the different pdfs [6], the Gaussian pdf has been fitted to the histograms of intensity values for each subband [7]. Since the energy of any natural image is concentrated in the low frequency region, the DPCM technique with uniform quantization has been used to minimize the data rate. This low frequency content is found especially in the first two subband images, where there is a higher correlation among pixels. For the other subband images, where the correlation between pixels is not so high, the PCM technique with uniform quantization has been applied to maximize the reduction of the bit rate. A natural approach to protecting the packets is to ensure an appropriate level of quality for each service class by assigning some sort of priority to the packets. In an ATM network, the priority can be defined explicitly through the cell loss priority (CLP) field in the packet header. The actual prioritization of the signals can be accomplished by subband coding technology. At the transmission channel level, since the important energy is contained in the first subband image, it is sent over the network with high priority signalled in the CLP field of the packet header, in order to offer better protection against packet losses in the network [3]. If packet loss occurs in the network, this subband image remains intact and allows us to recover at least the essential information for displaying the picture, while the contour information residing in the other subband images completes the quality of the picture.
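As an illustration of this coding stage, here is a minimal C sketch of closed-loop DPCM with a uniform quantizer, of the kind applied to the low-frequency subbands; the previous-sample predictor and the step size are assumptions, since the paper does not specify them.

    #include <math.h>
    #include <stddef.h>

    /* Closed-loop DPCM with uniform quantization. The encoder predicts
       each sample from the previously reconstructed one, so encoder and
       decoder stay in the same state. */
    void dpcm_encode(const float *x, int *q, size_t n, float step)
    {
        float pred = 0.0f;                 /* decoder starts from 0 too */
        for (size_t i = 0; i < n; i++) {
            float e = x[i] - pred;         /* prediction error          */
            q[i] = (int)lrintf(e / step);  /* uniform quantization      */
            pred += step * (float)q[i];    /* reconstructed sample      */
        }
    }

    void dpcm_decode(const int *q, float *x, size_t n, float step)
    {
        float pred = 0.0f;
        for (size_t i = 0; i < n; i++) {
            x[i] = pred + step * (float)q[i];
            pred = x[i];
        }
    }

Dropping the predictor (pred fixed at 0) gives the PCM variant used for the higher subbands.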
4. Simulation and results

To obtain the results, a simple ATM network has been simulated to test the two codecs, the subband codec and an MPEG-2 codec. The MPEG-2 codec used in the simulation codes a train of frames in IPPB mode [1]. The simulation has been carried out in the presence of Gaussian noise with a probability varying in the range of 10^-3 to 10^-9. As shown in figures 2 and 3, the subband codec performs better than the MPEG-2 codec, as measured by the Signal to Noise Ratio (SNR) on the VOITURE video sequence, which has a resolution of 512 × 512. In general this subband codec offers 1.5 to 2 dB better performance. It is true that the SNR measure is not a good criterion for comparing video quality. But when we encode the video sequence with MPEG-2 in a situation where the bit error rate exceeds 10^-4, the image quality degrades dramatically because empty blocks appear in the picture. These empty blocks come from packets lost in the network which contained an important part of the information concerning the motion vectors between blocks of an image, and this degradation is far more significant than a 2 dB difference in the SNR measure. With the subband codec in the same circumstances, we obtain at least the whole image, because the low frequency part of the information, sent over the high priority channels, is preserved against transmission errors, and the entire discard rate is absorbed by the low priority data. In this case, the SNR is lowered by the packet loss, but because the vital data has been preserved, degradation occurs only gradually and does not appear as obvious blocks.
5. Conclusion

In this paper, the performance of 3-D subband video coding in the presence of packet loss in an ATM network has been evaluated, without using any error control codes. The results have been compared to the performance of an MPEG scheme in the same circumstances. A major aspect of the codec has been presented, namely the use of a non-separable filter bank with hexagonal sampling. We observe from the simulation results that the proposed codec offers an improvement over the classical methods based on the DCT and motion compensation techniques, especially when the bit error rate in the network increases.
References

[1] ISO/IEC 13818 Draft International Standard: Generic coding of moving pictures and associated audio, part 2: video, 1993.
[2] I. Daubechies. Ten Lectures on Wavelets. CBMS-SIAM, 1992.
[3] Martine de Prycker. Asynchronous Transfer Mode: Solution for Broadband ISDN. Ellis Horwood, 1993.
[4] Eric Dubois. The sampling and reconstruction of time-varying imagery with application in video systems. In Proceedings of the IEEE, volume 73, pages 502-522, April 1985.
[5] D. E. Dudgeon and R. M. Mersereau. Multidimensional Digital Signal Processing. Prentice-Hall, 1984.
[6] N. Farvardin and J. W. Modestino. Optimum quantizer performance for a class of non-Gaussian memoryless sources. In IEEE Trans. Inform. Theory, volume IT-30, pages 485-497, May 1984.
[7] A. Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, 1984.
[8] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice-Hall, 1993.
[Figure 2: The quality measured as SNR at BER = 10^-4 (SNR in dB versus frame number; X: the MPEG codec).]

[Figure 3: The quality measured as SNR at BER = 10^-9 (SNR in dB versus frame number; X: the MPEG codec).]
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.) © 1997 Elsevier Science B.V. All rights reserved.
A High Efficiency Coding Method Using Global Brightness-Fluctuation Compensation and Global Motion Compensation Methods

Kazuto Kamikura, Hirohisa Jozawa, Hiroshi Watanabe, Hiroshi Kotera, and Kazunori Shimamura

NTT Human Interface Laboratories, 1-2356 Take, Yokosuka, Kanagawa 238-03, Japan
Phone: +81 468 59 3475  Fax: +81 468 59 2829  E-mail: [email protected]

1. INTRODUCTION

The emergence of multimedia has resulted in a growing interest in video applications in a wide range of fields such as communications, broadcasting and computer manufacturing.
Under the recent circumstances where multimedia is becoming more and more popular, there is a growing demand for various applications such as desktop conferencing using personal computers, portable visual telephony, networked games, shopping, and VOD access. The practical realization of such applications, though, requires a significant improvement in video coding efficiency at low or very low bit rates. This is especially necessary to transmit images of sports, landscapes, live concerts, and the like, which often include scenes where global motion is due to the panning or zooming motion of the camera, or global brightness fluctuation is due to fade-in/out, camera flashes, and so on. With conventional video coding standards such as H.263 [1], video scenes containing global motion or global brightness fluctuation suffer terrible degradation. To solve the problem of global motion, we have already proposed a global motion compensation (GMC) method and shown its effectiveness [2]. In this paper, a global brightness-fluctuation compensation (GBC) method is proposed to improve coding efficiency for video scenes that contain global fluctuations in brightness. The GBC method is discussed in section 2. In section 3, we describe how the GBC and GMC methods work when combined. Simulation results that verify the feasibility of the above methods are presented in section 4. We draw some conclusions in section 5.
2. GLOBAL BRIGHTNESS-FLUCTUATION COMPENSATION
2.1 Overview of the method

The GBC method compensates for global brightness fluctuations caused by fade-in, fade-out, camera flashes, and so on, in the whole image. We assume that the luminance x_{i,j} at pixel point (i,j) changes to x'_{i,j} such that

\[ x'_{i,j} = F_C \cdot x_{i,j} + F_G , \qquad (1) \]

where F_C and F_G are the contrast and gain components of the brightness fluctuation. These are the parameters, and they are estimated for each image. In the actual process, x_{i,j} and x'_{i,j} must be the luminance values of corresponding pixel positions, taking local motion into account. Therefore the corresponding pixels between the coded image and a reference image are first detected, as shown in Fig. 1. The global brightness-fluctuation parameters F_C and F_G are then estimated. Finally, the luminance value x_{i,j} of each pixel in the reference image is changed to x'_{i,j} according to Eq. (1) to compensate for the global brightness fluctuation. Each process is described in detail in the following sections.

[Fig. 1: Framework of the GBC method: detection of the corresponding pixels between coded image and reference image; estimation of the global brightness-fluctuation parameters F_C and F_G; generation of the image compensated for the brightness fluctuation.]
2.2 Detection of the corresponding pixels

The corresponding pixels between a coded image and a reference image are first detected using local motion estimation, basically by means of a conventional block matching technique. However, this technique cannot detect real motion well when a global brightness fluctuation occurs between the two images, because it assumes constant brightness between corresponding pixels. To remove the influence of the brightness fluctuation, the mean luminance value of each block is subtracted from the pixel values of that block in advance, and the block matching is performed on the differences. That is, the motion vector for each block is chosen as the vector $(k,l)$ that minimizes the error function

$$D(k,l) = \sum_{j=1}^{J}\sum_{i=1}^{I}\left|\,(S_{i,j}-M)-(\hat{S}_{i+k,j+l}-\hat{M}_{k,l})\,\right|, \qquad (2)$$

where $M$ and $\hat{M}_{k,l}$ are the luminance mean values of block $B$ and block $\hat{B}$, respectively, in Fig. 2.
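As an illustration of Eq. (2), the following Python sketch performs full-search block matching with the block means removed, so that a global brightness change does not bias the match. This is a minimal sketch under assumed conventions (integer-pel search, 16x16 blocks, 2-D numpy luminance arrays); none of the names come from the paper.

```python
import numpy as np

def mean_removed_block_match(coded, reference, bi, bj, block=16, search=7):
    """Find the motion vector (k, l) minimising Eq. (2): the SAD between
    the mean-removed block of the coded image and the mean-removed
    candidate block of the reference image."""
    B = coded[bi:bi + block, bj:bj + block].astype(np.float64)
    M = B.mean()
    best, best_cost = (0, 0), np.inf
    for k in range(-search, search + 1):
        for l in range(-search, search + 1):
            i0, j0 = bi + k, bj + l
            if (i0 < 0 or j0 < 0 or i0 + block > reference.shape[0]
                    or j0 + block > reference.shape[1]):
                continue  # candidate block outside the reference image
            Bref = reference[i0:i0 + block, j0:j0 + block].astype(np.float64)
            cost = np.abs((B - M) - (Bref - Bref.mean())).sum()
            if cost < best_cost:
                best_cost, best = cost, (k, l)
    return best
```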
Fig. 2 Local motion estimation: block B in the coded image and the displaced block B̂ in the reference image.
2.3 Estimation of global brightness-fluctuation parameters

Taking into account the local motion obtained in Section 2.2, we can replace Eq. (1) by

$$x'_{i,j} = F_C \cdot x_{\hat{i},\hat{j}} + F_G, \qquad (3)$$

where $\hat{i} = i + K_B$, $\hat{j} = j + L_B$, and $(K_B, L_B)$ is the motion vector of the block $B$ that contains the pixel point $(i,j)$. Let $y_{i,j}$ be the actual luminance value corresponding to the luminance value $x'_{i,j}$. We define the two parameters $F_C$ and $F_G$ as the pair of values for which the sum of the squared differences between $y_{i,j}$ and $x'_{i,j}$ becomes minimum. Thus, the evaluation function $E$ is

$$E = \sum_{i,j}^{N}\left\{y_{i,j} - x'_{i,j}\right\}^2 = \sum_{i,j}^{N}\left\{y_{i,j} - (F_C \cdot x_{\hat{i},\hat{j}} + F_G)\right\}^2, \qquad (4)$$

where $N$ is the number of pixels in the coded image. Setting the partial derivatives of $E$ with respect to $F_C$ and $F_G$ to 0, we obtain

$$\frac{\partial E}{\partial F_C} = \sum_{i,j}^{N}\left(2F_C \cdot x_{\hat{i},\hat{j}}^{2} - 2x_{\hat{i},\hat{j}}\,y_{i,j} + 2F_G \cdot x_{\hat{i},\hat{j}}\right) = 0, \qquad (5)$$

$$\therefore\;\sum_{i,j}^{N}\left(F_C \cdot x_{\hat{i},\hat{j}}^{2} - x_{\hat{i},\hat{j}}\,y_{i,j} + F_G \cdot x_{\hat{i},\hat{j}}\right) = 0, \qquad (6)$$

$$\frac{\partial E}{\partial F_G} = \sum_{i,j}^{N}\left(2F_C \cdot x_{\hat{i},\hat{j}} - 2y_{i,j} + 2F_G\right) = 0, \qquad (7)$$

$$\therefore\;\sum_{i,j}^{N}\left(F_C \cdot x_{\hat{i},\hat{j}} - y_{i,j} + F_G\right) = 0. \qquad (8)$$
From Eqs. (6) and (8),

$$F_C = \frac{N \cdot Z - X \cdot Y}{N \cdot W - X^2}, \qquad F_G = \frac{W \cdot Y - X \cdot Z}{N \cdot W - X^2}, \qquad (9)$$

where

$$X = \sum_{i,j}^{N} x_{\hat{i},\hat{j}}, \qquad Y = \sum_{i,j}^{N} y_{i,j}, \qquad Z = \sum_{i,j}^{N} x_{\hat{i},\hat{j}}\,y_{i,j}, \qquad W = \sum_{i,j}^{N} x_{\hat{i},\hat{j}}^{2}.$$

The global brightness-fluctuation parameters $F_C$ and $F_G$ are obtained for each coded image using Eq. (9).
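The closed-form solution of Eq. (9) is straightforward to implement. The sketch below is a minimal illustration, assuming the motion-compensated reference luminances (x_hat) and the coded-image luminances (y) are already available as arrays; the synthetic check at the end is our own, not from the paper.

```python
import numpy as np

def estimate_gbc_params(x_hat, y):
    """Least-squares estimate of contrast F_C and gain F_G, Eq. (9)."""
    x_hat = np.asarray(x_hat, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    N = x_hat.size
    X, Y = x_hat.sum(), y.sum()
    Z, W = (x_hat * y).sum(), (x_hat ** 2).sum()
    denom = N * W - X ** 2          # Eq. (9) denominator
    F_C = (N * Z - X * Y) / denom
    F_G = (W * Y - X * Z) / denom
    return F_C, F_G

# Quick check: a synthetic fade, y = 0.8 * x + 12, is recovered exactly.
x = np.random.default_rng(0).uniform(16, 235, size=10000)
print(estimate_gbc_params(x, 0.8 * x + 12.0))   # ~ (0.8, 12.0)
```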
2.4 Generation of the image compensated for the brightness fluctuation

The luminance value $x_{i,j}$ of each pixel in the reference image is changed to $x'_{i,j}$ according to Eq. (1) to compensate for the global brightness fluctuation.

3. COMBINATION OF GLOBAL BRIGHTNESS-FLUCTUATION COMPENSATION AND GLOBAL MOTION COMPENSATION

In the GBC and GMC [2] methods, the first step is almost the same. Therefore, the two methods can easily be combined in a video coding scheme. That is, as shown in Fig. 3, local motion estimation is first performed using the block matching technique with the mean value of each block subtracted. The global brightness-fluctuation parameters are estimated next and, for the GBC process, a global-brightness-fluctuation-compensated image is generated from the reference image using these parameters. For the GMC process, the global motion parameters are then estimated and a global-motion-compensated image is generated from the global-brightness-fluctuation-compensated image. This image is used as the reference image for interframe prediction in a conventional coding algorithm such as H.263 [1] or MPEG-1 [3].

4. SIMULATION

Computer simulation was carried out using two video sequences in the SIF format (352 pixels x 240 lines). Sequence 1 included shots characterized by camera pan, tilt, zoom, and fade-in. Sequence 2 included shots with similar camera operations and also included fade-out. The basic algorithm was H.263. The bit-rate was 112 kbit/s, and a variable frame rate technique was used for bit-rate control.
Fig. 3 Framework of the combined GBC and GMC methods.
Signal-to-noise ratios (SNR) obtained with the "H.263 only" and "H.263 + GBC & GMC" schemes are shown in Fig. 4, and frame-rates are shown in Table 1. Fig. 4 shows that the GBC and GMC methods improved the quality by 4-5 dB for a fade-in scene in sequence 1 and by 2-3 dB for a fade-in/out scene in sequence 2. Furthermore, Table 1 shows that the GBC and GMC methods improved the frame-rate by about 1 frame/s.

5. CONCLUSIONS

In this paper, we proposed a global brightness-fluctuation compensation method. The method compensates for global brightness fluctuations caused by fade-in, fade-out, camera flashes, and so on, in the whole image. Furthermore, we showed how this method works together with the global motion compensation method we have already proposed to improve coding efficiency for video scenes that contain global motion and global fluctuations in brightness. Simulation results showed that the two methods remarkably improved coding efficiency for fade-in/out video sequences.

Table 1 Frame-rates for each scheme

             | H.263 only    | H.263 + GBC & GMC
sequence 1   | 6.8 frames/s  | 7.6 frames/s
sequence 2   | 5.6 frames/s  | 6.6 frames/s
Fig. 4 Coding performance: SNR versus frame number for (a) sequence 1 and (b) sequence 2, comparing the "H.263 + GBC & GMC" and "H.263 only" schemes; the fade-in and fade-out scenes are marked.
REFERENCES
[1] Draft ITU-T Recommendation H.263, "Video coding for low bitrate communication," Dec. 1995.
[2] K. Kamikura and H. Watanabe, "Video coding for digital storage media using hierarchical intraframe scheme," SPIE Symposium on Visual Communications and Image Processing '90, vol. 1360, pp. 1540-1550, Oct. 1990.
[3] ISO/IEC 11172, "Information technology - coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s," 1993.
A sequence analysis system for video databases
M. Ceccarelli a, A. Hanjalic b, R.L. Lagendijk b

a Philips Research Laboratories Eindhoven, Storage and Retrieval Group, Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands
b Delft University of Technology, Dept. of Electrical Engineering, Information Theory Group, P.O. Box 5031, 2600 GA Delft, The Netherlands

The proliferation of digital transmissions and video services will lead to a demand for efficient and flexible local storage devices for time-shifting, personal archival and fast access to downloaded material. Visual search tools must be provided for easily locating specific information within huge volumes of video data. In this paper we consider advanced methods for the automated analysis of compressed video sequences, the extraction of representative information, and its organization in a video database for efficient browsing and retrieval.

1. INTRODUCTION

In the context of the European SMASH project, sponsored by the EU in the framework of the ACTS programme, technical possibilities for a consumer storage device for multimedia applications are currently under investigation (see http://www-it.et.tudelft.nl/pda/smash/). The main issues involved are the recording of digital video streams and multimedia documents. The current focus of our research is on the implementation of an automatic video abstracting system, able to simplify retrieval operations on large amounts of video data. We will consider storage of MPEG compressed streams [1], used in present and future digital video services, including DVB and ATV transmissions, but the same principles can be applied to other coding systems, e.g. to support non-linear editing of streams from a digital video camcorder (DVC). The high bit rate and the compressed nature of video streams, together with the large amount of stored data, hinder the retrieval task of the user. With MPEG video streams, the standard trick modes (VCR-like fast viewing operations) cannot be easily implemented, hence new fast and effective techniques for video browsing must be found [2]. The descriptive power of textual annotations is unsatisfactory for our purposes; a visual representation is needed. A solution could be found by coupling a textual description with representative images, but manual insertion of text and extraction of images would be time consuming, hence the whole system must be automated. In our implementation (see Figure 1), real-time analysis of the video content is performed during the recording operation, aimed at producing a practical reference to the semantic structure of the programme and at extracting the information necessary for the indexing module. In order to obtain a reconstruction of the original storyboard, structural analysis is performed to parse the video through a process of reverse-editing.
Figure 1. Overview of the system for real-time analysis of recorded video streams: the MPEG transport stream is demultiplexed and partially decoded; video parsing, activity estimation, audio analysis and service-information extraction feed key-frame selection, hierarchical structuring and indexing (to disk), while the stream itself is recorded (to tape).
2. VIDEO PARSING

The video parsing routines produce a temporal segmentation of the sequences into their elementary structural units, the camera shots. The scene change detection algorithms identify the boundaries between consecutive camera shots by determining the frames where a transition occurs from one shot to another. Many scene change detection techniques operate in the pixel domain [3-5], but full decompression of the bit stream would involve Huffman decoding, inverse quantization, inverse cosine transformation and motion compensation. The algorithms we introduce operate on compressed sequences, requiring just minimal decoding, since we exploit the available information on frequency component values and motion vectors.
2.1 Analysis in the Compressed Domain

In a video stream encoded using the MPEG-2 standard [1], it is possible to obtain useful information about the encoded I frames without performing the IDCT, by using the DC values of the DCT. Since these yield an average value for each 8x8 block in the spatial domain, we can reconstruct a reference image (called a DC-picture) reduced from the original by a factor of 8 (by 16 in the chrominance components in the case of a 4:2:0 sampling format). These subsampled pictures can be used for detecting scene changes [6]. The spatial averaging obtained with DC values also results in fewer false alarms in cut detection because of the decreased sensitivity to local variations [3]. However, this technique shows its weakness with the standard Groups of Pictures (GOPs) used in video broadcasting, which employ several P and B frames between I frames. Due to the low temporal resolution, there may be false detections in sequences with high motion, and the scene change cannot be exactly located. In order to obtain even a reduced-resolution version of P and B frames, complete IDCT decoding should be performed in order to apply motion compensation. This process is computationally expensive, especially for B frames, where each block may have been encoded in at least four different ways. A few proposals avoid full decoding of the predicted frames by using the number of predicted blocks, in comparison with the intra-coded blocks, as a criterion for detecting scene changes occurring on P and B frames [7]. Of course this number also depends on the search range adopted by the motion estimation algorithm of the encoder, hence false alarms could occur in high-motion sequences. Furthermore, since our focus is on broadcast (DVB) sequences, we have to take into account that professional MPEG encoding systems for content and service providers normally employ scene change detectors on the original sequence to be encoded.
In order to optimize the bitrate/quality ratio of the encoded sequence, when a sharp scene change is encountered the GOP length is adapted so as to have a closed GOP (ending with a P or I frame) at the end of each shot and an I frame corresponding to the first frame of the new shot. Yet it is necessary to improve the robustness of scene change detection in sequences with high motion content, by increasing the temporal resolution of the monitored frames. The interval between an I and a P frame or between two P frames is normally not larger than two (B) frames, hence we extract DC-pictures only from P frames [8], avoiding the complexity of B-frame decoding. The DC values of predicted macroblocks in P frames can be obtained through an approximated inverse motion compensation as follows: the DC values of a block of the present picture are obtained by an area-weighted average of the four blocks pointed to by the motion vector in the previous frame, plus the residual error term of the prediction (see Figure 2). Since cascaded prediction ultimately refers back to the anchor I frame, the quality of these approximated pictures will deteriorate as the distance from the anchor frame increases, but our tests revealed reasonably good quality for standard 13-frame GOPs.
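A minimal sketch of this approximated inverse motion compensation for one predicted block follows; it assumes an interior block, integer-pel motion, and a DC-picture stored as one value per 8x8 block (all assumptions of ours, not details from the paper).

```python
import numpy as np

def predicted_block_dc(dc_ref, bi, bj, mv, dc_residual):
    """Approximate DC of a predicted 8x8 block: area-weighted average of
    the (up to) four reference blocks overlapped by the displaced block,
    plus the DC of the coded residual.

    dc_ref: DC-picture of the anchor/previous frame (one value per block);
    (bi, bj): block indices of the current block; mv: (dy, dx) in pixels."""
    dy, dx = mv
    y, x = bi * 8 + dy, bj * 8 + dx       # top-left pixel of the prediction
    by, bx = y // 8, x // 8               # first overlapped reference block
    fy, fx = y - by * 8, x - bx * 8       # fractional offsets in pixels
    dc = 0.0
    for oy, wy in ((0, 8 - fy), (1, fy)): # area weights of the four blocks
        for ox, wx in ((0, 8 - fx), (1, fx)):
            w = (wy * wx) / 64.0
            if w > 0:                      # skip zero-area contributions
                dc += w * dc_ref[by + oy, bx + ox]
    return dc + dc_residual
```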
2.2 Scene Change Detection

One of the main issues in temporal segmentation techniques is the definition of a valid measure to express the difference between two frames. The histogram of the colour components is one of the most effective metrics and has the advantage of being insensitive to motion. The RGB colour space is commonly used in the literature, but the performance of cut detection algorithms in the YUV colour space, adopted by the MPEG standard, has proven to give more satisfactory results [9]. Before parsing, a simple test based on colour histograms is used to detect whether the analysed sequence is in B/W, so that proper measures can be employed during the parsing process. For the detection of cuts, the difference between two consecutive frames must be measured. In order to compare the binned distributions of the two frames, we compute the difference of the combined histograms of the chrominance components [4]. Adopting the chi-square measure for statistical binned distributions, we can define the global difference between the frame at timecode $t_a$ and the frame at timecode $t_b$, given their N-bin histograms, as

$$\chi^2(f_a, f_b) = \sum_{s \in \{Y,U,V\}} \sum_{i=0}^{N-1} \frac{\left(H_i^s(f_a) - H_i^s(f_b)\right)^2}{H_i^s(f_a) + H_i^s(f_b)}. \qquad (1)$$

To determine whether a scene change has occurred, we have to compare the resulting frame-to-frame difference value with a threshold. The setting of a proper threshold depends very much on the analysed sequence, hence it is important to stress the problem of an optimal setting. Considering the variations of the average frame-to-frame difference, which depend on the evolution of the content, an adaptive thresholding can be applied. The class of frame difference values corresponding to a scene change must be separated from the normal difference values along a shot. Techniques using the mean and standard deviation and k-means clustering algorithms were discarded because missed detections spoil the statistics. Instead we adopt a technique for shot-adaptive thresholding based on the concept of a temporal sliding window around the presently analysed frame [10], to which we apply a temporal differential filter. When the difference between the present and the past frame is by far larger than the differences between all other neighbouring frames within the window, the present frame is classified as the first of a new shot. By means of this technique we obtained a quite robust scene change detector, with performance between 93% and 98% correct detection, depending on the analysed sequences, and a quite low false alarm rate, around 4-5%.
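A sketch of both steps, the chi-square histogram distance of Eq. (1) and the sliding-window test, is given below; the window half-size and the ratio are illustrative values of ours, not the ones used in the experiments.

```python
import numpy as np

def chi_square_diff(hists_a, hists_b):
    """Eq. (1): hists_a/hists_b are dicts of N-bin histograms for 'Y','U','V'."""
    d = 0.0
    for s in ('Y', 'U', 'V'):
        Ha = np.asarray(hists_a[s], dtype=float)
        Hb = np.asarray(hists_b[s], dtype=float)
        denom = Ha + Hb
        mask = denom > 0                     # skip empty bins
        d += (((Ha - Hb) ** 2)[mask] / denom[mask]).sum()
    return d

def is_cut(diffs, t, half_window=5, ratio=3.0):
    """Declare frame t the first of a new shot when its difference to the
    previous frame is much larger than all neighbours in the window."""
    lo, hi = max(0, t - half_window), min(len(diffs), t + half_window + 1)
    neighbours = [diffs[i] for i in range(lo, hi) if i != t]
    if not neighbours:
        return False
    return diffs[t] > ratio * max(neighbours)
```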
3. KEY FRAME EXTRACTION

The goal of key frame extraction algorithms is to obtain a synthetic representation of the most meaningful scenes of a programme. Ideally, semantic primitives such as objects, actions and events should be used, but such analysis is not currently feasible. Extraction has been based, so far, on scene changes: frames corresponding to a detected scene change are extracted, but the first frames of shots generally have low representativity (the worst cases being fades and dissolves). Some proposals select a particular frame as a key frame if a certain frame difference measure exceeds a threshold [11], but the selected frames may still not be the most representative ones, since the threshold can only be chosen subjectively. Another important drawback is that the resulting number of key-frames is known only a posteriori. In practical applications a limit will exist on the maximum allowed storage space, and an excessive number of key frames would not be handy for manipulation by the user. The number of resulting key frames must be set a priori, and the extraction rate must be adaptive in time and content in order to avoid massive overhead information.
Figure 2. Approximated motion compensation with DC-pictures for P frames.

Figure 3. Distribution of phases of motion vectors in a sequence with a camera panning operation.

3.1 Activity Estimation

In order to estimate the effectiveness of a frame in representing a particular scene, information about spatio-temporal activity is particularly valuable. Generally a few key frames are sufficient to describe stationary sequences, while more key frames are necessary to represent sequences containing much activity. Based on this criterion, an activity estimation module monitors the sequences, giving a low value where scarce motion is measured and vice versa. Information about motion is carried in MPEG sequences by predicted (P) and interpolated (B) frames. The real temporal variation of the content must be distinguished from motion due to camera operations or a moving background [7], which will not be considered in the evaluation of content activity. The dominant motion components and the characteristic patterns of vectors must be examined in order to distinguish different classes of camera operations and to detect the areas containing vectors due to translation, rotation or scaling camera operations. The vectors due to camera operations are detected through a histogram of vectors (see Figure 3): in order to classify the motion vectors, the total distance from the phase of the modal vector is computed. If a motion vector is found to be due to a camera operation, it is not taken into account in the total sum of the vectors which yields the activity measure due to object motion only. We must take into account that when an MPEG encoder employs a restricted motion search area, the number of intra-coded macroblocks in predicted frames will increase. Therefore we regard areas with intra-coded macroblocks in predicted frames as an expression of significant variations in the content. Motion vectors are also weighted according to their distance from the centre of the screen, where the probability of meaningful action increases. We can define the resulting sum of the motion vectors of the content as a measure quantifying the activity of the n-th frame of shot i:

$$A_i(n) = \sum_{k=1}^{P} \left( w_k \cdot |m_k| \right) + \alpha \cdot \sum_{i=1}^{I} w_i, \qquad (2)$$
where $m_k$ are the motion vectors of predicted macroblocks not due to camera operations, $I$ is the number of intra-coded macroblocks and $w$ is a weight for the position of the macroblock.
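The following sketch computes an activity measure in the spirit of Eq. (2) under our own simplifying assumptions: camera-induced vectors are taken to be those whose phase lies near the mode of the phase histogram, and the positional weight simply decays with distance from the screen centre; alpha, the bin count and the phase tolerance are all illustrative, not values from the paper.

```python
import numpy as np

def frame_activity(mvs, positions, intra_flags, frame_shape, alpha=1.0):
    """A_i(n): weighted sum of object-motion vector magnitudes plus an
    intra-coded term, Eq. (2)."""
    mvs = np.asarray(mvs, dtype=float)            # (M, 2) motion vectors
    pos = np.asarray(positions, dtype=float)      # (M, 2) macroblock centres
    intra = np.asarray(intra_flags, dtype=bool)   # (M,) intra-coded flags

    # classify camera-induced vectors by closeness to the modal phase
    phases = np.arctan2(mvs[:, 1], mvs[:, 0])
    hist, edges = np.histogram(phases, bins=16)
    modal = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
    camera = np.abs(np.angle(np.exp(1j * (phases - modal)))) < 0.2

    # centre-biased positional weights w
    dist = np.linalg.norm(pos - np.array(frame_shape, dtype=float) / 2.0, axis=1)
    w = 1.0 / (1.0 + dist / (dist.max() if dist.max() > 0 else 1.0))

    keep = ~camera & ~intra                       # object-motion macroblocks
    return (w * np.linalg.norm(mvs, axis=1))[keep].sum() + alpha * w[intra].sum()
```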
3.2 Key Frame Allocation

The cumulative action of the i-th shot can be represented by

$$C_i = \sum_{n=1}^{N_i} A_i(n). \qquad (3)$$
The steepness of the resulting non-decreasing curve $C_i$ is proportional to the activity of the content (see Figure 4). The number of key-frames $K_i$ assigned to the i-th shot is then taken proportional to the cumulative action $C_i$ of the shot [12]. In this manner it is possible to adapt the key-frame extraction rate to the content of the sequence, for example by allocating to each shot, according to its content activity, one part of the total of N key-frames assigned to the whole programme. These parameters can also be adapted for each programme by identifying whether the category of the analysed sequence is a movie, a sport event, news, etc. (e.g. the extraction rate can be limited for commercials, trailers and music video-clips). The following step is the actual selection of the key-frames that represent each shot. To this end we apply the following criterion [13]. Given the cumulative action $C_i(x)$ for the i-th shot, the $K_i$ key-frames are distributed such that the following criterion function is minimised:
g(kl' ""' kK~' tl' ""' tK~-, ) = 2
~ ICi(X)
-Ci(ki)l dX
j = ltj_ !
Figure 4. Action measure Ai(n) along several shots and the corresponding cumulative action function Ci(x)
Figure 5. Distribution of key-frames and breakpoints for a varying cumulative action measure Ci(x )
(4)
where $k_j$ are the temporal positions of the key frames, and $t_{j-1}$ and $t_j$ are the breakpoints between the shot segments represented by the key-frame $k_j$. A recursive search algorithm, described in [13], can be employed to solve this equation. Given the low temporal resolution of the reference frames, a few iterations should be sufficient in most cases. Figure 5 presents the results of such an optimal allocation for an example sequence having stationary content in the first part and increasing activity in the second part. The underlying concept is that, once a given shot is allocated a number of key-frames, a larger number of key frames is extracted where higher activity is measured along the shot.

Figure 4. Action measure A_i(n) along several shots and the corresponding cumulative action function C_i(x).

Figure 5. Distribution of key-frames and breakpoints for a varying cumulative action measure C_i(x).

4. CONCLUSIONS

In this paper an automatic system for abstracting visual content from compressed video sequences has been presented. This has been achieved by identifying a reliable parsing technique, defining a suitable measure for the representation of content and introducing a new approach for key-frame allocation. Through this method it is possible to control the number of key-frames extracted from a video sequence. The extraction is not based on any parameter setting, but is fully automated. The applied numerical algorithm optimally delivers locations for the assigned amount of key-frames in each shot, based on a suitable measure of the activity of the sequence. The analysis of the content supports a high representativity of the selected key-frames. Future work includes clustering of key frames for pyramidal search operations and the exploitation of audio tracks.

REFERENCES
1. ISO/IEC JTC1/SC29: "ISO/IEC 13818, Information Technology - Generic coding of moving pictures and associated audio information," November 1994.
2. F. Arman, R. Depommier, A. Hsu, M.Y. Chiu: "Content-based Browsing of Video Sequences," Proc. ACM Multimedia '94, pp. 97-103.
3. W. Xiong, J.C.M. Lee, M.C. Ip: "Net comparison: a fast method for classifying image sequences," SPIE vol. 2420, 1995, pp. 318-328.
4. I.K. Sethi, N. Patel: "A statistical approach to scene change detection," SPIE vol. 2420, 1995, pp. 329-335.
5. A. Hampapur, R. Jain, T. Weymouth: "Digital Video Segmentation," Proc. ACM Multimedia 1994, pp. 357-364.
6. F. Arman, A. Hsu, M.Y. Chiu: "Image processing on compressed data for large video databases," ACM Multimedia 1993, pp. 267-272.
7. H.J. Zhang, C.Y. Low, S.W. Smoliar: "Video parsing and browsing using compressed data," Multimedia Tools and Applications, Mar. 1995, pp. 89-112.
8. S.F. Chang, D.G. Messerschmitt: "Manipulation and Composition of MC-DCT Compressed Video," IEEE Journal on Selected Areas in Communications, vol. 13, no. 1, January 1995, pp. 1-11.
9. U. Gargi, S. Oswald, D. Kosiba, S. Devadiga, R. Kasturi: "Evaluation of video sequence indexing and hierarchical video indexing," SPIE vol. 2420, 1995, pp. 144-151.
10. B. Yeo, B. Liu: "Rapid Scene Analysis on Compressed Video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 6, December 1995.
11. J.J. Boreczky, L.A. Rowe: "Comparison of Video Shot Boundary Detection Techniques," Proc. of SPIE vol. 2670, 1996, pp. 170-179.
12. A. Hanjalic, R.L. Lagendijk, J. Biemond: "A New Key-frame Allocation Method for Representing Stored Video Streams," Proc. 1st Int. Workshop IDB-MMS, August 1996, pp. 67-74.
13. R.L. Lagendijk, A. Hanjalic, M. Ceccarelli, M. Soletic, E. Persoon: "Visual Search in a SMASH System," Proc. ICIP '96.
Subjective image quality estimation in subband coding: methodology and human visual system application

Zoran Bojković a, Andreja Samčović a, and Branimir Reljin b

a Faculty of Traffic and Transport Engineering, University of Belgrade, Vojvode Stepe 305, 11000 Belgrade, Yugoslavia
b Faculty of Electrical Engineering, University of Belgrade, Bulevar Revolucije 73, 11000 Belgrade, Yugoslavia

This paper examines some subband coding techniques from the point of view of subjective image quality estimation. The human visual system is included, too. Recent trends in subband coding are presented. In order to estimate the image quality, some simulation results are demonstrated. Finally, a mean opinion score quality scale with five grades is recommended.

1. INTRODUCTION

The need for flexible quality of service (QOS) communications can be recognized when considering interworking between terminals with different capabilities [1,2]. The QOS of video communications depends first on the capabilities of the sending terminals, which perform the coding of source images and organize the image data into transmission signals. The QOS perceived at the receiving end will be affected by the network's transmission impairments, referred to as network performance. In principle, the QOS of the communication will be left to the user [3]. Representation of images according to their information content is performed by source coding schemes. Specific image coding schemes depend on the application. The requirements on picture quality and the characteristics of communication channels and storage media have a strong influence on the applied scheme. Subband coding (SBC) methods exploit the nonuniform energy and probability distribution along the frequency axes of an image, as well as the spatial-frequency sensitivity of the human visual system. Perceptual coding, which matches the compression algorithm to the characteristics of human visual perception, has been considered one of the promising solutions for improving image coding efficiency. Better performance and a simpler encoder can be achieved by subband methods than by spatial-domain methods. This fact makes SBC a promising candidate for an industry coding standard. The use of properties of the human visual system (HVS) can lead to a significant reduction of the number of bits needed to encode an image with perceptual
transparency. In what follows, image subband coding techniques will be discussed, focusing on differential pulse code modulation (DPCM) as a method to encode subbands, as well as on subband coding with the discrete cosine transform (DCT). A simple subjective measure for image quality including the HVS will be proposed.

2. HVS PROPERTIES

The characteristics of the HVS should be taken into account in order to effectively exploit visual communication. In SBC, the errors occurring in different subbands lead to distortions of varying perceptibility in the reconstructed images. The perceptual redundancy inherent in images is due to the inconsistency in the sensitivity of the HVS to stimuli of varying levels of contrast and luminance changes in the spatial domain. Human visual perception is sensitive to luminance contrast rather than to absolute luminance values. The ability of the human eye to detect the magnitude difference between an object and its background depends upon the average value of the background luminance. The error visibility due to background luminance in the spatial domain is given in Fig. 1. It can be used to allocate coding bits or distortion, by adjusting the quantizer step size of the target signal to be inversely proportional to the sensitivity of the corresponding frequency.
Figure 1. Visibility threshold vs background luminance.

The noise in dark areas tends to be less perceptible than that occurring in regions of high luminance. The eye is noticeably more sensitive to flicker noise at high luminance than at low luminance. Many psychovisual studies have shown that human perception of distortion depends on its frequency distribution. Current methods to incorporate HVS properties into existing coding schemes are usually based on heuristic methods that are only valid for a specific coding scheme. Subjective evaluations are of two broad types: rating-scale methods and comparison methods. The results of rating depend upon the experience and motivation of the subjects, the range of the picture material used, and the conditions under which the picture is viewed (ambient illumination, contrast ratio and viewing distance).
3. IMAGE SUBBAND CODING TECHNIQUES

SBC is motivated by the idea that the subbands can be coded more efficiently than the entire fullband image. The idea is to divide the frequency band of the signal into a number of subbands using a bank of bandpass filters. Each subband is then lowpass-translated by subsampling. Most earlier references to SBC implicitly assumed that DPCM is used to encode the lowpass band. The encoded signals then pass through a noiseless channel to the receiver. The receiver decodes these source-coded subband signals and resamples them back to their original frequency bands. The signals are then summed to give a close replica of the original signal. Two main advantages have been suggested for SBC. First, the errors in encoding a subband are confined to that particular subband: the quantization noise from the encoder in a subband gets reflected back into the same subband due to the aliasing of the noise, and hence it does not mask the weaker signal in another subband. Second, by varying the rate assignment among the subbands, the noise spectrum may be shaped according to some perceptual criterion. Moreover, each subband can be encoded using a separate encoder which is closely matched to the requirements of that band. In keeping with this, various subband coding schemes have been studied extensively and developed for video coding applications.

DPCM is a simple and popular predictive coding method [4]. It exploits the property that the values of adjacent pixels in an image are often similar and highly correlated. The use of DPCM as a method to encode the subbands is motivated by the increased efficiency of a predictive encoder for a nonwhite power spectral density [5]. This is a good technique for a small number of subbands, since each subband will still have significant pixel-to-pixel correlation. The advantage of such a coding scheme is that the quantization noise generated in a particular band is limited largely to that band in reconstruction, and is not allowed to spread to other bands. In general, the subbands do not have flat-topped power spectra and therefore are not memoryless. One way to take advantage of this memory and reduce the encoding rate is to encode the difference between a source sample and a linear prediction of this sample from past reconstructed sample values. However, the prediction coefficients are optimized for the actual past values of the samples, which are not available to the decoder. This type of prediction is therefore suboptimal and approaches optimality only in the limit of a large encoding rate, when the mean square error (MSE) between the actual and reconstructed samples approaches zero.

In the case of the DCT, an image is divided into blocks of fixed size (typically 8 x 8 pixels). Each block is then transformed into a frequency space. The transform most commonly used for this operation is the two-dimensional (2-D) DCT. The DCT transforms a block of image data into the same number of coefficients, where each coefficient is assigned a basis image with a different frequency content. Coefficients with a high index number are assigned to basis images with high frequencies. The properties of the DCT are [6]:
- the coefficients are highly decorrelated, and
- the low-frequency coefficients carry most of the information content.
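These two properties can be illustrated numerically. The sketch below, our own illustration using SciPy's DCT with an assumed index-dependent quantizer step, transforms a smooth 8x8 block, quantizes high-index coefficients more coarsely, and reconstructs the block.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(b):  return dct(dct(b, axis=0, norm='ortho'), axis=1, norm='ortho')
def idct2(b): return idct(idct(b, axis=0, norm='ortho'), axis=1, norm='ortho')

# a smooth ramp block, typical of natural image content
block = 50.0 + 10.0 * np.add.outer(np.arange(8.0), np.arange(8.0))
coeff = dct2(block)

# quantizer step grows with the coefficient index (coarser at high frequency)
i, j = np.indices((8, 8))
step = 4.0 + 4.0 * (i + j)
recon = idct2(np.round(coeff / step) * step)

print('energy in the 4x4 low-frequency corner: %.2f%%'
      % (100.0 * (coeff[:4, :4] ** 2).sum() / (coeff ** 2).sum()))
print('reconstruction MSE: %.2f' % ((recon - block) ** 2).mean())
```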
These properties support the reduction of both redundant and irrelevant information. Because the sensitivity of the human eye is reduced for high spatial frequencies, transform coefficients with a high index number can be quantized more coarsely. A specified average transmission bit rate can be adjusted by the quantizer characteristic, but this will also affect the picture quality. In practice, the transform is applied independently to non-overlapping subblocks of the image, and the resulting subband signals contain severe amounts of aliasing. Since the transform is orthogonal, the subband aliasing is cancelled in the synthesis stage. If the transform coefficients are quantized or discarded, the aliasing no longer cancels, and the errors appear as block edge artifacts in the reconstructed image.

4. SIMULATION RESULTS

Despite its popularity as a distortion measure, the MSE can be a poor indicator of the subjective quality of a reconstruction; perceptually based criteria may be more appropriate. One example is the mean opinion score (MOS). A number of subjects view an image and rate its quality on a five-point scale of bad, poor, fair, good or excellent. The MOS is simply the average rating assigned by all the subjects. In order to estimate the image quality obtained by the SBC techniques considered, the images after the process of reconstruction, shown in Fig. 2, are compared. The reconstructed images were obtained using computer simulations (PC-486). Fig. 2a represents the original test image "Lena", while Fig. 2b belongs to the image reconstructed with the SBC-DPCM technique. The image reconstructed with the SBC-DCT technique is shown in Fig. 2c. The application of the DCT at low bit rates causes a visible block effect on the coded image, which disturbs the user. On the other hand, when one combines the DCT with SBC, the block noise disappears, while the texture reproduction becomes better. However, SBC causes greater granular noise in the image. In order to subjectively estimate the quality of the obtained images, we used the MOS quality scale with five grades from 1 to 5, recommended by ITU-R Rec. 500. Images with the same bit allocation are compared. An inquiry with 10 independent observers was made. They gave their subjective estimation, having the choice from bad (mark 1), poor (2), fair (3) and good (4) to excellent (5) on the quality scale. We evaluated three coding methods: SBC-DPCM, SBC-DCT and DCT. Each observer had to give each reconstructed image a mark between 1 and 5. The average observers' marks on the subjective quality scale are shown in Table 1.
Table 1 Subjective (average mark) and objective (MSE) quality parameters related to the tested images in Fig. 2

Figure | Technique | Average mark | MSE
2.b    | SBC-DPCM  | 3.78         | 169
2.c    | SBC-DCT   | 2.67         | 213
-      | DCT       | 1.78         | 127
Figure 2. Experimental results for the "Lena" test image: a) the original test image "Lena"; b) the image reconstructed with the SBC-DPCM technique; c) the image reconstructed with the SBC-DCT technique.
The results are obtained by averaging the opinions of the observers. However, coding algorithm optimization cannot be based only on a subjective test. The test results also depend on the observers' motivation, the luminance and image contrast, and the viewing distance. Namely, human visual perception is sensitive to luminance contrast rather than to absolute luminance values.
5. CONCLUSION

Subjective quality estimation in subband image coding techniques such as SBC-DPCM and SBC-DCT has been discussed. Average observers' marks of 3.78 and 2.67 were obtained for the reconstructed SBC-DPCM and SBC-DCT images, respectively. Some properties of the human visual system were taken into account from the observer's point of view. The worst mark, 1.78 points, was obtained for the image reconstructed with plain DCT at low bit rates. On the other hand, the MSE is the largest for the SBC-DCT coded image and the smallest in the case of the DCT.

REFERENCES
1. D. Scott, M. Biggar, D. Dorman: "Getting the picture - integrated video services in BISDN", Telecom. J. Australia, Vol. 40, No. 2, Feb. 1990.
2. J. Ellershaw, M. Biggar: "Network impact on interworking packet video systems", Proc. Packet Video '90, Morristown, Paper A-2, March 1990.
3. K. Yamazaki, M. Wada, Y. Takishima, Y. Wakahara: "ATM networking and video coding techniques for QOS control in BISDN", IEEE Trans. on CAS for Video Technology, Vol. 3, No. 3, June 1993, pp. 175-181.
4. A.K. Jain: "Fundamentals of Digital Image Processing", Englewood Cliffs, NJ: Prentice-Hall, 1989.
5. J. Woods, S. O'Neil: "Subband coding of images", IEEE Trans. on ASSP, Vol. 34, No. 5, 1986, pp. 1278-1288.
6. K.R. Rao, P. Yip: "Discrete Cosine Transform: Algorithms, Advantages, Applications", Academic Press, Inc., 1990.
E. REMOTE SENSING DATA AND IMAGE PROCESSING
Neural Networks for multi-temporal and multi-sensor data fusion in land cover classification

Alessandra Chiuderi

Space Applications Institute, Agriculture Information Systems, Joint Research Centre - 21020 Ispra (Varese), Italy
In this paper the use of multi-sensor and multi-temporal data for land cover classification is investigated. In particular, a multilayer Neural Network (NN) trained by means of the Back Propagation algorithm is employed for classification experiments on remotely sensed images. The data set employed is composed of two coregistered images of the agricultural area surrounding the city of Valladolid (Spain), acquired on two different dates, June and July, by two different satellites, SPOT and LANDSAT respectively; ground truth data was acquired during an in situ campaign carried out by the Spanish Ministry of Agriculture in 1993. For the experiments presented here, three different data sets were employed: i) SPOT data, ii) LANDSAT data and iii) (SPOT + LANDSAT) data; in cases ii) and iii) the LANDSAT image was resampled in order to have the same resolution as the SPOT image. The use of this integrated data set by means of a Neural Network is investigated, and the results of land cover classification on the three different data sets outlined above are compared and discussed.
1. INTRODUCTION

Remote sensing (RS) is defined as the science of acquiring information on a given object without any physical contact with the object itself [1]. Even if this extremely wide definition includes every kind of means that allows long-distance information acquisition, in the present context we shall be concerned only with remotely sensed images acquired by space satellites. As a matter of fact, RS constitutes an extremely interesting application and research field as far as image processing is concerned: first of all, we are asked to deal with real data, as opposed to ad hoc images acquired for the purpose of checking the performance of a given algorithm; secondly, the amount of data available is enormous, allowing us to test our algorithms in different situations; thirdly, there is a growing interest in RS image processing as it can be considered one of the most powerful tools for Earth observation and change monitoring; and, last but not least, the amount of data concerning the same area acquired by different sensors, together with future developments in sensor technology, makes it mandatory to develop techniques which can both deal with different sources and select, among all available data, the ones carrying more information for a given task [2]. The work presented here has been carried out within the MARS project [3] of the Joint Research Centre of the European Commission, one of the widest projects in terms of RS applications. The aim of the MARS (Monitoring Agriculture with Remote Sensing) project is to provide decision support to the Commission as far as agricultural policies are concerned. An important section of the project (Action 4) is concerned with rapid acreage estimation of the principal European cultures: the overall idea is to acquire several images (up to four) of selected sites (60 in all of Europe) throughout the growing season and to compute field acreage through the automatic classification of each site. The computed figures are then passed through a statistical module which allows estimation of the extension of each culture over all of Europe. Within this context, LANDSAT and SPOT images of the 60 selected sites, together with ground surveys, are available at the Joint Research Centre of Ispra.
2. INTEGRATING DIFFERENT DATA SETS

The acquisition of different images throughout the growing season allows crop monitoring and early estimation of land cover type, thereby increasing class separability. As a matter of fact, different cultures usually have different growing cycles, and the comparison between two successive images allows the identification of different crops that are indistinguishable within a single image. It could be believed that class separability increases as the season advances, and that therefore earlier images will always lead to less accurate classifications; this is generally true, but not always. In a mid-summer image, for instance, permanent cultures such as pastures or forages may be confused with maize, but if an earlier image of the same site is available, a simple comparison of the two should allow class separability, as maize has not appeared yet (the field would give a response similar to bare soil) while forages will instead have the typical strong reflectance in the mid-infrared. In this paper two images of Valladolid (Spain), acquired by two different satellites, SPOT and LANDSAT, at the beginning of June and in mid-July respectively, are employed. The advantages offered by this data set are the following: SPOT, having 20 m spatial resolution, is more suitable for agriculture monitoring in a country such as Spain, whose landscape is characterised (like most of Europe) by small fields; LANDSAT, on the contrary, despite the coarser spatial resolution (30 m), acquires data also in the mid-infrared region of the electromagnetic spectrum, which is particularly important for vegetation response.
3. NEURAL NETWORKS IN REMOTE SENSING

The use of Neural Networks (NNs) in Remote Sensing image processing is not new: starting from the late eighties, several authors have employed this technique as a useful and suitable processing tool, in particular for image classification, as illustrated in the interesting review paper [4]. A more detailed background can be found in [5], [6]. Statistical algorithms usually employed for image classification assume a Gaussian distribution for the input data. When dealing with Remote Sensing data on one side and ground truth data on the other, this assumption is rarely true: very often, due to imperfect class separability, training data show a multimodal distribution which causes a loss of accuracy in the classification phase. Moreover, if data is acquired by different sensors and at different dates, the hypothesis of a unimodal Gaussian distribution becomes even more restrictive.
NNs, on the contrary, thanks to their "learning from examples" strategy, overcome this problem and actually take advantage of all the available sources of information. Input patterns in these cases are obtained by simply concatenating, for each image pixel, the spectral values acquired by the first sensor with the values acquired by the second sensor, as sketched below. Training can consequently be performed on this new integrated data set. In this study, the same area was classified by employing SPOT data, LANDSAT data and the (SPOT+LANDSAT) data set. The results reported in section 5 show how the integration of the two data sets increased the classification accuracy.
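A minimal sketch of this per-pixel concatenation follows; the array names and shapes (3 SPOT bands and 4 resampled LANDSAT bands on the same 2000 x 2000 grid) are illustrative stand-ins for the real coregistered images.

```python
import numpy as np

# spot: (rows, cols, 3) bands; landsat: (rows, cols, 4) bands, resampled
# to the SPOT grid so that pixels correspond (placeholders for real data).
rows, cols = 2000, 2000
spot = np.zeros((rows, cols, 3), dtype=np.float32)
landsat = np.zeros((rows, cols, 4), dtype=np.float32)

fused = np.concatenate([spot, landsat], axis=-1)   # (rows, cols, 7)
patterns = fused.reshape(-1, 7)                    # one 7-D pattern per pixel
```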
4. THE DATA

The SPOT image was acquired on June 1st, 1993; it has a ground resolution of 20 m and three bands of sensitivity: (0.50-0.59 μm), (0.61-0.68 μm) and (0.79-0.89 μm). The image was geocorrected by means of the GRIPS software [7]. A subscene of 2000 lines and 2000 pixels was employed for these experiments. The LANDSAT TM image was acquired on July 17th, 1993; the raw data has a ground resolution of 30 m and 7 bands, ranging from the visible to the near- and mid-infrared. This image was coregistered to the SPOT image and resampled in order to have the same ground resolution. A subset of 2000 lines and 2000 pixels was extracted for bands 2 (0.52-0.60 μm), 3 (0.63-0.69 μm), 4 (0.76-0.90 μm) and 5 (1.55-1.75 μm). Ground truth data was provided by the Spanish Ministry of Agriculture. The coregistration of such data to the image and the extraction of the pixel values were carried out at the JRC laboratories. The data set employed for the three sets of experiments reported in the next section was composed of 27080 pixels representing 13 different land cover classes; 18055 were used to train the NN, whereas 9025 were employed exclusively to evaluate the results obtained, summarised in Table 1.
5. EXPERIMENTS AND RESULTS
Three sets of experiments were performed on the available data: in each experiment, data extracted from the satellite image was used to train a NN by means of the Error Back Propagation algorithm [5]; the network was therefore constituted by a variable number of input neurons (according to the data set employed) and 13 output neurons, with a variable number of hidden nodes arranged into 1 or 2 layers. Establishing the best architecture for a given task is not easy, as different architectures usually give different performances in terms of single per-class accuracy; the results reported here therefore refer to the "best" architecture in terms of average omission precision.

5.1 SPOT data

The best performance was obtained by a 3-layer NN with 3 input nodes and 22 hidden nodes, reaching an overall classification accuracy of 71.75%, as reported in Table 1; a sketch of such a network is given below. In comparison with the experiments reported in sections 5.2 and 5.3, network training for SPOT data took longer than for any other data set, requiring 800 iterations before obtaining the results reported here. On the contrary, for LANDSAT data and (SPOT+LANDSAT) data, training was interrupted after 200 iterations as the overall accuracy was already satisfactory.
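In the sketch, scikit-learn's MLPClassifier stands in for the original back-propagation implementation; the random training data, the logistic activation and the SGD settings are assumptions of ours, with only the layer sizes (3 inputs, 22 hidden nodes, 13 classes), the training-set size and the 800 iterations taken from the text.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 255, (18055, 3))   # SPOT band triplets (placeholder)
y_train = rng.integers(1, 14, 18055)        # 13 land-cover class labels

net = MLPClassifier(hidden_layer_sizes=(22,), activation='logistic',
                    solver='sgd', learning_rate_init=0.01, max_iter=800)
net.fit(X_train, y_train)

X_test = rng.uniform(0, 255, (9025, 3))     # held-out test pixels
labels = net.predict(X_test)
```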
Table 1 Classification accuracies (%) for the three data sets employed (test set only)

Class          | #Pixel | SPOT Omis. | SPOT Comm. | LANDSAT Omis. | LANDSAT Comm. | SPOT+LANDSAT Omis. | SPOT+LANDSAT Comm.
1 Cereals      | 4273   | 94.80      | 87.67      | 95.18         | 86.88         | 97.67              | 97.13
2 Sunflower    | 1092   | 79.12      | 50.79      | 80.04         | 70.37         | 92.22              | 86.30
3 Potatoes     | 35     | 17.14      | 54.55      | 74.29         | 96.30         | 81.25              | 83.87
4 Sugar Beet   | 59     | 6.78       | 100.0      | 84.75         | 87.72         | 86.36              | 91.94
5 Forage       | 199    | 12.06      | 43.64      | 42.21         | 71.19         | 60.80              | 76.10
6 Set aside    | 703    | 36.56      | 55.63      | 83.07         | 95.27         | 93.08              | 88.74
7 Permanent    | 140    | 4.29       | 31.58      | 37.14         | 50.00         | 61.31              | 50.30
8 Woods        | 921    | 68.08      | 82.39      | 92.73         | 67.24         | 75.85              | 91.04
9 Built        | 831    | 46.09      | 48.30      | 58.97         | 80.00         | 74.31              | 83.40
10 Lands       | 592    | 27.87      | 34.96      | 24.83         | 65.92         | 72.27              | 57.03
11 Dry Pulses  | 79     | 75.95      | 75.95      | 29.11         | 53.49         | 79.01              | 98.46
12 Other Cer.  | 90     | 18.89      | 47.22      | 25.56         | 85.19         | 65.52              | 85.07
13 Water       | 11     | 100.0      | 100.0      | 81.82         | 100.0         | 100.0              | 100.0
Overall        | 9025   | 71.75      |            | 80.70         |               | 88.66              |
Figure 1 Scatter plot of classes 2 and 4 for bands 1 and 2 (SPOT data)
151 It must also be said that, due to climatic conditions, sunflowers at beginning of June in southern Europe are usually not very blooming, giving therefore a quite confused signal (baresoil, spontaneous vegetation, and sunflowers) which, combined with the high number of samples (1092), can lead to high commission errors, as far as class 2 is considered and low omission precision for all other classes. The scatter plot reported in figure 1 shows the distribution of classes 2 and 4 in the feature space described by bands 1 and 3, and it can be clearly seen that all pixels belonging to class 4 Oust 59!) fall within the area covered by class 2, leading to an extremely poor separability.
5.2 LANDSAT data Among all experimented architectures, the best results in terms of average omission precision were obtained by a three layered network having 25 hidden nodes. Not surprisingly, the results obtained on this data set are much more accurate than the previous ones mainly due to the information concerning the mid-infrared region of the electromagnetic spectrum. In particular, as far as classes 4 and 7 are concerned, it is evident that in this case the network was able to separate them from class 2 which, moreover, shows a higher commission precision. For class 11, on the contrary, the network trained on SPOT data gave substantially better results. 5.3 SPOT+LANDSAT data Finally, the simulations on the multi source and multi temporal data set are reported in columns 7 and 8 of Table 1. As expected this data set gives the higher class separability, all classes having an omission accuracy over 60% and an overall accuracy of 88.66%. The advantages offered by this third data set are mainly due to the time gap between the two images rather than to the increased amount of spectral information, as SPOT bands 1, 2 and 3 and LANDSAT bands 2 3 4 refer roughly to the same portions of the spectrum. All classes show an accuracy improvement with respect both to sections 5.1 and 5.2, except class 8 (Woods) which lost nearly 17%; unfortunately for this class, in the integration of the two data sets, part of the SPOT data mis-classification was propagated, leading to the reduced separability between class 8 (Wood) and class 10 (Land), as only 691 pixels out of 831 were properly classified whereas the remaining 138 were assigned to class 10.
6. CONCLUSIONS In this paper, the use of a multi-layer neural network for land cover classification purposes has been investigated. In particular, two data sets have been selected for the experiments reported here, all data referring to the same area, the agricultural region surrounding the city of Valladolid (Spain). The two data sets were obtained by extracting the pixel values from two different remotely sensed images, a SPOT image, acquired on June the first and a LANDSAT image acquired in July 17 respectively. In section 5 three different experiments were reported: SPOT data was employed in the first classification, LANDSAT data was used for the second experiment and the integrated data set (SPOT+LANDSAT) constituted the input data set for the third simulation. The purpose of this paper was to evaluate the contribution of data integration as far as classification accuracy is concerned, and therefore compare the results obtained on the three data sets outlined above.
It is common opinion that neural networks represent a suitable tool for classification problems, especially when the mathematical modelling of the input data is difficult: as a matter of fact, such techniques, not requiring any hypothesis on the data distribution, are particularly useful in applications like the one presented here. The results reported in the previous section highlight the importance of data integration and, in particular, of the use of multi-temporal data: the classification based on SPOT led to very poor results, both in terms of overall accuracy (71.75%) and in terms of average omission and commission precision, 45.2% and 62.51% respectively; the experiments carried out on LANDSAT data showed an overall accuracy of 80.7%, an omission precision of 62.28% and a commission precision of 77.67%; the (SPOT+LANDSAT) data set, on the contrary, scored an overall accuracy of 88.66% and omission and commission accuracies of 79.97% and 83.79% respectively. This dramatic difference cannot be due just to the higher number of components of the third data set. As a matter of fact, in [8] it has been shown that high correlation between input channels decreases classification accuracy, and SPOT bands 1, 2 and 3 overlap with LANDSAT bands 2, 3 and 4; therefore the increase in accuracy between the results on LANDSAT and on (SPOT+LANDSAT) should definitely be due to the difference between the acquisition dates of the two images.

ACKNOWLEDGEMENTS

The author would like to thank Ioannis Kanellopoulos (EMAP-JRC) for providing the neural network software package and Javier Gallego (AIS-JRC) for the assistance during the coregistration of the ground truth data to the satellite images.

REFERENCES
1. Manual of Remote Sensing, 1983, R.N. Colwell (Ed.), American Society of Photogrammetry, Falls Church, Va.
2. Wilkinson, A. Chiuderi: "Il telerilevamento alla fine del ventesimo secolo: una nuova sfida nel campo dell'informatica", Proc. of the workshop Il telerilevamento ed i sistemi informativi territoriali nella gestione delle risorse ambientali, Trento, October 27, 1994, published by the Office for Official Publications of the European Communities, Luxembourg.
3. J. Gallego, J. Delincé, C. Rueda: "Crop area estimates through remote sensing: stability of the regression correction", Int. J. Remote Sensing, 1993, Vol. 14, N. 18, pp. 3433-3445.
4. Paola, R.A. Schowengerdt: "A review and analysis of backpropagation neural networks for classification of remotely-sensed multi-spectral imagery", Int. J. Remote Sensing, 1995, Vol. 16, N. 16, pp. 3033-3058.
5. Rumelhart, G.E. Hinton, R.J. Williams: "Learning internal representations by error propagation", in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, J.L. McClelland & D.E. Rumelhart (eds.), MIT Press, 1986, pp. 318-362.
6. Wasserman: Neural Computing, Theory and Practice, Van Nostrand Reinhold, 1989, NY.
7. Casteras, G. Doyon, E. Martin, V. Rodriguez: "Corrections Geometriques et Atmospheriques - Manuel Utilisateur", CISI-Geo Design, CCR (Ispra), RSO/DOT-CGA-MU, Ed. 2 (1994).
8. Chiuderi, A.: "Improving the Counterpropagation network performances", Neural Processing Letters, 1995, Vol. 2, N. 2, pp. 27-30.
Influence of Quantization Errors on SST Computation Based on AVHRR Images

Pier Franco Pellegrini, Francesco Leoncino, Enrico Piazza (member, IEEE), Margherita Di Vaia

Dipartimento di Ingegneria Elettronica, University of Florence, Via di S. Marta 3, 50139 Firenze, Italy
Tel. +39-55-4796267, Fax +39-55-494569, E-mail: labtol@ingfil.ing.unifi.it

1. INTRODUCTION

The estimation of sea surface temperature (SST) from satellite is performed by means of multichannel algorithms applied to the infrared channels of AVHRR-NOAA, or by using radiative transfer models and radiosounding profiles of air temperature and humidity. At the PIN-Prato Ingegneria, in connection with the Electronic Engineering Dept. of the University of Florence, a primary receiving station for Meteosat PDUS, NOAA AVHRR HRPT and Meteosat WEFAX images is operational. Images from other sensors are collected by means of computer networks. Measurements of the sea surface temperature variation were performed using data from this station, in particular from the NOAA polar satellites. The NOAA satellites, as is well known, orbit the Earth in a near-polar orbit at an altitude of 833 km, 14 times per day. The AVHRR instrument scans a 2400-km wide swath as the satellite passes. Thus every point on the Earth's surface is viewed at least about once per day (the exact frequency varies with latitude, with normally 4 passes per day at European latitudes).
Fig. 1 - NOAA AVHRR scan geometry (2400-km swath width).
The thermal infrared channels of the Advanced Very High Resolution Radiometer, AVHRR (channel 4: 10.8 μm; channel 5: 11.9 μm), have been used successfully to measure SST for well over a decade. Using several images taken at different times, it is possible to follow the temporal evolution of several marine environmental and climate parameters influenced by the surface temperature, such as water currents or biological activity. In order to use these data in the best way, it is necessary to have good and reliable retrieval algorithms. The importance of SST is tightly connected to the evaluation of variations in the climate characteristics, because it controls the variation of the latent heat in the lower layers of the atmosphere and thus the energy transfer from Earth to space.

The sea surface temperature computation based on the AVHRR depends on several factors that can lead to results affected by errors. In the literature it is possible to find algorithms that allow the correction of effects such as the influence of atmospheric water vapour and sunlight reflection [1], [2]. Typically, the SST computation and water vapour correction algorithms consist of a linear relation between the SST and the brightness temperatures of AVHRR channels 4 and 5:

$$\mathrm{SST} = a \cdot T_4 + \gamma \cdot (T_4 - T_5) + c.$$

The coefficients of these relations are derived from theoretical considerations that take into account the frequency responses of the different sensors and the atmospheric profiles; sometimes they are computed by regression on true data [2], [5]. This work points out the influence of quantization on the brightness temperature computation. Such quantization is due to two different causes: the intrinsic AVHRR data quantization (10 bits/sample) and the chosen precision of the calibration procedure, that is, the conversion from raw data to brightness temperature data for each channel. The spectral characteristics of channels 4 and 5 of the AVHRR radiometer supply very close brightness temperatures and, because the difference term $(T_4 - T_5)$ is very relevant in the algorithm while possibly being very close to zero, the SST may be affected by cancellation errors. The main effect of this error is a spatial distribution of the SST field with some "oscillation" instead of a regular slope. Such oscillations are more evident when a lower precision is used (that is, a larger quantization interval) and become less evident when a high precision is used, reducing to a sort of granularity in the SST image. This granularity depends on the coefficient $\gamma$ weighting the $(T_4 - T_5)$ term in the relation used to derive the SST. Some comparative results concerning the application of these improvements to AVHRR images are outlined here.
[Fig. 2 - Quantization error sources: satellite counts (10-bit quantization) → receiver → calibration → counts (10/8 bit) → brightness temperature (10 bit or less) → parameter extraction (e.g. Sea Surface Temperature) → temperature (10 bit or less)]
This work mainly deals with the evaluation of the precision to be chosen in the calibration procedure and with the opportunity of introducing some improvements in the SST computation algorithms so as to lower the quantization errors.
2. EXPERIMENTAL EVALUATION OF THE INFLUENCE OF QUANTIZATION ERRORS ON THE COMPUTATION OF THE SST
With reference to the MCSST (Multichannel Sea Surface Temperature), four sets of coefficients have been experimented. They are reported in Tab. 1 and are valid for an average latitude and during Summer.

a = 1.002   γ = 1.933   c = -0.32   Minnett (1990)
a = 1.037   γ = 1.157   c = -9.28   Emery et al. (1994)
a = 1.000   γ = 2.520   c = 0.14    Sobrino et al. (1995)
a = 1.000   γ = 2.122   c = 0.00    Yu and Barton (1995)

Tab. 1 - Four sets of coefficients for MCSST experimented here. They are valid for an average latitude and during Summer.

In this work some NOAA AVHRR images have been used, with focus on southern Europe and the Mediterranean Basin. To obtain a reliable value of SST the selected images had to be relatively free of clouds.
Fig. 3 shows the differences between the various algorithms: it reports on the vertical axis the Sea Surface Temperature in Celsius degrees and on the horizontal axis the pixel number inside a portion of an AVHRR scan line. This portion lay entirely over the Adriatic Sea. It can be seen that all the applied algorithms have a similar behaviour and give results matching the real sea temperature quite well.
[Fig. 3 - Comparison between different SST algorithms on actual data ("em-0act.plt", "yu-0act.plt", "so-0act.plt", "mi-0act.plt"): Sea Surface Temperature versus pixel number inside a portion of an AVHRR scan line]
Approximation errors in the calibration procedure produce a sort of granularity in the obtained SST maps. This effect, due to gaps in the possible values of the computed SST, is quite evident. Since a differential term, (T4 - T5), is always present in the algorithms, and since T4 is quite near to T5, the differential term is often affected by a remarkable cancellation error, weighted by the γ coefficient. This cancellation error is responsible for a worsening of the computation precision of the SST compared to the one obtainable for the brightness temperatures in the 4th and 5th channels of AVHRR. It follows that the algorithms that give less weight to the differential term are less subject to this kind of error and thus give smoother SST maps. The best one is then the one proposed by Emery et al. [2]. Fig. 4 shows the worsening of the results due to the cutting of the least significant bits of the calibrated data, so that they are coded with ten bits (-0 bits), nine bits (-1 bit) and eight bits (-2 bits). It can be seen that the SST computed on eight-bit calibrated data is meaningless, so that it is mandatory to use ten-bit coding in all the steps of AVHRR processing.
[Fig. 4 - Comparison between actual data and truncated ones: the worsening of the resulting SST due to the cutting of the least significant bits of the calibrated data, coded with ten bits (-0 bits), nine bits (-1 bit) and eight bits (-2 bits)]
Attempts to smooth the obtained SST values by low-pass filtering have been carried out but, as can be seen in Fig. 5, the filtered values (obtained with Emery's coefficients applied to the eight-bit data passed through a mean filter) do not match the ones computed on the ten-bit data.
[Fig. 5 - Comparison between actual data and averaged truncated ones: the SST obtained with Emery's formula on ten-bit calibrated data versus the filtered SST values obtained from the eight-bit data]
3. CONCLUSIONS
The present work made it possible to estimate the effectiveness of the algorithms for the SST computation and to point out their limits with reference to the errors coming from the precision of the calibrated data. In particular, it pointed out the influence of the cancellation error due to the differential term present in the formulas used, a term which is responsible for a worsening of the computation precision of the SST compared to the one obtainable for the brightness temperatures in the 4th and 5th channels of AVHRR. It follows that the algorithms that give less weight to the differential term are less subject to this kind of error and thus give smoother SST maps. SST computed on eight-bit calibrated data is meaningless, so that it is mandatory to use ten-bit coding in all the steps of AVHRR processing.
REFERENCES
1. Ian J. Barton, "Satellite-derived sea surface temperatures: Current status", Journal of Geophysical Research, May 1995
2. Emery W.J., Yu Y., Wick G.A., Schluessel P., Reynolds R.W., "Correcting infrared satellite estimates of sea surface temperature for atmospheric water vapor attenuation", J. Geophys. Res., 99, 1994
3. McMillin L.M., "Estimation of sea surface temperatures from two infrared window measurements with different absorption", J. Geophys. Res., 80, 1975
4. McMillin L.S., Crosby D.S., "Theory and validation of the multiple window sea surface temperature technique", J. Geophys. Res., 89, 1984
5. Minnett P., "The regional optimization of infrared measurements of sea surface temperature from space", J. Geophys. Res., 95, 1990
6. Sobrino J.A., Li Z.L., Stoll M.P., Becker F., "Multichannel and multi-angle algorithms for estimating satellite sea surface temperatures", IEEE Trans. Geosci. Remote Sens., 31, 1994
7. Wick G.A., Emery W.J., Schluessel P., "A comprehensive comparison between satellite-measured skin and multichannel sea surface temperature", J. Geophys. Res., 97, 1992
STUDY OF ECOLOGICAL CONDITION BASED UPON THE REMOTE SENSING DATA AND GIS - The Neural Network Approach
M. ZHANG*, J. BOGAERT** & I. IMPENS**
1. INTRODUCTION
Information about habitats (e.g. land surface cover) via Geographical Information Systems (GIS) is very useful for various applications such as natural resource management or ecological condition studies. Habitat features hypothesized to influence species distribution patterns can be mapped and analyzed in relation both to individual species and to species richness distribution patterns. These approaches produce maps derived both from field data and from satellite imagery. In this study an effort has been made to use a multi-output Radial Basis Function Neural Network (RBF-NN) as an image processing tool to extract more information from a satellite (TM) image. The results obtained from the RBF-NN are used as the input to an automated GIS in order to get higher accuracy.
2. AN OVERVIEW OF THE NEURAL NETWORK
An Artificial Neural Network (ANN) is a network of many simple processors, called nodes, each possibly having a (small amount of) local memory. The units are connected by unidirectional communication channels ("connections"), which carry numeric (as opposed to symbolic) data. The units operate only on their local data and on the inputs they receive via the connections. A Neural Network is a structure NN = ⟨N, E, ω⟩, where a node in the set N is called a neuron, which is a basic calculation unit, and each edge in E connects nodes with a weight ω. The basic architecture of a NN is illustrated in Figure 1. The basic processing of a NN consists of two steps: first a training step, in which the selected input and output samples are used to train the network by adjusting the weight of each edge; then a working step, which takes as input the realistic problem specified by the user and outputs the results. The main algorithm is the following:
1. Initialize the data
2. Read the input and output data I and O
3. For each node, calculate its output O_n
4. If ||O - O_n|| < ε, go to 7
5. Modify the weights ω
6. Go to 3
7. Save the weights ω
8. Read the real problem
9. For each node, calculate its output O_n
10. Output the result O_n
11. Stop
[Figure 1: Basic Neuron Model]
2.1 Pre-processing
The remotely sensed images consist of the seven TM bands, and each pixel is described by a vector I: i_k (0 ≤ i_k ≤ 255, k = 1, 2, ..., 7). There are 256^7 possible states for each pixel; the classification would therefore require a huge number of hidden nodes. Hence a pre-processing step is used to reduce five TM bands into one band by a False Colour Composite (FCC), based on the availability of the TM data in the study area. As a result, the number of states of a pixel is diminished to 256. The input of the NN then contains one band, which makes the NN more efficient. However, we notice that this may increase the number of ambiguous pixel types, since information is lost when the FCC conversion is used, and the NN may not make the distinction between those types.
* Contact address: 287 Parliament Str., Toronto, ONT, M5A 2Z6, Canada, E-mail: [email protected]
** Lab of Plant Ecology, Biology Department, University of Antwerpen (UIA), Universiteitsplein 1, B-2610 Antwerpen (Wilrijk), Belgium, E-mail: [email protected]
A post-processing step is presented to classify those ambiguous types. Thus the pre-processing reduces the input data from five TM bands to one band by using the FCC conversion, which is a function f(·) = i mapping from R³ to R.
2.2 Neural Network Model
Let n_i and n_o be the dimensions of the input and output respectively, and n_h the number of hidden nodes. The hidden nodes are specified by

$x_j = \varphi(\|I - m_j\|, \delta_j), \quad 1 \le j \le n_h$   (1)

where $I \in R^{n_i}$ is the input, the $m_j \in R^{n_i}$ are the RBF centres, the $\delta_j$ are real positive scalars known as the widths, and $x_j$ is referred to as the non-linearity of hidden node j. $\|\cdot\|$ is the Euclidean norm, and $\varphi(\cdot, \delta)$ is a non-linear function from $R^+$ to $R$ defined by

$\varphi(r, \delta) = \exp(-r/\delta^2)$   (2)

where $r = \|I - m_j\|$. Each output node is a linear combination defined by the following equation:

$y_i = \sum_{j=1}^{n_h} \omega_{ij} x_j, \quad i = 1, 2, \ldots, n_o$   (3)

where the $\omega_{ij}$ are the weights delivered by the learning step. The goal of the learning step is the determination of the weights, undertaken by an adjustment of the connection weights so that the overall error is minimized. The learning process is defined by the following difference equation, with an arbitrary initial state of $\omega_{ij}$:

$\omega_{ij}(t+1) = \omega_{ij}(t) - \lambda x_j$   (4)

where the constant $\lambda$ is the learning rate controlling the training process. The value of $\lambda$ should be set properly to prevent oscillation. The learning process goes on until the termination condition is met:

$|\hat{y}_i - y_i| < \varepsilon \quad \forall i$   (5)

where $\hat{y}_i$ is the desired output and $\varepsilon$ is a very small constant, e.g. 0.001. As the result of learning, the weights $\omega_{ij}$ are defined and fixed as a classifier for the test data. They can be used to classify the whole image if the result of the test meets our requirements.
2.3 Post-processing
Post-processing classifies the ambiguous types by using extra information such as environmental data, and is described as follows:

$\hat{Y}_i = g(y_i, H)$   (6)

where $y_i$ is the output of the NN and H is extra information from the GIS. This step can be considered as a simple expert system which links the ANN with the GIS.
3. METHODOLOGY
The overall objective of the satellite image classification procedure is to automatically categorise all pixels of an image into land cover classes. The multispectral TM data were used to perform the classification, and the spectral pattern present within the data for each pixel is used as the numerical basis for categorisation. In this study, the objects are land cover pixels, and their features are the values of their spectral intensities. The classification can be viewed as a mapping from the feature space to the category space, performed by using the multi-output radial basis function (RBF) network. The classification can be divided into two stages. The first stage is the training procedure, which uses the data of the training sample set to determine the weights of the network. When the training procedure is finished, the
trained neural network can be used as a classifier. The output products are digital thematic maps amenable to inclusion in a geographic information system (GIS) as a GIS "input" (Figure 2). The result was compared with the conventional supervised classification method.
[Figure 2: The general procedure scheme of the study]
3.1 NN Architecture
The RBF network was calibrated for each land cover class. The input layer consists of 5 neurons, one for each TM image band (TM band 1, band 2, band 3, band 4 and band 5); the 6 neurons in the output layer correspond to 6 land cover classes respectively (Figure 3).
3.2 Pre-processing and Sampling
The study area is located in the north-southern part of Jiangning county, a rolling hilly area of Nanjing. Six classes of land cover type were identified by visual interpretation of the TM image (dated 7 December 1987) combined with the GIS reference data of the site: river (R), lake (L), paddy field (P), bare soil in the upland (B1) with dry condition, bare soil in the lowland (B2) with wet condition, and forest (F).
[Figure 3: Topology of an elementary RBF network]
The False Colour Composite (FCC) TM image was used for pre-processing and sampling to create training and test data as input patterns. This scheme not only reduces the dimension of the classifier weight space and extracts the salient features of the image for each land cover class sample set, but also decorrelates the data, which is an important property for training and testing the neural network classifier. The size of each sub-site area for training and testing is 50 × 50 pixels, which is useful to reduce the training and testing time. The neural network was able to train itself with a small set of training data, but an arbitrary selection of a limited number of training samples would lead to poor classification accuracy, since such samples are hardly representative of the dispersion of the data in the feature space (T. Hosomura, et al., 1992). So the training data were obtained in this way: first the sampling area (sub-sites) for each land cover class was identified on the FCC image and its location recorded, and then the pixels covering the whole range of variance at the centre of the sampling area were taken as a training set. The training data set consists of five values for each pixel, corresponding to the five TM bands, as an input pattern for training the RBF network.
3.3 Training and Testing of the Network
Training was executed by the back-propagation learning algorithm using these training data sets. The RBF network for each class was trained for 4000 iterations. The learning rate of the network was initially set to 0.5, but it was observed that with real data this learning rate led to oscillatory behaviour of the network's sum squared error (SSE). The learning rate was therefore reduced to 0.3, although with some data a decaying oscillatory behaviour was still observed. The critical value of the error was set to 0.05, which means that the accuracy of learning was kept at approximately 95%. Another part of the input patterns from the same sampling area was used to test the network, and the results were used for the evaluation of the network and to make corrections if necessary.
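As an illustration of the model of Eqs. (1)-(4), the following sketch implements a small multi-output RBF classifier with an LMS-style weight update. It is not the authors' implementation: the centre selection, the single shared width, the toy data and the parameter values are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_hidden(X, centres, widths):
    # Eqs. (1)-(2): x_j = exp(-||I - m_j|| / delta_j**2)
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-d / widths**2)

def train_rbf(X, Y, n_hidden=12, lr=0.3, epochs=4000):
    centres = X[rng.choice(len(X), n_hidden, replace=False)]  # centres picked from data
    widths = np.full(n_hidden, X.std())                       # one common width (assumption)
    W = np.zeros((n_hidden, Y.shape[1]))
    for _ in range(epochs):
        H = rbf_hidden(X, centres, widths)                    # hidden-layer activations
        err = H @ W - Y                                       # output error, Eq. (3)
        W -= lr * H.T @ err / len(X)                          # LMS-style update, cf. Eq. (4)
        if np.abs(err).max() < 0.05:                          # critical error value from the text
            break
    return centres, widths, W

# toy data: 5 'TM band' features per pixel, 6 one-hot land cover classes
X = rng.random((120, 5))
Y = np.eye(6)[rng.integers(0, 6, 120)]
centres, widths, W = train_rbf(X, Y)
pred = (rbf_hidden(X, centres, widths) @ W).argmax(axis=1)    # most likely class per pixel
```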
3.4 Combining with Expert Systems by using NDVI and DT
During the training and test stage, we found that there was some confusion among certain classes: they overlap in the feature space, such as rivers and lakes, bare soil in upland and lowland areas, and rivers and forest (Table 1). Previous studies have shown that the Normalised Difference Vegetation Index (NDVI) has a close relationship with the ground water capacity, landscape and texture of the ground surface, which can provide useful information for identifying land cover type and biomass intensity (M. Zhang, 1993). We have applied a simple expert system using the NDVI and a Decision Tree (DT) to overcome this problem.
Table 1 - Classification results from RBF-NN (without NDVI and DT input) at test area¹

True class      RBF network classified results                                  Accuracy (%)
                R      L      B1     B2     P      F      U
1  R            2208   314    64     56     162    626    546     55.53
2  L            346    4180   548    510    390    660    723     56.82
3  B1           28     330    1669   178    108    68     342     61.29
4  B2           262    796    398    5493   172    120    215     73.67
5  P            90     256    248    218    6087   292    300     81.26
6  F            813    586    112    158    308    4301   1023    58.91
   Total        3747   6462   3039   6613   7227   6067   3149    Overall: 65.94
The NDVI was generated from the Red and near-infrared (NIR) channels (bands 3 and 4 of the TM image) by using the equation NDVI = (NIR - Red)/(NIR + Red). The values of the NDVI are close to 100 when the spectral response of the ground is similar in both bands; this is the dominant value in bare soil areas. Areas are covered by a high amount of vegetation when the value of the NDVI is higher than 100, because photosynthetically active biomass has a high reflectance in the near-infrared portion of the spectrum. Values lower than 100 are obtained for areas covered by water or clouds, due to their relatively high reflectance in the red band and lower reflectance in the infrared band. By using GIS techniques, cross-tabulations of the NDVI values and the pixel values from the TM classified images were accomplished to help the identification of the biomass-soil relationships. Since satellite data now have improved spatial resolution, contextual analysis becomes very important. Post-classification "filtering" may be necessary to improve the result; data from the neighbouring pixels have to be considered in the contextual approach. For example, a river may be spectrally classified as lake, but the two can be recognised by looking at linear features. They can be separated by a simple Decision Tree (DT), for example: IF cover = water AND cover is a linear feature THEN land use = river, ELSE IF NDVI > 130 THEN land use = forest, OTHERWISE land use = lake.
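A minimal sketch of this NDVI-plus-decision-tree post-classification follows; it is an illustration, not the authors' code. In particular, the mapping of the raw NDVI onto the 100-centred scale used above is a guess (100·(NDVI + 1), which makes equal band responses map to 100), and the reflectance values are invented.

```python
import numpy as np

def scaled_ndvi(red, nir):
    # NDVI = (NIR - Red) / (NIR + Red), mapped so equal band responses give 100
    # (assumed scaling: 100 * (ndvi + 1); the text only describes the behaviour)
    ndvi = (nir - red) / (nir + red + 1e-9)
    return 100.0 * (ndvi + 1.0)

def decision_tree(cover, ndvi, is_linear_feature):
    # the decision tree quoted in the text: water on a linear feature -> river;
    # otherwise NDVI > 130 -> forest, else lake
    if cover == "water":
        if is_linear_feature:
            return "river"
        return "forest" if ndvi > 130 else "lake"
    return cover

# a water pixel with a low NIR response and no linear context: classified as lake
print(decision_tree("water", scaled_ndvi(red=0.10, nir=0.06), is_linear_feature=False))
```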
Table 2 shows the classification results after input of the NDVI values using the decision tree. The overall accuracy increased to 82.35%.

Table 2 - Classification results from RBF-NN (with NDVI and DT input) at test area

True class      RBF network classified results                                  Accuracy (%)
                R      L      B1     B2     P      F      U
1  R            3074   79     56     60     18     252    138     83.60
2  L            158    5722   308    248    428    308    301     76.57
3  B1           28     160    1956   186    34     8      99      79.16
4  B2           207    324    406    6130   108    18     292     81.90
5  P            26     236    110    128    6197   258    56      88.39
6  F            434    340    34     32     122    6256   286     83.37
   Total        3927   6861   2870   6784   6907   7100   1172    Overall: 82.35
3.5 Classification And Comparison With The Conventional Methods
The trained network was used as a classifier for the land cover classification in the test area. The output thematic map is shown in Figure 4. The result was compared with conventional statistical classification methods. It is well known that the conventional statistical classification methods are based on Bayesian classification theory. The most popular method is the Maximum Likelihood Classification (MLC) method, which assumes a multivariate data source following a normal distribution. Our previous investigation pointed out that the MLC method may produce better results than other statistical classification methods (M. Zhang, 1992). However, it does not work well when we use, as sampling data for the MLC method, the same training data set that was used for the RBF network.
[Figure 4: Output of the classified image by the RBF-NN network]
We have to take care that the data are distributed in a meaningful way, check the separability of each class in the feature space, and make sure that the number of sampling data is sufficient for the statistical analysis. Then we can get acceptable results (Table 3).
Table 3 - Classification results from Maximum Likelihood Classifier (MLC)
True class      MLC classified results                                          Accuracy (%)
                R      L      B1     B2     P      F      U
1  R            1517   91     39     32     44     110    42      80.91
2  L            173    2835   200    131    154    177    81      75.61
3  B1           14     80     983    89     19     6      60      78.61
4  B2           131    248    199    2749   52     33     338     73.30
5  P            45     111    48     109    3246   95     97      86.55
6  F            177    120    16     42     146    3204   46      85.43
   Total        2056   3484   1484   3152   3660   3625   664     Overall: 80.18
The accuracies of the classified images from both the Maximum Likelihood Classifier (MLC) and the Neural Network classifier (RBF-NN) were assessed by examination of the training-area pixels and of independent test-area samples respectively. Six test areas (60 × 60 pixel blocks) were selected randomly on the original FCC of the study area. The agreement between the test areas and the ground truth information is shown in Table 1, Table 2 and Table 3.
4. RESULTS AND DISCUSSION
The RBF-NN network can be applied to satellite image processing for land use classification and provides useful information for geographical information systems. The following results were obtained.
1) In the case of the selected training data set described in 3.2, we may easily define the sampling area and create the training pattern for the RBF network according to visual interpretation by experienced analysts. The RBF will then produce a specified output corresponding to the training pattern. This is useful for pattern identification tasks such as land use change detection and environmental monitoring.
2) From Table 1, we find that there was confusion between the water classes (R and L) and the forest (F). Most unclassified pixels (U) were located in areas where the reflectance characteristics of water and trees mix, especially along the Qinghuai River. We checked with ground truth and found that they were caused by trees and buildings along the river in the training area. It is quite difficult to separate them owing to the complicated conditions on the ground; for instance, the river is narrow, sometimes less than 30 m. The dry land class (B1) and the wet land class (B2) (bare soil at the moment of image recording) have similar spectral features, which could have caused mis-classification. To overcome those problems we have to input extra information, mostly from experienced analysts. The RBF-NN network has the ability to formalise prior knowledge about land cover types and to incorporate this knowledge into the classification process. The NDVI has a close relationship with ground cover types, and this knowledge base can be fed to the network by using a decision tree (DT). It can help to decide which is the most likely class if there is confusion between classes. Our experimental results show that the accuracy increased for each class and that the overall accuracy increased from 65.94% to 82.35% (Table 1 and Table 2).
3) The RBF-NN network has a very general approximation ability and makes very mild assumptions on the multispectral data source and the training data set: it is not necessary that the data follow a Gaussian normal distribution. This is a main reason why the RBF network may perform satellite image classification better than conventional statistical methods (overall accuracy gain: 82.35% - 65.94% = 16.41%).
4) Though the RBF-NN can be used to classify land cover from satellite images, and the training data can be selected without any prior knowledge about the statistical distribution of the data sources, if the number of ambiguous pixel types in the training data sets increases, the RBF-NN may not make the distinction between those classes.
5. CONCLUSIONS
On the basis of the study described above, the following conclusions can be made. The Radial Basis Function (RBF-NN) network has been applied to the extraction and analysis of land cover features in view of a land cover classification and ecological condition study. The experimental results illustrate that ancillary data or spatial data from other sources help in remote sensing data classification. The ANN can be used to bring multisource spatial data together. It shows higher classification accuracy with less sampling data (without the ambiguous types of data), especially for a specially designated training set. Thus the application of RBF-NN to land cover classification is a promising method for satellite image processing and classification.
REFERENCES
Bischof H., Schneider W., and Pinz A.J., "Multispectral classification of Landsat images using neural networks", IEEE Transactions on Geoscience and Remote Sensing, Vol. 30, No. 3, pp. 482-490, 1992.
Hosomura T., et al., "Cloud free mosaic images", Proceedings of ISPRS, Washington DC (USA), Vol. VII, pp. 209-216, 1992.
Zhang M., et al., "Application of satellite remote sensing to soil and land use mapping in the rolling hilly area of Nanjing, Eastern China", Proceedings of the workshop "Remote Sensing and GIS integrated for the management of less favoured areas", Louvain-la-Neuve, Belgium, 29 June/1 July 1992.
Zhang M., "Land cover inventory using remote sensing and GIS techniques for assessment of biomass and soil relationships", Proceedings of the International Symposium "Operationalization of Remote Sensing", J.L. van Genderen et al. (eds.), ITC Enschede, The Netherlands, Vol. 4, pp. 253-262, 19-23 April 1993.
¹ R = River, L = Lake, B1 = Dry land, B2 = Wet land, P = Paddy field, F = Forest, U = Unclassified
PEICRE PROJECT: a Practical Application of Remote Sensing Techniques for Environmental Recovery and Preservation
Marco Benvenuti¹, Claudio Conese², Carlo Di Chiara¹, Andrea Di Vecchia¹
¹ Ce.S.I.A. - Accademia dei Georgofili, Via Caproni 8, 50145 Firenze (Italy), Ph. +39-55-301422, Fax +39-55-308910, e-mail: [email protected]
² I.A.T.A. - C.N.R., Via Caproni 8, 50145 Firenze (Italy), Ph. +39-55-301422, Fax +39-55-308910, e-mail: [email protected]
1. INTRODUCTION
In this work the preliminary results achieved by the PEICRE project will be presented. In this project remote sensing techniques have been used to evaluate the effects of a previous project against drought in Niger: the PIK (Projet Integree Keita) project. The studied area, positioned around the Keita village in Niger (Africa), is characterised by adverse climatic conditions, with a very low rainfall level and heavy erosion. The PIK project started in 1984 thanks to a Niger/Italy/FAO agreement; it was interrupted two years ago to observe its effects on the territory concerned, and in 1996 it has now started again. During these years a large amount of field work has been carried out to improve the potential agricultural productivity of the region, that is, to reduce the desertification and erosion processes.
Figure 1. The region of interest
The aim of the PEICRE project is to evaluate whether the PIK interventions have produced the desired effects, by means of remote sensing techniques and geographical information systems. A multitemporal analysis of Landsat TM and MSS and of SPOT multispectral images has been used to investigate how the environment in the studied area changed from 1984 to 1995. Three sets of images, collected on three different dates during the PIK project years (the first at the beginning, the second in the middle of the period and the last at the end of the project), have been processed to compare the different classification maps, one for each year, in order to establish the evolution of the agricultural and natural vegetation coverage. The processed images have been integrated with digitised aerial photographs, digital cartography and digital terrain models into a geographical information system (GIS). Furthermore, the results of the field observations have been compared with the information extracted by means of remote sensing. Aerial photographs have also been used, together with field observations, to verify the reliability of the developed algorithm. To improve and ease the interpretation of the photographs, a data fusion process between satellite data and aerial photographs has been performed, producing encouraging results. Besides the description of the defined methodology, the successful preliminary results will be shown, so that the importance of remote sensed data availability in this field of application will be pointed out.
2. THE PEICRE PROJECT
In recent years many applications of remote sensing techniques have been proposed for environmental analysis, also thanks to the continuous improvement of sensor reliability and precision. The enhanced sensibility of world governments and research institutions to environmental issues, and the new awareness of the close relation between environmental themes and human health and safety, have led to an increase of worldwide research efforts in environment management and preservation. So it is important to study and define new methodologies, new algorithms and innovative instruments in the field of environmental analysis. Remote sensing represents a useful technique for the observation of wide areas and for the extraction of information related to environmental parameters and indicators. In the PEICRE project the effects of a previous project for territory recovery, the PIK project (Projet Integree Keita), have been investigated. In the case of the PIK project the interventions were devoted to guaranteeing a higher food availability and autonomy for the local population. The area of interest of the PEICRE project lies inside the Keita district in Niger and has an extension of about 3,500 km². This region is characterised by a heavy environmental degradation problem and a high mortality level of the native population, due to the low availability of alimentary resources. The PIK project was developed from 1984 until 1995, thanks to an agreement among FAO, the Italian Government and Nigerien institutions. During these years several specific recovery interventions and field works were realised to slow down erosion and desertification processes. Because of the wide extension of the area to investigate, it is extremely difficult to monitor it over time simply by means of field works and observations, also because continuous
field works can be very expensive economically, and they could require a lot of time and personnel. On the contrary, remote sensing techniques allow the definition of methodologies to observe the environmental modification over time, in wide regions too, with a high accuracy level and a good spatial detail. In the PEICRE project, images acquired in 1984, 1989, 1990 and 1995 by different multispectral sensors have been processed and analysed. Unfortunately, not all the images were collected by the same sensor, because of the poor satellite data availability for this region. For this reason Landsat TM, MSS and SPOT images have been processed and then compared in order to evaluate the evolution of the vegetation coverage over the ten years of the PIK project. One of the main problems encountered in this work is due to the non-homogeneity of the data. Finally, a new methodology has been defined and implemented in order to solve the several problems coming from the heavy soil influence on the measured reflectance, the wide test area extension and the similarity of the typologies to extract. The defined classification model is based on a preliminary spectral classification and on the integration of the spectral data with territorial information, collected either during field missions or acquired from cartographic data and geomorphologic analysis.
3. CHARACTERISTICS OF THE STUDIED AREA
The studied area is characterised by an arid climate, with a very low rainfall level, around 300-400 mm/year, concentrated in few months, from the end of May to the first weeks of September. In this period the precipitations are so intense as to generate enormous terrain erosion, producing temporary rivers, the so called "kori", and producing tremendous consequences for the few cultivated areas present. Moreover, such a situation is extremely dangerous for the terrain stability and for the agrarian and pastoral economy of the area. It is important to take into account that in 1984, when the PIK project started, the situation was actually dramatic. In fact in that year the rainfall was practically absent, and the food availability was so scarce that the local population starved. After that year the situation improved, and the increased precipitation level, together with the positive effects of the interventions for environmental recovery, led to a larger extension of agricultural areas and consequently to a yield increase. From the geomorphologic point of view, four main physiographic units can be distinguished: a. the plateau, covering about 29% of the territory; b. the slopes, covering about 32% of the territory; c. the "glacis" (bottom valley), covering 19% of the territory; d. the dunes, due to wind erosion, covering 6% of the territory. The high altitude variability of the terrain and the very close valleys, together with the sparse vegetation coverage, have represented the main sources of the elaboration problems encountered. A large amount of meteorological data, collected by ground stations since 1982, has been processed to be compared with the classification results, in order to understand the vegetation temporal dynamic well.
4. THE CLASSIFICATION METHODOLOGY
To assess the effects produced by the PIK field works on the environment, an important indicator is represented by the vegetation changes, both in terms of distribution over the surface and of greenness intensity, over the years of the project. A useful instrument to perform
this kind of analysis is the classification map of the vegetation coverage. By comparing the classification maps relative to different years, it is possible to evaluate how the extension of each vegetation class has changed. In the case of this project a major obstacle was represented by the poor availability of remote sensed data and by their low quality level. In fact, images acquired by different sensors have been processed and the results compared. Some specific radiometric corrections were then necessary to permit a quantitative comparison of the spectral information extracted from the images acquired by different kinds of sensors [1],[2],[3]. The classification procedure is based on the integration of different information layers, provided by digital cartography, on-field observations, digitised aerial photographs and multispectral satellite images. More in detail, digital cartography has been used to build the digital terrain model (elevation, slope and exposition) of the studied area, while the on-field observations provided some characteristic thresholds for the morphological parameters necessary to obtain a logical separation among physiographic units.
Figure 2. Classification of vegetation distribution over the project area in 1984.
The satellite images, after radiometric and geometric corrections have been applied, have been classified by using a partially supervised method, based both on the knowledge of the observed terrain and on the spectral characteristics of each class to extract. Before applying the spectral classification procedure, synthetic bands have been created and added to the spectral bands directly acquired by the sensor. Different vegetation indices have also been calculated in order to have an objective idea of the actual vegetation coverage [4][5]. The ground contribution has anyway been taken into account to enhance the reliability of the indices calculation. To assess the classification quality, photo-interpretation has been performed over a fusion image between aerial photographs and satellite images, obtaining a high spatial resolution spectral image with a 9×9 m pixel dimension. The product of this
fusion was also useful to update the digital cartography directly on the workstation, because in such an image a very good detail level is maintained together with the spectral information. An example of the map produced by the classification model is shown in Figure 2, while Figure 3 shows the state of the vegetation in 1984.

Class               Area '95 (ha)   Area '95 (%)   Area '84 (ha)   Area '84 (%)   Diff. '95-'84 (ha)   Diff. '95-'84 (%)   NDVI
Without veget.      61054.27        10.78          275869.56       48.38          -214815.29           -37.60              -1.00, -0.21
Very poor veget.    76954.37        13.59          209294.08       36.70          -132339.71           -23.11              -0.21, -0.17
Poor veget.         205706.68       36.32          77196.84        13.54          128509.83            22.78               -0.17, -0.09
Middle veget.       188426.47       33.27          6454.65         1.13           181971.81            32.14               -0.09, -0.02
High veget.         34168.32        6.03           1417.67         0.25           32750.65             5.78                -0.02, 1.00

Tab. 1 - Comparison between vegetation distribution in 1984 and 1995
5. THE MULTITEMPORAL ANALYSIS
An important instrument to improve classification performance, when the terrain is particularly complex, is the multitemporal analysis of the available data [6]. The classification model has therefore been applied to the whole set of images collected for each year, in order to compare the output maps and to evaluate the differences in vegetation coverage between 1984 and 1995.
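As an illustration of this kind of multitemporal comparison, the sketch below tabulates per-class area changes between two co-registered classification maps, in the style of Tab. 1 above. It is not the project software: the 30 m pixel size is an assumption (typical of TM data) and the maps are random stand-ins.

```python
import numpy as np

# class codes 0..4 follow the rows of Tab. 1 above
CLASSES = ["Without veget.", "Very poor veget.", "Poor veget.",
           "Middle veget.", "High veget."]
PIXEL_HA = 30 * 30 / 10_000            # assumed 30 m pixels converted to hectares

def class_areas(class_map):
    counts = np.bincount(class_map.ravel(), minlength=len(CLASSES))
    return counts * PIXEL_HA

rng = np.random.default_rng(2)
map84 = rng.integers(0, 5, (500, 500))  # stand-ins for the 1984 and 1995 maps
map95 = rng.integers(0, 5, (500, 500))

a84, a95 = class_areas(map84), class_areas(map95)
for name, h84, h95 in zip(CLASSES, a84, a95):
    print(f"{name:18s} '84: {h84:9.2f} ha  '95: {h95:9.2f} ha  diff: {h95 - h84:+9.2f} ha")
```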
Figure 3. Vegetation distribution in 1984.
To achieve the purpose of the P.E.I.C.R.E. project, that is to evaluate whether the PIK project improved environmental conditions from the agricultural production point of view, the analysis of the vegetation temporal dynamic seemed to be the only possible approach or, at least, the most useful to produce reliable results. Due to the differences in the radiometric and geometric characteristics of Landsat TM and SPOT HRV, in order to perform a quantitative comparison based on vegetation index values, a correction factor has been applied to the SPOT images to attenuate these differences [1].
In particular, the comparison has been done for the treated areas by considering some nearby reference zones which had similar coverage in 1984 and which were not treated before 1995. In this way the different behaviour of treated and untreated areas up to 1995 has been observed. The results of the multitemporal comparison are shown in Tab. 1 and in Figure 4.
Figure 4. Vegetation differences between 1984 and 1995.
REFERENCES
[1] J. Price, "Calibration of Satellite Radiometers and the Comparison of Vegetation Indices", Remote Sensing of Environment, 21:15-27, 1987
[2] K.P. Gallo, C.S.T. Daughtry, "Differences in Vegetation Indices for Simulated Landsat-5 MSS and TM, NOAA-9 AVHRR and SPOT-1 Sensor Systems", Remote Sensing of Environment, 23:439-452, 1987
[3] F.G. Hall, D.E. Strebel, J.E. Nickeson, S.J. Goetz, "Radiometric Rectification: Toward a Common Radiometric Response Among Multidate, Multisensor Images", Remote Sensing of Environment, 35:11-27, 1991
[4] G. Rondeaux, M. Steven, F. Baret, "Optimisation of Soil-Adjusted Vegetation Indices", Remote Sensing of Environment, 55:95-107, 1996
[5] C. Conese, F. Maselli, "Use of multitemporal information to improve classification performance of TM scenes in complex terrain", ISPRS J. of Photogrammetry and Remote Sensing, no. 46, pp. 187-197, 1991
[6] L. Xia, "A Two-Axis Adjusted Vegetation Index (TWVI)", Int. J. Remote Sensing, Vol. 15, No. 7, pp. 1447-1458, 1994
A WAVELET CLASSIFICATION CHAIN FOR RAIN PATTERN TRACKING FROM METEOROLOGICAL RADAR DATA
P. Gamba (°), A. Marazzi (°), A. Mecocci (*)
(°) Dipartimento di Elettronica, Università di Pavia, Via Ferrata 1, I-27100 Pavia, ITALY
(*) Facoltà di Ingegneria, Università di Siena, Via Roma 77, I-55300 Siena, ITALY
ABSTRACT
In this paper we present a new wavelet based classification chain to analyze sequences of images obtained from meteorological radar. The system is able to track the different rain cells and wavefronts of rain event data acquired with a frequency of 4 images per hour. The detection is made with a packet wavelet transform, in order to exploit the physical characteristics of rainfall fields and the textured patterns as discriminating features for the different structures of such images. The chain was applied to a radar sequence of an event that occurred in Northern Italy in October 1992.
1. INTRODUCTION
The increasing availability of space/time rainfall data from radar and satellite sources pushes more and more researchers towards the study of systems able to analyze the evolution of the weather in an automatic way. Forecasting in meteorology is becoming a task with a set of implications, not only for the simple interest in knowing whether it is raining or not, but also for previewing tornadoes or floods in high risk zones and for the field of microwave communication, where there is a great need for a complete study of the propagation characteristics of the atmosphere. The advantage of the use of weather radars is that the rain estimation is made for wide geographic regions, to consider large scale rain structures, but with a fine space resolution [1][2]. The tracking process is not an easy task, because the temporal evolution of the rain event results from complex processes, organized at different space-time scales. The features to detect range from the little rain cells, to the cluster potential regions, to the rainbands. Rainbands occur with the storm and move in the same direction as the storm; within these
rainbands, rain cells exist, which are born, grow, decay and die, moving in a common direction that does not usually coincide with the direction of the storm. It is thus necessary to find a system able to separate the different patterns at the different meso-scales and to detect the relative motion of all the structures involved in the meteorological event. The recognition of the different shapes is made with the discrete packet wavelet transform [3]; the use of such an algorithm is motivated by the high efficiency and low computational cost of the transform and by the observation that a large class of textures can be modeled as quasi-periodic signals whose dominant frequencies are located in the middle frequency channels, easily provided by this transform. The subimages so obtained have an energy related to the contributions of the different scales for the different patterns; they can be considered as a multiband representation of the same scene and thus treated as a multidimensional clustering problem. Once the different meso-scale patterns have been isolated, we calculate their relative velocities and verify that the movement of the shapes is de-coupled along different directions.
2. THE PACKET WAVELET TRANSFORM
Wavelets [4] have generated great interest in both theoretical and applied areas, especially over the past few years. A lot of work has been done applying such tools in image processing [5], but other fields also take advantage of the great evolution of this transform. The general structure of the discrete wavelet transform is quite similar to those found in subband coding systems, the main difference being that the wavelet filters are required to be regular. A general extension of the classical wavelet transform is made using a library of modulated waveform orthonormal bases, with the introduction of the discrete packet wavelet transform [3], which corresponds to a general tree-structured filter bank. In the simple wavelet transform, starting from a low-pass filter that satisfies the standard quadrature mirror condition

$H(z) H(z^{-1}) + H(-z) H(-z^{-1}) = 1$   (1)

and from a complementary high-pass filter G(z), obtained by shift and modulation, a recursive filtering is applied only to the low-pass signal. The 2-D packet wavelet transform can then be seen as a tensor product of two 1-D wavelet basis functions along the horizontal and vertical directions, as shown:

$h_{LL}(k,l) = h(k)h(l) \quad h_{LH}(k,l) = h(k)g(l) \quad h_{HL}(k,l) = g(k)h(l) \quad h_{HH}(k,l) = g(k)g(l)$   (2)

where the pyramidal recursive filtering is applied to all the subbands, generating a set of subimages strictly related to particular ranges of spatial frequencies. The wavelet packets allow more flexibility in adapting the basis to the frequency contents of a signal.
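A minimal, self-contained sketch of a full 2-level packet decomposition follows. It uses Haar filters purely to keep the example short; the paper uses the Battle-Lemarié basis, and the input image here is random.

```python
import numpy as np

H = np.array([1.0, 1.0]) / np.sqrt(2.0)   # Haar low-pass (stand-in for Battle-Lemarie)
G = np.array([1.0, -1.0]) / np.sqrt(2.0)  # complementary high-pass

def filt_down(x, f, axis):
    # length-2 filtering and dyadic decimation along one axis
    x = np.moveaxis(x, axis, 0)
    y = f[0] * x[0::2] + f[1] * x[1::2]
    return np.moveaxis(y, 0, axis)

def packet_level(img):
    # one level of the separable transform of Eq. (2): hh, hg, gh, gg
    return [filt_down(filt_down(img, fk, 0), fl, 1)
            for fk in (H, G) for fl in (H, G)]

def packet_transform(img, levels=2):
    # full packet tree: recurse on every subband, not only on the low-pass one
    bands = [img]
    for _ in range(levels):
        bands = [sub for b in bands for sub in packet_level(b)]
    return bands                             # 4**levels subimages

subimages = packet_transform(np.random.default_rng(3).random((64, 64)))
print(len(subimages), subimages[0].shape)    # 16 subimages of 16x16 pixels
```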
3. THE CLASSIFICATION CHAIN
The concept shown in section 2 is here exploited in order to extract the textured patterns at the different scales and to detect their velocities. The algorithm works as follows (a minimal sketch of steps ii and iii is given after this section):
i) Apply a 2-D packet wavelet transform to each image of the sequence. With this first operation, the components of the rain at different spatial and frequency scales are subdivided, and the decimation process permits the elimination of high spatial frequencies linked to very short term phenomena of little interest for our purpose. We decided to exploit a 2-D Battle-Lemarié basis [6], because it showed better results in the separation of the features. We stop at level 2 of the transform, in order to share all the information among 16 subimages; a higher level would produce too great a number of images, with too poor a resolution for our purpose, while a simple 1-level wavelet transform is not enough to detect the different patterns well.
ii) Each of the subimages presents a different distribution of the energy of the wavelet coefficients, which permits the detection of the different textured patterns. In order to retain such energy, an envelope estimation algorithm is applied to the subimages. Here a simple zero-crossing algorithm is performed, where the maximum value between two adjacent zero-crossings is found and assigned to all points within the interval. This procedure is applied row-wise or column-wise depending on the wavelet filter direction for the image. At the end of this step, the subimages present high gray level values in correspondence of the zones where there is high activity for the particular spatial frequency selected by the filtering process.
iii) The basic idea is that different textured patterns have a different representation in the wavelet subimages, so at this point we have a certain number of images, different representations of the same scene, where the gray levels represent good features to separate the different shapes. So after a normalization process, in order to restore the variance of the gray levels between 0 and 255 (the wavelet transform gives as output a set of coefficients proportional to the gray levels of the original image), a multidimensional K-means clustering algorithm is applied, exploiting the information of the 16 subimages. The cluster centers are initialized randomly and we choose a number of clusters equal to three, due to a priori knowledge of the kind of data used here.
iv) Once the different shapes related to the different structures involved in the rain event have been extracted, it is necessary to extract information on the behavior of the patterns at the different meso-scales. In particular we detect the different velocities by means of a lag correlation model based on the maximization of the Lagrangian spatio/temporal correlation.
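The following sketch is a simplified stand-in for steps ii and iii; it reuses the `subimages` produced by the packet-transform sketch above, applies the envelope only row-wise for brevity, and uses parameter values chosen here for illustration.

```python
import numpy as np

def envelope_rows(band):
    # step ii: between consecutive zero-crossings, hold the maximum absolute value
    env = np.zeros_like(band)
    for i, row in enumerate(band):
        edges = np.flatnonzero(np.diff(np.sign(row))) + 1   # zero-crossing positions
        for a, b in zip(np.r_[0, edges], np.r_[edges, len(row)]):
            env[i, a:b] = np.abs(row[a:b]).max()
    return env

def kmeans(X, k=3, iters=50, seed=4):
    # step iii: multidimensional K-means with randomly initialized centres
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        centres = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
    return labels

# one envelope value per subimage and pixel -> 16-dimensional feature vectors
feats = np.stack([envelope_rows(b).ravel() for b in subimages], axis=1)
lo, hi = feats.min(0), feats.max(0)
feats = 255.0 * (feats - lo) / (hi - lo + 1e-9)             # normalize to 0..255
labels = kmeans(feats)                                       # three pattern classes
```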
4. EXPERIMENTAL RESULTS AND DISCUSSION
The classification chain was applied to a radar sequence of a rain event that occurred on 4 October 1992 in Northern Italy; the data consist of a sequence of 80 raw images with a dimension of 100×100 pixels and a resolution of 1 km² per pixel. The acquisition was made with a C-band Doppler radar operating at the 5.5 cm wavelength with a frequency of 4 images per hour; the ground station is positioned in Teolo (Pd), in the Colli Euganei zone, with a range of observation 160 km wide. The reflectivity was converted into rainfall rate using the Marshall-Palmer formula:

$Z = 200 \, R^{1.6}$   (3)
where Z is the reflectivity and R the intensity of precipitation. Fig. 1 shows three consecutive frames of the rain event, while in Fig. 2 it is possible to see the result of the application of the complete classification chain. It is easy to verify that the two different meso-scale patterns show a de-coupled dynamic, with different velocities. This was verified by comparison with the results obtained from the observation of data from the geostationary METEOSAT satellite and from the movement of the 0-isoallobaric contour line. A large scale motion was detected here, and a small scale motion of cell clusters was recognized with a direction towards N-NW, following the wind at 700 hPa. This result proves the consistency of a wavelet based approach, which rests mainly on the physical characteristics of clouds and rain. As has been widely demonstrated, the structure of rainfall fields is intimately bound to a spatial/frequency representation well approximated by wavelets. The analysis of the recorded data shows an inner structure of these fields (due to the atmospheric phenomena that generated them) that wavelets allow us to represent in a straightforward way. The system can be improved in the future with the addition of shape recognition and matching algorithms, in order to better follow the complete evolution of all the different rain patterns and to give a complete description of the rain events in all their complexity.
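A one-line inversion of Eq. (3) recovers the rain rate from the measured reflectivity; the dBZ-to-linear conversion below is an assumption about how the reflectivity is recorded, not something stated in the text.

```python
def rain_rate(z_dbz):
    # invert the Marshall-Palmer relation Z = 200 * R**1.6 of Eq. (3);
    # reflectivity is commonly stored in dBZ, hence the 10**(dBZ/10) step
    z = 10.0 ** (z_dbz / 10.0)           # linear reflectivity, mm^6/m^3
    return (z / 200.0) ** (1.0 / 1.6)    # rain intensity R in mm/h

print(rain_rate(30.0))                   # roughly 2.7 mm/h for a 30 dBZ echo
```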
Figure 1. Three consecutive frames of the sequence. The front of the storm and the position of the ground station are visible.
Figure 2. The two different meso-scale patterns are here well recognized and separated.
REFERENCES
[1] A. Pawlina, "Rain patterns motion over a region deduced from radar measurement", Alta Frequenza, Vol. LV (2), pp. 99-103, 1987
[2] P.V. Hobbs, "Organization and structure of clouds and precipitation on the mesoscale and microscale in cyclonic storms", Rev. of Geophysics and Space Physics, Vol. 16 (4), pp. 741-755, 1978
[3] R.R. Coifman, Y. Meyer, V. Wickerhauser, in Wavelets and their applications, pp. 453-470, Jones and Bartlett eds., 1992
[4] M. Vetterli, C. Herley, "Wavelets and filter banks: Theory and design", IEEE Trans. on Signal Processing, Vol. 40, No. 9, pp. 2207-2232, Sept. 1992
[5] R.A. De Vore, B. Jawerth, B.J. Lucier, "Image compression through wavelet transform coding", IEEE Trans. on Information Theory, Vol. 38, No. 2, pp. 719-746, March 1992
[6] C.K. Chui, An Introduction to Wavelets, Academic Press, San Diego, 1992
Frequency Locked Loop System for Doppler Centroid Tracking and Automatized Raw Data Correction in Spotlight Real-Time SAR Processors
Fabrizio Impagnatiello (a), Andrea Torre (a)
(a) Alenia Spazio S.p.A., via Saccomuro 24, 00131 Rome - Italy
Abstract. Spot-Light mode SAR instruments are high resolution imaging radars mainly used for military purposes. The bi-dimensional signal data are generated during the acquisition of the echo backscatter from the observed area at a certain microwave radio frequency. In this operative mode the acquired data are typically affected by a quasi-linear phase term distortion, which varies considerably during the radar acquisition period. Similarly to other radar instruments operating in a similar way (e.g. MTI applications), the real-time estimation of the vectorial changes in the spectrum centroid is absolutely needed by the processing system, since the final focused image shall be generated in a few seconds after the acquisition itself. A frequency control system shift-locked on the input vectorial data spectrum allows both the estimation of the time dependent Doppler centroid and the restoration of the raw data into their proper unambiguous Doppler bandwidth. The estimation/compensation system, as it has been conceived, is quite independent of the SAR processor type which follows in the data flow architecture. Nevertheless, our goal was to disclose the possibility of using the best processor available for its high accuracy characteristics in squinted/partially squinted acquisition geometries: the Extended Chirp Scaling processor (ECS). Furthermore, the in-house know-how of parallel computing architectures, based both on commercial DSPs and on the proprietary QUADRICS™ system, can be used for developing an integrated system for real-time acquisition, dynamic processing parameter estimation and focusing.
1 INTRODUCTION
High resolution Synthetic Aperture Radars often require a real-time estimation capability of the Doppler parameters for the image focusing procedures. Several methods exist to compute the exact bandwidth extent of raw data interpreted as a vector signal of the so called 'radar slow time' co-ordinate. In conventional strip map operation mode the microwave instrument acquires a signal bandwidth in accordance with the Nyquist theoretical boundary. Even if the Doppler centroid describes a small span on the azimuth frequency axis, the total input bandwidth in the along track direction is confined within the observation bandwidth determined by the Pulse Repetition Frequency only. The PRF is an instrument parameter which shall be strictly related to the antenna along track size and to the kinematic properties simply as
$PRF > \eta \cdot \frac{2 v_A}{L_{al-trk}}$
where $L_{al-trk}$ is the antenna length in the flight direction, $v_A$ the maximum module of the platform speed relative to the target area and $\eta$ the oversampling factor (typically equal to 1.2). If the desired azimuth resolution shall comply with

$\rho_{az} > \frac{v_A}{PRF}$
it is clear that a very high resolution requires an equivalently high PRF value, which directly impacts on the maximum number of pipelined pulse/echo bursts to be accommodated during the worst case round-trip time:

$N_{pipelined} = \frac{2 R_{max}}{c} \cdot PRF$
The minimum Doppler bandwidth for an arbitrary target within the observed area shall then be

$BW_{target} > \frac{v_A}{\rho_{az}}$
Nevertheless, no restrictions are placed on the mechanism of Doppler bandwidth acquisition during imaging. The only condition to be satisfied is the continuous phase change detection and recording during the total integration time. This is the key of the SpotLight SAR mode, in which each target (properly, its Doppler echo return) is tracked and acquired within an instantaneous bandwidth much smaller than the total one. Such a goal is obtained by adjusting the antenna beam over the same target area while the platform moves.
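As a rough numeric illustration of the constraints above (all values, platform speed, antenna length, oversampling factor and slant range, are invented for the example and not taken from any specific system):

```python
C = 3.0e8                                  # speed of light (m/s)

v_a, ant_len, eta = 7600.0, 10.0, 1.2      # assumed speed, antenna length, oversampling
prf_min = eta * 2 * v_a / ant_len          # PRF > eta * 2 v_A / L_al-trk
rho_az = v_a / prf_min                     # azimuth resolution bound
r_max = 850e3                              # assumed slant range (m)
n_pipe = (2 * r_max / C) * prf_min         # pulses simultaneously in flight
bw_target = v_a / rho_az                   # minimum Doppler bandwidth per target

print(f"PRF > {prf_min:.0f} Hz, rho_az > {rho_az:.2f} m, "
      f"{n_pipe:.1f} pipelined pulses, BW_target > {bw_target:.0f} Hz")
```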
2 DOPPLER TRACKER ALGORITHM
When a target area is sensed by a SAR instrument operating in SpotLight mode, raw data are collected at a constant along-track sampling frequency. Here it is assumed that a straight flight line is followed by the platform; this does not cause a lack of generality at this level, since the only impacts can be identified at the focusing level. The input data Doppler spectrum is dynamically mapped onto the azimuth observation bandwidth according to the plot in Figure 1. In that figure the axis co-ordinate t represents the radar integration time (radar flight time), during which the antenna beam is redirected to track the same 'circular' area on the ground. The effective Doppler energy is spread over the total physical Doppler frequency shift span, but the sampled operation mode only allows the folded acquisition of the data spectrum.
[Figure 1 - Doppler spectrum of SpotLight raw data]
The object matter can be summarized as follows. Data are collected at a rate equal to PRF, having a compatible instantaneous bandwidth extent. Therefore data can be locally
reconstructed by keeping the same sampling rate, but the preservation of the PRF rate for a correct data representation at the focusing processor input is impossible. Moreover, the estimation of the time-varying Doppler centroid shall be performed, because the focusing processor uses it in many cases. Thus two tasks shall be carried out by the Doppler tracker:
- unambiguous Doppler centroid estimation
- data reconstruction according to the total acquired bandwidth.
Both the above tasks shall be guaranteed at focusing-processor run time by means of a real-time implementation compatible with the SpotLight operation mode itself. In particular, the opening of the sampling window within the PRI (pulse repetition interval) is continuously adjusted in order to compensate the small distance changes, especially when a large squint angle is used (forward/backward spotlight modes). The target area time-tracking involves an alignment task in the data preprocessor before the Doppler estimator kernel. Figure 2 reports a sketch of the proposed Doppler tracker. The core of the whole system is the Dual Hermitian Frequency Generator (DHFG). This is a numerical implementation of a voltage controlled oscillator (VCO) which makes use of a linear stepped control technique. The DHFG supplies two coherent reference signals at different sampling frequencies in order to perform both down and up conversions on the input raw data. Conditioned data are forced by the system to stay confined within a zero-frequency-centred bandwidth where they are constantly tracked. A preliminary restoration of the input raw data compensates the effects of both phase and amplitude errors due to hardware tolerances (phase orthogonality and amplitude unbalance) in the I/Q demodulator. If data are acquired in real format because of a video band sampler selection, data restoration is not needed, but data require to be translated into a complex format. Data restoration has the purpose of compensating possible distortions introduced by the on-board quadrature coherent demodulator and by the baseband input chain up to the A/D converters. The distortion effects are listed below:
- individual DC offsets
- I/Q amplitude imbalance
179 9 I/Q phase orthogonality error. Assume to have a certain number N of consecutive samples on each echo line m
S...,=I...,+jQ..., n = 1,..,N m= 1,.., M m
The DC offsets I and Q may be calculated starting from S,, m as follows: N M
i=--ff-~' E
N M
1,,,,, Q=--~ - 1 E Q...,
n=l m= 1
n=l m= 1
The gain imbalance G will be calculated by measuring the r.m.s, values on two channels: N M
1Z
1r/?13' " -
n=l hi=|
N M
Q~.,,= - ~
2m n=l m= l
and then
c=( IL.,.-i' QL,.-O' " In order to compute the phase orthogonality error cp, a correlation figure R will be computed against I and Q distributions
' Z I,,..,Q..., N M
R=---~
n=l m=l
by which it is easy to determine the phase orthogonality error tp,. sin ~, =
2R 12,,.,.+Q~,,.,.
Finally it is possible to recover the expected complex signal informations are available
Z,,,,, :[(I,,~,,-i)+jG.(Q,,.,,,--Q)]'e -j~" 9
Z,,,,,
since all required
180 It shall be noted that only this last expression has to be intensively used during preprocessing. In fact the system's parameter deviations are quite constant in the short time scale and therefore the estimations of both gain unbalance and phase orthogonality error can be performed once. Raw data are also used to carry out the Doppler centroid initial guess. The operation is absolutely needed to avoid big variations of Doppler Centroid Estimate output during a first lock-in transient. This estimation is perfomed via the Madsen algorithm which computes the Doppler centroid looking at the signs of input complex samples only, obtaining the following expression:
,) ' ,) n,m fDc2re --~'tan-lZSRe(Zn,m).SRe(Zn,m_l)-Sim(Zn,m).Sim(Zn,m_l ) PRF
/.l,m
Once data have been preprocessed they are ready to be down converted to 'hard' baseband. To this specific scope a quasi monocromatic tone is generated by DHFG and applied via a digital mixer on raw data vectors, sorted per azimuth lines. In this way a linear phase term is removed from data achieving the resulting averaged power spectruna to be centred around the zero frequency. This behavior is guaranteed (after a short transient) by the closed loop which tries to keep locked such a state performing a correction based on the residual spectrum centroid evaluation after low pass filter. The dual hermitian reference source substantially introduces an either positive or negative phase excess which is added to raw data phase information. It is of fundamental importance that the phase excess is exaclty removed after tracking/filtering operations otherwise an uncontrollable phase distorsion would be embedded in preprocessed raw data. Thus the numerical oscillator provides two signal outputs which are phase negated one against the other. From a mathematical point of view this corresponds to generate two complex conjugate sample sequences at a rate compatible with input data streams and oversampled output data streams (vector processing) respectively. The phase error detector is a monitor of residual Doppler centroid in the filtered data. The basic algorithm used for such an estimation is a short time weighted correlation
z,'
W = X
......... 1 m=2
The main criticality of the control loop is the selection of a proper loop-bandwidth to assure both intrinsic stability and a limited peak error value in the tracker. This shall limit the big changes at phase detector block output.
181
~_____.__p
Frame aligner blaoncd k
J
resda:uon
I
<
Maximally Flat Low-Pass FIR vector filter
:~ I high gain at start-up
Y
I
II7
J
Time Domain vector oversampler
.or
phase error detector
PRF
rate sampling
Doppler centroid initial guess
KxPRF
rate sampling
1 Dual Hermitian Frequency Generator Linear Stepped ConboUer Oscillator
J
I D~ 1 Cenboid Estimate
Figure 2 Doppler Tracker block diagram
Much care shall be put on baseband data filter since no phase distortion can be allowed at all on outcoming raw data. Thus a FIR implementation is absolutely requested (typical 15 to 35 taps) even if the real-time implementation is quite hard. The scope of the filter is to recover the azimuth data profile from natural beam shape broadening which occurs expecially with phased array radiators. In fact whether the microwave beam -3 dB points are correctly specified a sufficiently strong fraction of total backscattered energy is folded in aliased spectrum then it is convenient to cut away frequency contributions located around
fro=-
PRF 2
and
PRF flit=+ T
The maximally flat is a choice driven by the optimal impact on global signal-to-noise ratio. So no changes within pass band is applied if the amplitude gain is shaped as in Figure 3. A cosine shaped transition band has been imposed since it satisfies the zero intersymbol distorsion and moreover the simplest numerical implementation. This has been derived from binary communication theory relevant to tunable roll-off filters.
182
tmaaalt(m Imad
trlalall~n band
r
Figure 3 Baseband FIR filter mask Also the time-domain oversampler benefits of the spectrum shape conditioning imposed by the FIR filter because of the smoothness of spectrum edges. Oversampling is based on the simple Shannon formula which assures the sampling theorema reversibility. The analog waveform can be rebuilt starting from its samples according the linear algebra relation: oo
z(t) = LZk .h(t-t,) k =-oo
where h(t) is the impulse response function of the low-pass filter. Such a function goes to zero faster than the canonical Sinc function corresponding to the theoretical sharp low-pass filter (step shaped). This helps the implementation phase making the impulse response function (IRF) length compatible with a real time execution. The K-factor oversampler applies the above relationship in corrispondence of tk,p times defined as P
tk,p = tk +(tk+l--t,)'--~
p=0,..,K-1 allowing the recover of the along track waveform in a bandwidth K times wider. The h(t) function is stored in a tabled format and superscalar routine is called to perform each sample building. The oversampled data, still centered around the zero frequency, are finally moved back in their original spectral displacement. The conjugated complex reference is then applied at 'up converter' stage performing the required frequency shift. In order to preserve the phase contents the delay introduced by the data FIR filter is compensated by a proper delay line. The overall loop stabilization is obtained selecting a proper loop filter bandwidth and phase shaping. The goal is typically the minimization of the transient duration achieving residuals in the order of some tens of milliseconds. The filter by itselt can be designed as a four pole either chebychev or elliptic filter. Nevertheless it has been discosed the opportunity of a sharp phase rotation near to 90 degrees. Then an addition modification has been obtained by means of a quasi Hilber mask shaping.
183
Several simulation of clutter echoes with and without bright targets inside has been generated in order to test the preprocessor. Results have always put in evidence the capability of the Doppler centroid restitution with an error not greater than 0.7-1.5 % of PRF used value. Both Apollo HP workstation and Linux workstation have been employed for development and simulation activities. 3 CONCLUSIONS A fast Doppler tracker for spotlight raw data auxiliary processing has been developed. It is based on a frequency-locked closed loop concept. The real-time implementation has been obtained through a correct data organization and feeding criteria to the core processor. An exaustive data restoration has been demonstered to be needed for a correct tracking (loop locking) of time varying Doppler centroid in acquired raw data. Enhanced techniques have been widely introduced at several levels with a view of optimizing performance within each stage. The double estimation of Doppler centroid allows an efficient linear prediction of FM rate changes applied by DHFG without using Kalman filters. A full test activity has been carried out on both numerically simulated data and artificial spotlight SAR data synthesized starting from ERS-1 raw data.
REFERENCES
1. A.G.Evans and R.Fischl, 'Optimal Least Squares Time-Domain Synthesis of Recursive Digital Filters', IEEE Trans. on Audio and Electroacoustics, AU-21, No. 1, 61-65, Feb. 1973. 2. A.G.Deczky, 'Synthesis of recursive digital filters using the Minimum P-Error Criterion', IEEE Trans. on Audio and Electroacoustics, AU-20, No. 4, Oct. 1972. 3. L.R.Rabiner and B.Gold, 'Theory and Application of Digital Signal Processing', PrenticeHall, 1975.
4. N.C.Currie, 'Radar Reflectivity Measurement: Techniques and Applications', Artech House, 1989. 5. B.Porat, 'Digital Processing of Random Signals: Theory and Methods', Prentice-Hall, 1993 6. J.D.Taylor, 'Ultra-Wideband Radar Systems', Ed. U.S. Air Force, CRC Press, 1995
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.) 184
1997 Elsevier Science B.V. All fights reserved.
Use of clutter maps in the high resolution radar surveillance of airport surface movements G. Galati a, M. Ferri b and M. Naldi a aUniversit~t di Roma "Tor Vergata", Dipartimento di Informatica Sistemi e Produzione, Via della Ricerca Scientifica, 00133 Roma, Italy -E mail: [email protected] bOerlikon Contraves SpA, Via Affile 102, 00131 Roma, Italy 1. I N T R O D U C T I O N The primary radar is a key element of any Advanced Surface Movements Guidance and Control System (A-SMGCS), as it is the only sensor capable to detect and locate non co-operative targets such as obstacle or accidental intruders. Recently a distributed solution, made up of a network of millimetre-wave short-range radar (Surface MiniRadar Network) has been proposed to survey the airport surface[l]. The design of the radar signal processor to perform the detection function represents anyway a challenging task, due to the following critical factors: 9 the disturbance background is generally a mixture of clutter and noise, with proportions unknown and time-varying; 9 the clutter is varied in nature (asphalt, grass, buildings, rain, etc.); 9 the clutter spatial distribution is sharply non-homogeneous; 9 the targets of interest may be either point (e.g. a suitcase) or extended (e.g. land vehicles or aircraft); 9 the clutter statistical distribution is non Gaussian. Signal processing devices traditionally employed to get a constant false alarm rate (CFAR) under varying background conditions (e.g. cell-averaging CFAR processors) are of little help in this case, due to the spatial non-homogeneity of the disturbance and to the variable extension of the targets. Both phenomena can instead be countered by clutter maps, in which the detection threshold is updated individually in each cell, as proposed in [5], where the performance of a monoparametric (i.e. incorporating an estimator of a single parameter of the clutter + noise pdf) clutter map had been analysed under some limiting assumptions. In this paper the monoparametric clutter map performance is more deeply analysed, by removing some of the simplifications employed in [5]. The monoparametric approach shortcomings are pointed out and some directions of improvement are suggested. 2. T A R G E T S AND C L U T T E R IN AN A I R P O R T E N V I R O N M E N T As already hinted in the Introduction, the number of backscattering objects in an airport environment is quite large. A tentative list of disturbance is represented by the various forms of clutter: meteorological (rain etc.), ground (asphalt, grass, etc.). In addition, the presence of man-made structures (buildings, trellis, lamp posts, etc.), and the shadowing effects, must be accounted for (e.g. through a static map). There are many types of target classes of interest: aircraft, obstacles (suitcases, mobile stairs, etc.). Backscattering characteristics, (at least the probability density function (pdf) of the echo amplitudes) must be known for both target and clutter sources.
185 A thorough investigation has been carried out for planes [2]; the radar cross section (RCS) has been found to follow a log-normal probability law with a shape factor nearly equal to 1.4. At present no data are available to the authors for the remaining categories of objects, though they are expected to exhibit a much lager RCS than that encountered at lower frequencies. Again a log-normal pdf has been found to describe well the statistical characteristics of rain and grass clutter [3], with a shape factor varying in the range [1.3+2]. In the following both clutter and targets have been assumed to follow a log-normal probability law with a shape factor equal to 1.5. 3. A M O N O P A R A M E T R I C C L U T T E R M A P IN A M I X E D B A C K G R O U N D The functional block diagram of the proposed clutter map is shown in Figure 1.
X~(k) '
I
Y..~(k)}
T
Figure 1 - Functional block diagram of the monoparametric clutter map As can be seen its core is represented by a single pole filter, which performs a recursive integration. The filter output in the k-th cell is given by ~n(h~ = ~ ( k )
+ ( 1 - ~} Yn__l(h~
(1),
where Y~(k) is the output after the n-th update, Xn(k) is the radar measurement after the logarithmic detector and ~ is the gain coefficient of the clutter map filter, setting the balance between steady state accuracy and readiness in tracking clutter variations. The input past samples are therefore exponentially weighted and summed to provide an asymptotically unbiased estimate of the mean value of the input process. The detection threshold is then set as
IYn_l (k) -t- ].~
T = [E_,(k)- 6
in the n o - target case in the presence of a target
(2).
An additive, rather than multiplicative, term is used in the threshold expression, because of the logarithmic characteristic of the receiver, needed to cope with the high dynamic range of the RF signals typically encountered in this context [4]. The selection of the constant (g or - 8) is commanded
186
by a logic whose aim is to limit the occurrence of the self-masking phenomenon, which typically afflicts clutter maps" its mechanism is better explained in Section 4. 4. C L U T T E R M A P P E R F O R M A N C E The performance of the clutter map has been evaluated through the computation of the false alarm and detection probabilities. Some simplifying assumptions have been used: the pdf of the mixture of clutter and noise has been approximated by a Gaussian function, whose parameters have been set by matching its mean and variance to those of the actual mixture determined via simulation; the same approach has been followed for the detection threshold. It has been found, by simulation, that the error is negligible for clutter-to-noise ratios (CNR) larger than 5 to 10 dB. 1 E-2
___= _
1E-3
_
1E-4 ! _ 1E-5
!
1E-6
!
1E-7
!
_
r
Q.
_
1 E-8
1E-9
\
_ _ !
1E.10 !
C/N = 20 dB filtro
_ _
-4.--
1E-11
affi 0.25
_ _ _
1E-12
!
1E-13
_!_ = _ _
~=oo~ affiO.125
_
~ -
vt = o.5
_
1E-14
I 0.04
Figure
0.08
2 -
'
I 0.12
'
I
'
0.16
False alarm probability (CNR
I
'
0.2
0.24
= 20 dB)
Under these conditions the false alarm probability can be computed as shown in [5], where an ideal logarithmic characteristic - relating the output y to the input x through the expression y = a. log(b, x) - had been used. This approximation is here removed, as a real logarithmic characteristics, including both a silencing and a saturation portion and the A/D conversion effects, has been considered. A sample resulting curves is reported in Figure 2. Both the design parameters (the threshold step-up constant # and the filter gain a) have a remarkable influence on the performance: the dependence of Psa on p is nearly exponential; a variation of a over the range considered can lead to more than a decade variation of the false alarm probability. In addition, the achievement of the desired false alarm probability requires a changing # as the CNR varies. The required variation for # is plotted in Figure 3. The problem of the determination of a single value of # capable of providing acceptable performance over the expected range of CNR values is made easier by considering the shape of the curves of Fig. 3: these curves exhibit a knee, after which the value of /~ is nearly constant. The selection of the after-knee value will guarantee the desired false alarm probability when
187
the CNR is larger than 10 dB approximately and a better-than-desired performance at lower CNR's (however, in this region the approximations used in the analysis can be quite heavy). 0.14
'
I
'
I
' .(,
< . .,,_. r
0.12
-
-
eQ 0 "0 "0
.0625 1 .125 0.10 I
0ee'0.08
-
-
i 0.06
' 0
I 10
'
I
'
20
Clutter-to-noise ratio [dB]
30
Figure 3 - Dependence of/~ on the CNR ( Py,, = 10 -6 ) Unfortunately detection performance of clutter maps suffer from the self-masking phenomenon. Slow or steady targets stay in the same cell for a number of scans and therefore contribute to raise the threshold, which ultimately gets so high to make the target undetected. An anti-self masking logic has been proposed in [5]; it is based on the principle of lowering the threshold when a target is assumed to be present. In the absence of any anti-self masking logic, the detection probability is expected to decay very rapidly as the target keeps staying in the same cell. To overcome this problem, the threshold additive constant is switched to a lower value -~ when the presence of a target is declared. As three scans at least are expected to elapse before the probability of detection reaches unacceptable values [5], the switching event between the additive coefficients/.t and -~ can be set as the occurrence of three consecutive detections. From Fig. 4 the dependence of the detection probability on S can be assessed: the higher the absolute value of S, the larger the detection probability. However, the value of ~ has to be set considering the desired map behaviour as the target exits the map: the processor has in fact to recognise the return to a no-target condition and switch back to the ~t coefficient. At the target departure a smaller absolute value of S lowers the false alarm probability, as can be seen in Figure 5. The value ~5=0.07 can finally be set as a good trade-off between the desire for a high detection probability (99%) and the needed number (larger than 4) of guard scans to switch back to the coefficient /1.
188
1.00 - -
/
0.96 - -
r/
"O 12.
o.= 0.0625 or.= 0.125 --4 =.-
(x= 0.25 (x= 0.5
0.92 - -
I
'
I
0.04
'
I
0.08
0.16
0.12
Figure 4 - Detection probability (SCR=20 dB" CNR=30 dB)
0.10
5=0.07 0.08 -5=0.055
~5=0.045
0.06 --
r
5=0.015
13. 0.04 --
0.02 --
0.00
§ )
I' 2
~
T 4
Number
T
.....
[ 6
of scans
,
I 8
, 10
Figure 5 - False alarm probability at the target departure (SCR=20 dB; CNR=30 dB" o~=0.125)
189 5. C O N C L U S I O N S The performance of a new processor, based on the use of clutter maps, to be used in a millimetrewave radar for the surveillance of airport surface movements has been analysed. It has been shown that its parameters can be easily set to achieve the desired CFAR performance and to avoid the selfmasking effect typically present in clutter maps. However the design procedure relies on the assumption of a known shape factor for the lognormal amplitude distribution for targets and clutter (assumed equal to 1.5 in this paper). The presence of any disturbance, whose probability distribution is characterised by a different shape factor, leads to different values for the PIa" The use of a new biparametric processor, capable of estimating the shape factor of the disturbance, will therefore be analysed in a further study. REFERENCES [1] G. Galati, M. Ferri, F. Marti : "Distributed Advanced Surveillance Techniques for SMGCS", ECAC APATSI and EC Workshop on Surface Movement Guidance and Control Systems, Frankfurt, 6-8 April 1994 [2] E. Angelocola, P. Piermattei, G. Fristachi :"Field Experience on Millimetre Wave Application for High Accuracy Tracking Radars", NATO AGARD Guidance and Control Panel's 57th Symposium on Pointing and Tracking Systems, Seattle (USA), October 1993 [3] N.C. Currie, R.D. Hayes, R.N. Trebits : "Millimetre-Wave Radar Clutter", Artech House, 1994 [4] M. Ferri, G. Galati, F. Marti, M. Naldi : "Advanced airport surveillance and imaging using the surface miniradar network", CIE International Conference of Radar, Beijing, 8-10 October 1996, pp. 246-249 [5] M. Ferri, G. Galati, M. Naldi, E. Patrizi : "CFAR techniques for millimetre-wave mini radar", CIE International Conference of Radar, Beijing, 8-10 October 1996, pp. 262-265
Time-Varying Image Processing and Moving Object Recognition, 4- V. Cappellini (Ed.) 1997 Elsevier Science B.V. All fights reserved.
190
Simulation of Sequences of Radar Images for Airport Surveillance Applications Fausto Marti(*), Maurizio Naldi (*), Enrico Piazza(+) (+)Department of Electronic Engineering, University of Florence Via di S. Marta, 3- 50139 Firenze- Italy Tel. +39-55-4796387, Fax. +39-55-494569, E-Mail: labtel@ingfil
.ing.unifi.it
(*) Department of lnformatics, Systems and Production and Vito Volterra Center TorVergata University Rome Via della Ricerca Scientifica- 00133 Rome- Italy Tel. +39-6-72594454, Fax. +39-6-2026266, E-Mail: [email protected] Work supported by Progetto Finalizzato Trasporti 2, Italian National Research Council 1.
Abstract
An Advanced Surface Movement Guidance and Control System (A-SMGCS) requires an integrated surveillance function based on radar sensors for non cooperating targets. A new generation of Surface Movement Radar (S-MRN: Surface MiniRadar Network) has been designed with radar image processing tools to extract orientation angle of aircraft moving on the airport surface. Tools, like radar signal and aircraft image simulators, are needed to test its performance. A radar signal and aircraft images simulator suitable to provide output image into a real airport scenario is described in this paper. The signal is backscattered from a moving aircraft, considered as a rigid body. Some results into a real airport scenario are also presented.
2.
Radarimaging of aircraft in the A-SMGCS frame
Greater safety and efficiency of the air transport require an Advanced Surface Movement Guidance and Control System (A-SMGCS) with co-ordination between the air route traffic control and the approach control. An A-SMCGS requires efficient surveillance functions in order to provide the moving aircrafts and vehicles precise position and identity, as well as other target information with high reliability and high data rate [4]s ~.t3~ The surveillance is an important function that must provide a detailed picture of all the ground traffic (aircrafts and vehicles) on the surface area of the airport. Surface Mini Radar Network (S-MRN) [1], [2] is of paramount importance because of its capability of detecting and locating non-co-operating targets such as obstacles and intruders, for warning of runway incursions and detection of not allowed and dangerous ground movements.
191 The main characteristics and functions related to aircraft radar images provided by a S-MRN can be smrunafised as follows: - high tracking precision - class labelling, i.e. classification according to aircraft shape and dimension; - image processing necessary to implement the tracking algorithms for an extended target (a large aircraft) and to improve the performance of the safety logic which provides warnings where conflicts are detected according to safety rules [3], [5].
3.
Radar images simulator
A radar signal and aircraft images simulator and an analysis tool suitable to provide output image into a real airport scenario are described in this paper. The need for a radar signal and image simulator arises from test and design of plot extraction algorithms; test and design of image processing algorithms; test and design of trackwhile-scan algorithms for extended targets; the creation of a database of aircraft image templates from different angles of sight for cross-correlation with real radar images. Radar simulators suited to the considered application are not available; therefore a new one has been developed. In fact, some aircraft images simulators do exist but such tools have the following limitations: a) the electromagnetic model of backscatter is valid for the X-band and are extended up to 16 GHz only, not in the millimeter wave region; b) the tools are not easily integrated with other software for purpose of research and design. The proposed radar images simulator provides simulation of Tx and Rx chain of a Surface Movement Radar (like S-MRN) and simulation of Radar Cross Section (RCS) of aircraft. The software provides for an aircraft an electro-magnetic model obtained either by Physical Optics theory and by rough surface theory, suitable to millimeter band (95 GHz) and processing of the signal received from each radar resolution cell. Each position of an aircraft in respect on airport scenario is simulated providing image of aircraft like the new generation S-MRN image. Fig. 1 shows the basic diagram of simulator. It has been realized on an UNIX SUN workstation and it is made up of graphics and computation functions.
IMAGES
] ~
\
" ' " " ' ~ - " " """
'
DATA F I I . ~
J
J
Fig. 1 Basic block diagram of the simulator
3.1
The aircraft models
The proposed aircraft model provides information about the real shape and dimension of aircrafts (Boeing 747, Airbus A300, MD80, DC9). All the aircraft surfaces are described by means of a combination of geometrical primitives: cylinder, frustum of
192 cone, parallelepiped, dihedral, trihedral. The aircraft parts where there are connections between the wings and the fuselage are described by dihedrals. Finally, the engine intake and engine exhaust are described by trihedrals. Next, a flat plates model of the aircraft is introduced which makes use of 3D representations of fuselage, wings, engines, tail wings and rudder: each 3D element is approximated with small flat plates (circa 0.2x0.2 m) characterised by its position and orientation with respect to the aircraft reference system. The dihedral and trihedral elements keep the original 3D representation. [7] Physical Optics theory provides RCS of flat plate, dihedral and trihedral depending on its dimension and orientation. Another data file with positions and orientations of the aircraft with respect to the airport's reference system is used to have representation of its trajectory into an airport scenario. 3. :9
The radar's parameters
The features of the radar sensor [ 1], [2] are used to build the radar model. Parameters as frequency, antenna rotation speed, range resolution, azimuth resolution, pulse repetition frequency are useful to simulate Rx and Tx chains. 3.3
Functional structure of the simulator
The simulator receives the radar model file, the aircraft model file and the aircraft trajectory as input. Fig. 2 describes the fundamental simulation functions. a) The first step is to define the aircraft flat plates model. This step is made one thne for each aircraft. The model is independent from the relative position between aircraft and radar. b) Next the radar model and the trajectory of the aircraft are built. This step is made one time before start the simulation. c) Next there is the analysis of the elementary flat plates that are visible from the radar, as a function of the position of the aircraft (i.e. each flat plate) and of the radar. d) Next there is the calculation of Radar Cross Section of each scattering center and calculation of the complex echo received from each one. e) Next the total RCS and the power of the echo signal are calculated for each radar resolution cell. Thenrml noise is taken into account to provide total echo signal at the input of the receiver. Signal to thermal noise ratio is also calculated. f) Next receiver and video integration processing are simulated and finally aircraft image~re generated.
193
airc :-aft ~late" lll()dq19file
"flat
radar par~uneters file
laale,:tory fi le ,
calculation of parmneters of each elementary scatterer (,position, orientation, class)
selection of scatterers that are visible
~ ~
sunulation of logaritntic receiver
calculation of RCS of each scatterer
video integration
simulation9 of thermal . gausslan noise
,-rod
calculation of echo signal amplitude
A/I) converter
---~
aircraft image
I
Fig. 2 Fundamental simulation functions
4.
Graphical software
The image provided by simulator feed a Digital Scan Converter whose output ks 32x32, 64x64 or 128x 128 pixel wide for a displayed area of 96, 192 or 384 meters (that is greater than an aircraft). The resolution is 3m x 3m/pixel. The output is suitable to the characteristics of images supplied by the new generation of S-MRN graphics display. A graphical images software follows the Digital Scan Conveter and uses a dedicated window of the work space to display a surface area of the airport including the aircraft. The graphical images software is made up of graphics and computation functions and utilizes the Xlib and Xview libraries to provide and to manage a Graphic User Interface (GUI). On the graphic screen of the workstation, the radar data may be represented either in the raw format (grey level related to signal amplitude) or with each target represented by the shape of the aircraft. There are several shapes available, ahnost one shape for each different aircraft model. Moreover, it is possible to draw the airport layout with a suitable scale. Fig. 3 shows data flow connections between software modules.
194
aircr~d't m,,del file
Simu lat i,.,n SW
#
Colffiguration file
t-
.raz file
trajectory file
Runway file
Digital Scan Converter
ffaphical hnages software
2D
hnage
Fig. 3 Flow chart of graphical images software
Datas of image at the output of simulation software are described by different kinds of files, (i.e. file ASCII named ".raz"). For each rsolution cell is provided range and azimuth in respect to reference system of airport and grey level (8 bit). The Configuration file contains information that are used in the graphical images software such as number of samples in x, y after the scan converter; position of the radar in the airport reference system; screen mode (color or b/w); radar dot dimension in screen pixels; representation scale. The file Runways provides information about the airport layout. The aircraft model is a file containing the shape of the aircraft and trajectory file provides positions and orientations with respect to airport reference system of the aircraft to be simulated. 4.1
Path simulation
Trajectory file is automatically generated by a procedure that computes several positions of an aircraft on the runway on the basis of the digitized map of the airport. The first step is to represent the runway and the taxiways by the mean of simple objects such as lines and arcs. A software computes the positions of a point moving on lines or arcs with a given sampling distance (Fig. 4). This way if far from the truth since an aircraft don't move at costant speed on the runway. Anyway resulting positions are shown in Fig. 5.
195
v Ira] PI
Pa
P~ P,
u[m] As o s. . . . . . . . . .
~_~
stm]
Fig. 4 A path generated with a sampling step of As
Fig. 5 Position of an aircraft running on the runway 16R of Rome-Fiumicino Airport computed on the b~Lsisof a constant step As.
5.
E x a m p l e s of simulated image and airport scenario
Here it follows the simulated radar charateristics used as input for the simulator software. Fig. 6 shows an example of simulated image of aircraft B747-200. Radar_X(m) 1952.01 Radar_Y(m) 3319.37 Radar_Z(m) 35.0
196
RPM(r.p.m.) 60.0 Range_Resolution(m)3.0 Azhnuth_Resolution(degree)0.18 PRF(Hz) 100(X).O
Fig. 6 Synthetic imageas it is displayed on the Workstation monitor
REFERENCES [ 1] G. GNarl, M. Ferri, F.Marti, Advanced Radar Techniques for the Air Transport System: the Surface Movement Miniradar Concept, 1994 IEEE National Telesystems Conference S.Diego, May 1994. [2] G. GNarl, M. Ferri, F.Marti, Distributed advanced surveillance for SMGCS, 1994 ECAC-APATSI-EC WORKSHOP on SMGCS Frankfurt, April 1994. [3] G.L. Foresti, M. Frassinetti, G. Galati, F. Marti, P.F. Pellegrini, C. Regazzoni, hnage Processing Applications to Airport Surface Movements Radar Surveillance and Tracking, 1994 IECON' 94, Bologna (Italy), September, 1994. [4] EUROCAE, Surface Movement Guidance and Control Systems, EUROCAE WG 41 Report, ED-200 A, 1994 [5] Pellegrini P.F., Palombo P., Leoncino F., Cuomo S., Frassinetti M., Piazza E., Moving Object Detection and Tracking in Airport Surface Radar Images, in Cappellini (Ed.), Time-Varying Image Processing and Moving Object Recognition 3, Elsevier, Amsterdam, 1994 [6] Pellegrini P.F., Piazza E., Airport Surface Radar Signal Analysis For Target Characterization. A Model Validation, IEEE IECON 95 Conference, Orlando, FL, nov 1995 [7] G. Galati, A. Manna, F.Marti (1996) Simulatore 3D Di Irmnagini Di Aeromobili Per Sorveglianza Aeroportuale. 1996 Intenal Report, Vito Volterra Center-University of Rome TorVergata, July, 1996.
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.) 9 1997 Elsevier Science B.V. All fights reserved.
197
Data fusion and non linear processing o f E.L.F. signal for the detection of Tethered Satellite System S.Monteverde, R.Ruggerone, D.Traverso, S.Dellepiane, G.Tacconi D.I.B.E. Facolt/l di Ingegneria, Universit~i di Genova, via Opera Pia 1 la 16145 Genova This paper deals with the analysis of EM waves propagated from the ionosphere, in the ULF/ELF/VLF frequency range, as acquired during the NASA/ASI Tethered Satellite System missions. A non-linear adaptive pre-processing step is described devoted to the deletion of random acquisition noise from the utilised SQUID magnetometers. Through a data fusion process the information data set is enlarged while maintaining limited analysis time interval, so that detection probability is increased. Finally, a time-frequency analysis of the so-called background noise is presented, made possible by the proposed pre-processing method. 1. INTRODUCTION One of the main objectives of the Tethered Satellite System (TSS) is the investigation of EM Waves propagated from the ionosphere, dealing with frequencies in the Ultra/Extremely/Low range (ULF/ELF/VLF: 3-3000 Hz) [ 1]. To this purpose, the University of Genoa participated to the first NASA/ASI (Agenzia Spaziale Italiana) TSS-1 mission and has also been involved in more recent TSS-1R (Reflight) mission, where DIBE has been in charge of acquiring and analysing electromagnetic signal from space. An acquisition set up, made of two magnetometers, is utilised in order to acquire and analyse the signal in the correct frequencies range. A tri-axial system composed of traditional induction coils together with a tri-axial SQUID (Superconducting Quantum Interference Device) [2] system have been utilised both in TSS-1 and TSS-1R campaigns. The systems have been installed at a site lying behind the Shuttle orbital track. This paper deals with the analysis of the above mentioned signal, taking into account the Near-Field hypothesis conditions and the main problems related to this specific task. In particular, a non-linear adaptive pre-processing step is here described, devoted to the preparation of the signal for a subsequent detection step through the deletion of random acquisition noise from SQUID magnetometers. To prove the validity of the proposed method, a spectral analysis of the processed signal is reported, where the features expected from the theoretical model can be easily detected even when analysing a very short time interval. Finally, the contemporary exploitation of the two magnetometer system, by means of a data fusion process, allows to increase the information data set, while maintaining limited analysis time interval, so that the detection probability is increased.
198 2. SIGNAL ACQUISITION AND MODELLING Similarly to Middleton's model of ELF natural noise [3] the electromagnetic signal of interest s(t) may be thought as composed by various independent sources, and can be described as: s(t)= x(t)+ hA(t)+ riB(t)
(1)
where x(t) represents the eventual target signal, emitted (in a spontaneous or inducted way) by the Tethered Satellite or by any analogous system located in the ionosphere; hA(t) represents the impulsive noise due to atmospheric events (i.e. thunderstorm and lightning activities [4]); riB(t) is the so-called background noise, due to man-made noise and to the resonance cavity generated by the interaction of the ionosphere with the Earth. Its characteristics change according to the Earth geographical co-ordinates of the acquisition site and to the daily evolution of the ionosphere's structure. These signal sources are characterised by different statistical properties that make the task of target detection a very difficult. In fact, while hA(t) may be considered a stationary process only over long periods, riB(t) is a quasi-stationary process with periodic characteristics due to the properties of ionosphere, which change according to the daily evolution of the Earth around the Sun. Finally, the target signal x(t) is a transient signal, with a very low energy, as compared with the other two components. The periodicity of the Shuttle orbital movement is linked to the periodical repetition of this signal component. These considerations suggest that it is impossible to apply the classical models, be them Gaussian or not [5], to the background ELF noise acquired also because these models usually do not address the problem of incompatible temporal constraints as those mentioned above when speaking of the different statistical properties of hA(t) and riB(t). In addition, we are not in presence of correlated noise, and the spectrum has to be carefully estimated to design a whitening pre-filtering step. To these problems, the difficulties of the acquisition process are added, as ELF signals are very difficult to be registered. The two instrument platforms have been exploited as to overcome the problems related to both of them. In fact, a magnetometer system based on traditional coils is usually affected by a non-linear and eventually dynamic transfer function, while the flat frequency characteristic of SQUID is accompanied by random noise that must be deleted by a non-linear adaptive processing. The purpose of this paper is to present a pre-processing and data fusion method to be applied on the acquired signal, which allows a precise modelling of the involved signal, in order to improve the detection probability. 3. NON LINEAR PREPROCESSING Operating under noise additive hypothesis the acquired signal ss(t) may be modelled as ss(t) = s(t) + ~(t) + j(t)
(2)
199 where x(t) and j(t) are random signals representing the intrinsic trend and flux jumps added to the signal s(t) by the acquisition instruments. Concerning the random acquisition noise introduced by SQUID, this noise is adaptively identified and cancelled by means of a non-linear digital filter.
L 5
V
0
50
100 >
Sec
0
Figure 1. Acquired signal ss(t)
50
Sec
)
100
Figure 2. Filtered signal s(t)
The average-slope method has been utilised for the identification of trend parameters. Windows of various size were employed to avoid side-effects. The random nA(t) noise component is also cancelled out after identification of main impulses. Despite nx(t) shows a white spectrum over a long period time (more than 1 hour), it mainly represents a non-white noise source over a shorter period of time (few minutes), considered as appropriate for the required analysis. The improvement in the signal due to the filter is proved by the spectral analysis conducted over the filtered data, as compared with spectral analysis on the original acquired data, reported in Fig.3. 80
~~~OI .
.
.
.
.
dBm 501
2G 0
30
,
L
i
i
:
Frequency, Hz
Figure 3. Power Spectral Density of acquired data
6O
%-
Frequency, Hz
9
> 6O
Figure 4. Power Spectral Density after the random trend has been deleted
It is clear that, by cancelling the SQUID acquisition noise, a better estimation of low frequencies is achieved, since exponential components due to the random trend have been deleted as it can be noticed by comparing Figs. 3 and 4.
200 In addition, cancelling nA(t) impulses allows a better resolution of peaks present in the estimated spectral density function (Fig.5).
Figure 5. Power Spectral Density of filtered data
Figure 6. Daily evolution of the first Schumann's peak
The correctness and usefulness of such a filtering process is also proved by the conformity of background noise nB(t) with the theoretical model [4]. In fact Schumann's resonance peaks, due to the resonant cavity formed by the ionosphere and the Earth, are sharp and clearly visible in the spectrum at frequencies around 8, 14, 21 Hz and higher. This was achieved by analysing few minutes interval, a much shorter one than that required by formerly proposed methods [6]. 4. DATA FUSION The so obtained filtered data So(t) are then used as a reference input for the dynamic calibration of coils. These instruments can be represented by a Transfer Function (called He(f)) and its module can be estimated by
_
so(f)
(3)
The spectnun of the filtered SQUID signal So(t), So(f), plays the role of reference input signal and So(f) is the spectrum of coil acquired signal se(t). After identification of coil frequency characteristics by applying equation (3), and after phase identification, the frequency distortion introduced by coils is eliminated by an inverse filter. Finally, such a compensated signal represents an additional data source available for data fusion and a better understanding of the signal's dynamic behaviour at low frequencies.
201 5. RESULTS By the method presented in the paper, a precise analysis of the ELF signal and its temporal evolution can be performed. The dynamic behaviour of Schumann's resonance peaks can be observed in a time-frequency space like that in Fig. 6, where the daily evolution of the 8 Hz resonance is shown. By analysing the signal corresponding to the time interval going from 4 a.m. to 5 p.m. (local hour) a power increase due to the sun's electromagnetic emissions [7] can be visualised thanks to the growth in width and height of the peak corresponding to the resonance. In dealing with target detection we can exploit the periodical characteristic due to the orbital movement of the target system. If we identify different portions of x(t), corresponding to the closest points of approach of the TSS to the ground-based receiver we can cut and fold them in order to increase the detection probability. In fact, it is possible to consider these portions of x(t) to be different detection opportunities of one single process in such a way that an increase in SNR and detection probability will occur.
6. CONCLUSIONS The proposed method allows to process the acquired signal in an adaptive and automatic way. The signal-to-noise ratio improvement estimated by means of synthetic signals is about 15-29 dB. The proposed pre-processing method has been applied to signals recorded by different acquisition set-up in different signal-to-noise conditions and has always proved to achieve good results. REFERENCES
[1] [2] [3] [4]
[s] [6]
[7]
C.B. Powers, C. Shea, T. McMahan " The first mission of the tethered satellite system "ed. Essex Corporation, 1992, Huntsville, Alabama J.Clarke, "Gli SQUID",Le scienze, No 314, Ottobre 1994 S.A.Kassam, "Signal detection in Non-Gaussian noise ",ed. Springer Verlag, USA, 1988 J.E.Evans, A.S.Griffiths, "Design of a Sanguine noise processor based upon world-wide Extremely Low Frequency (ELF) recordings ",IEEE Transactions on Communications, Vol. Com-22, No. 4, p. 528-539, April 1974 E & G S.V.Czarnecki, J.B.Thomas, "Nearly optimal detection of signal in non-Gaussian noise",Department of Electrical Engineering and computer science, Princeton, 1994 G.Tacconi, S.Dellepiane, L.Minna, C.Ottonello, S.Pagnan, "Campaigns of ground listening to the e.m. emission expectedfrom spaceborne electrodynamic tethered system ", Conference paper, 4th Int. Conf. Tethers, Washington, April 1995 Ya.L.Al'pert, "The near-earth and interplanetary plasma",ed. Cambridge University Press, UK, 1983
This Page Intentionally Left Blank
F DIGITAL PROCESSING OF BIOMEDICAL IMAGES
This Page Intentionally Left Blank
Time-Varying Image Processing and Moving Object Recognition, 4- V. Cappellini (Ed.) 9 1997 Elsevier Science B.V. All rights reserved.
205
A Simple Algorithm for Automatic Alignment of Ocular Fundus Images L. Ballerini a and G. Coppini b and G. Giacomelli r and G. Valli a ~Department of Electronic Engineering, University of Florence, Italy bCNR- Institute of Clinical Physiology, Pisa, Italy r
di Clinica Oculistica, University of Florence, Italy
This paper describes an automatic alignment algorithm for registration of ocular fundus images. In order to enhance vessel structures, we used a spatially oriented bank of filters designed to match the properties of the objects of interest. To evaluate interframe misalignment we adopted a fast cross-correlation algorithm. The performances of the method have been estimated by simulating shifts between image pairs and by using a cross-validation approach. We also propose temporal integration techniques of image sequences so as to compute enhanced pictures of the overall capillary network. 1. I N T R O D U C T I O N Retinal vessels are the only vascular network directly observable from the outside of our body. Many systemic diseases, such as hypertension, arteriosclerosis, and diabetes mellitus, are studied using retinal vessels as index of diagnostic staging and therapeutic efficacy [1,2]. The Scanning Laser Ophthalmoscope (SLO) is an instrument that allows observation of a retinal image on a video monitor [3]. Digital video fundus angiography has several advantages over conventional angiography, including real time access to retinal images, the possibility of computer processing, and increased light sensitivity that makes indocyanine green angiograpy possible [4]. The use of SLO for digital video fundus angiography may improve spatial resolution with respect to conventional photography and provides much more temporal information than conventional angiography [4,5]. This explains the interest of several research groups for SLO imaging. Some authors have studied retinal circulation using fluorescein angiography [3,4,6]. Wolf et al.[6] present quantitative measurements of blood velocities in retinal capillaries and propose a method to evaluate vessel morphology. Tanaka et al.[3] observed the transit of numerous fluorescent dots in the perifoveal capillaries, and used them to identify direction and velocity of blood flow in the retinal capillaries. Nasemann et al.[7] demonstre new diagnostic possibilities in fluorescein angiography obtained with SLO such as the computation of circulation times and the imaging of erythrocytes and leucocites. Van de Velte et al.[8] describe some applications of SLO to microperimetry that attempt to correlate anatomical features or pathologic findings in the fundus with retinal function. Rehkopf et al.[9] developed a
206 method based on indicator diluition theory and image processing technology for estimating total retinal arteriovenous circulation time and the transit time in individual arteries. Alignment of temporal sequences of retinal images is crucial for quantitative analysis. In the analysis of fundus images previous investigators have used several different registration methods, which can be classified into two broad groups: interactive and automated. Automated registration methods may be divided into local and global methods. Local methods use a subset of the image information by extracting distinctive features; registration is performed only on the extracted features. Global methods use all pixel values in order to determine a single best set of transformation parameters for a given image pair [10]. Automated image registration methods that use local information commonly extract the ocular blood vessels and/or their crossing. For example Yu et al.[ll] used the branching points of retinal vessels as registration templates. Sequential fundus image alignment is done by using the sum of the absolute values of the differences method. Hart and Goldbaum [12] describe a method for identifying control points automatically using the branching and crossing points in the retinal vessel network. They propose to use a matched filter to exctact blood vessel segments. Registration is performed with an affine transformation that is computed using the control points. Cideciyan [10] describe a global registration method based on the cross-correlation of triple invariant image descriptors. One of such descriptors is the log-polar transform of the Fourier magnitude, which removes the effects of translation and converts rotation and uniform scaling into independent shifts according to orthogonal directions. Our approach is a global registration method based on image cross-correlation following spatially oriented filtering. 2. F U N D U S I M A G E S Retinal images were taken by a SLO, with a frequency of 25 frames per second following the injection of a bolus of fiuorescein. These images were digitized into 256 • 256 pixel matrices with 256 gray levels per pixel. The retinal region is approximatly 20 • 20 degrees. In a fundus image (see Figure 1) the darker region is the macula and the lighter curvilinear structures are the retinal blood vessels; they branch, cross and become smaller the farther they are traced from the optic nerve. The optic nerve stands out from the retina as a bright, vertically-oval disk with a sharpe edge. In theory the complete macular network of capillaries can be observed. 3. A L I G N M E N T M E T H O D The misalignment is due to changes in the acquisition geometry which may occur in a few milliseconds in the case of sequential frames of a fluorescein (or indocyanin green) angiogram. Misalignment is due both to eye movement and SLO equipment movement. As the patient head is kept fixed during SLO acquisition, we consider constant scaling and no rotation, so we can assume only translatory movement between two subsequentail frames. Images we deal with are projections of a spherical surface, but it can be easily shown that geometrical distortion has a negligible effect.
207
Reference image
Extractsub-imageI
Other images Extract
1
[ Filtersub-image ]
L
Binarizesub-image[
sub-image I
1
Filter
sub-image I
1
Binarizesub-imageI
Computecross-correlation
Figure 1. SLO image of ocular fundus: the darker region is the macula and the lighter structures are the retinal blood vessels.
Figure 2. Flow-chart of our automatic image alignment algorithm.
3.1. Algorithm description We have developed a procedure (summarized in Figure 2) based on the automatic extraction of a vascular feature map amenable to a binary image representation. A simple global threshold is not adeguate to extract the blood vessels from the retinal background because of noise, background variability, the low and space-varying contrast of vessels. Thus we resorted to using spatial filtering to enhance and detect vessel structure. Filtered fundus images were segmented by a trainable threshold unit. The cross-correlation was used as index of similarity between binarized images to compute the needed realignment shift.
3.2. Filtering technique Our filtering technique is based on the optical and spatial properties of the objects to be recognized. We can observe that blood vessels have typical properties such as small curvature and they appear lighter relative to other retinal surfaces. The two edges of a vessel always run parallel to each other; such objects may be represented by piecewise linearly directed segments of finite width. On this ground, we studied two different kinds of filters. In the first case, source images were filtered by a Laplacian of Gaussian (LOG) kernel: 1
(
2
x2+y2) x2+~,2
(1)
where a is the standard deviation. The half-width of the central lobe of LoG(x,y) is w = 2x/~cr and can be adopted as measure of the filter scale [13]. Despite the well known propertiers of LoG filter [14], we observed that, in our case, it tends to produce noisy outputs unless high a's are used. However, in this case vessels
208 are strongly blurred. Thus, we used a different method based on the observation that the grey-level profile of the cross section of a blood vessel is well approximated by a Gaussian shaped curve. Consequently, we used oriented matched filters [15] which model the shape of the intensity profile of the blood vessels by a Gaussian bar:
K(x, y)
= exp(-x2/2o 2)
for
L/2
l Y I-<
(2)
where L is the length of the vessel segment and cr is estimated according the average blood vessel width. We constructed twelve different templates that are used to look for vessel segments along the twelve directions Oi, with 04 = ~ , i - 0...11 Following a simple strategy, we compute an enhanced picture keeping for each pixel the maximum of the twelve kernels. Unfortunatly this approach exibits a poor noise behavior. Thus, we used a one layer back-propagation neural network to integrate the output of such filters. For each pixel, the output of the twelve filters feed the network that should provide a binarized image. The network task is both combining the filtered images and thresholding them. The training set is composed by 500 examples extracted from subimages containing both blood vessels and retinal background. In Figure 3 we report two examples of binarized images. 3.3. C r o s s - c o r r e l a t i o n Afterwards, cross-correlation of binary images is computed so as to estimate the needed realignment shift. Given the original sequence of images, we produce a new image sequence in which all images are aligned shifting each image according to its displacement estimate:
r(m,
= I(m + i,
(3)
+ j).
A classical index of similarity is the two-dimensional cross-correlation function:
p(i ' j ) -
M-1 En=0 N-1 I1( m, It)" I2(i + ?Tt,j + It) Era=0 M-1 En--O N-1 l12(m, It) Em--O M-1 En=0 N-1 I~(rn, n) " ~2m=0
This function attains its maximum
i = 0, ...I j = 0, g
(ma.x{p(i,j)})when z,3
the images are realigned:
(4)
(i,j)
are the needed shift to bring images into alignment (see also Figure 4). It must be pointed out that the high computational cost of this function ( M . N . I - J multiplications) can be re'duced to M - N . I . J integer additions in the case of binarized images. 4. I N T E G R A T I O N
TECHNIQUE
We considered several temporal integration techniques to create enhanced images of vascular networks. Temporal filtering is commonly used to reduce noise in image sequences, examples of such filters are the temporal low-pass filter. The simplest techniques we used are based on the pixel-wise operators, such as average and maximum. Averaging reduces images noise, but it can blur small features. Moreover averaging works well in case of a zero-mean, time-uncorrelated noise, such an assumption is not
209
Figure 3. Examples of ocular fundus images binarization: a) and b) original images, c) e d) corresponding binary images (inverted) showing enhancement of vessel features.
Figure 4. Plot of a typical cross-correlation function.
210 verified, in general, in fundus images. On the other hand the maximum operator keeps small capillaries, however images obtained in this way are lighter and noisier than the originals. This suggested us another integration technique based on both spatial and temporal information: 9 if(max-#)
Io,~t(x, y) = max else where: 1 # = -~ ~']S I i ( m , n ) O" -- r
#)2
i.e. the maximum is kept in the output image only if the differences between its value and the mean of the neighbour pixels is less than the gray level standard deviation which provides a smoothness constraint. In Figure 5 we give an example of the attainable results. Top images are originals from the sequence and they are misaligned by i = -3, j = 4 as computed by our algorithm. The bottom image is obtained by the nonlinear integration procedure applied to realigned frames. 5. C O N C L U S I O N S The performances of the method have been estimated by simulating misalignments between image pairs. When there are more than two images to be registered, the accuracy of the registration method can be quantified using a cross-validation approach. As concerne image sequence integration, the adopted non-linear filtering allows a reasonable trade-off between computational complexity and detection sensitivity/specificity. The alignment algorithm is fully automatic and we hope in the future it could be done in near real-time. Despite its simplicity, the method is accurate enough for the image sequences we deal with, and it is faster in calculation speed as compared to other sequential methods. It can be useful in clinical applications such as the study of retinal blood flow and the analysis of capillary-network morphology. REFERENCES
1. J. J. H. Yu, B. N. Hung, and H. C. Sun, "Automatic recognition of retinopathy from retinal images", in Proceedings Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1990, vol. 12, pp. 171-173.
2. L. Zhou, M. S. Rzeszotarski, L. J. Singerman, and J. M. Chokreff, "The detection and quantification of retinopathy using digital angiograms", IEEE Transactions on Medical Imaging, vol. 13, pp. 619-626, 1994.
3. T. Tanaka, K. Muraoka, and K. Shimizu, "Fluorescein fundus angiography with scanning laser ophthalmoscope", Ophthalmology, vol. 98, pp. 1824-1829, 1991.
Figure 5. a) and b) original images from the sequence, c) the resulting image obtained after alignment and integration.
4. D. A. Frambach, M. P. Dacey, and A. Sadun, "Stereoscopic photography with a scanning laser ophthalmoscope", American Journal of Ophthalmology, vol. 116, pp. 484-488, 1993.
5. S. Wolf, H. Toonen, T. Koyama, D. Meyer-Ebrecht, and M. Reim, "Scanning laser ophthalmoscopy for the quantification of retinal blood-flow parameters: a new imaging technique", in Scanning Laser Ophthalmoscopy and Tomography, J. E. Nasemann and R. O. Burk, Eds., chapter 7, pp. 91-95. München: Quintessenz, 1990.
6. S. Wolf, O. Arend, H. Toonen, B. Bertram, F. Jung, and M. Reim, "Retinal capillary blood flow measurement with a scanning laser ophthalmoscope", Ophthalmology, vol. 98, pp. 996-1000, 1991.
7. J. E. Nasemann and M. Müller, "Scanning laser angiography", in Scanning Laser Ophthalmoscopy and Tomography, J. E. Nasemann and R. O. Burk, Eds., chapter 5, pp. 63-80. München: Quintessenz, 1990.
8. F. Van de Velte, A. E. Jalkh, O. Katsumi, T. Hirose, G. T. Timberlake, and C. L. Shepens, "Clinical scanning laser ophthalmoscope applications: An overview", in Scanning Laser Ophthalmoscopy and Tomography, J. E. Nasemann and R. O. Burk, Eds., chapter 2, pp. 35-47. München: Quintessenz, 1990.
9. P. Rehkopf, T. R. Friberg, L. Mandarino, J. Warnicki, D. Finegold, D. Cappozi, and J. Homer, "Retinal circulation time using scanning laser ophthalmoscope-image processing techniques", in Scanning Laser Ophthalmoscopy and Tomography, J. E. Nasemann and R. O. Burk, Eds., chapter 6, pp. 81-89. München: Quintessenz, 1990.
10. A. V. Cideciyan, "Registration of ocular fundus images", IEEE Engineering in Medicine and Biology, pp. 52-58, 1995.
11. J. J. H. Yu, B. N. Hung, and C. L. Liu, "Fast algorithm for digital retinal image alignment", in Proceedings Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1989, vol. 11, pp. 374-375.
12. W. E. Hart and M. H. Goldbaum, "Registering retinal images using automatically selected control point pairs", in Proceedings IEEE International Conference on Image Processing, 1994.
13. R. J. Schalkoff, Digital Image Processing and Computer Vision, Singapore: Wiley, 1989.
14. A. Huertas and G. Medioni, "Detection of intensity changes with subpixel accuracy using Laplacian-Gaussian masks", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 5, pp. 651-664, September 1986.
15. S. Chaudhuri, S. Chatterjee, N. Katz, M. Nelson, and M. Goldbaum, "Detection of blood vessels in retinal images using two-dimensional matched filters", IEEE Transactions on Medical Imaging, vol. 8, pp. 263-269, 1989.
Automatic Vertebrae Recognition throughout a Videofluoroscopic Sequence for Intervertebral Kinematics Study

P. Bifulco a, M. Cesarelli a, R. Allen b, J. Muggleton b, M. Bracale a

a Dept. of Electronic Engineering, University of Naples "Federico II", Via Claudio 21, 80125 Napoli, Italy
b Dept. of Mechanical Engineering, University of Southampton, Highfield, Southampton SO17 1BJ, England

1. INTRODUCTION

Intervertebral kinematics closely relates to the functionality of the spine and is associated with various spine pathologies. Particular interest is devoted to lumbar spine mechanics. An "in vivo" analysis of intervertebral kinematics is attempted using a technique based on videofluoroscopic investigation [1]. This method can provide useful diagnostic data while keeping radiation exposure low enough to be acceptable for clinical application. Coronal and sagittal images of the lumbar spine are sequentially grabbed from a videofluoroscopic device during the patient's spontaneous movements. On the other hand, the low X-ray dosage results in poor-quality time-varying image sets. A very precise recognition of vertebra positions is required for a reliable estimation of kinematic parameters. An automatic process of vertebra position recognition, based upon cross-correlation, has been implemented, proving precise and robust with respect to noise. The Intervertebral Angle (IVA) and the Instantaneous Centre of Rotation (ICR) have been computed as kinematic indices of planar motion. A calibration model was used to assess the accuracy and the precision of the measurement of the kinematic parameters. Error analysis suggests that this method improves the intervertebral kinematics computations.
2. METHOD

The estimation of motion is performed from corresponding features observed at different times. Usually a variety of different features are considered, such as points, straight lines and corners belonging to the observed object. In our case the four points (landmarks) corresponding to the corners of the imaged vertebral body were chosen as reference features. Other vertebra components suitable as effective landmarks, such as the processes, are not always visible, due to the poor image quality. The high image noise level does not allow the use of standard procedures for automatic landmark location (often based on derivative operators).
The manual identification of the above landmarks throughout the sequence, still in use, is insufficiently accurate. In fact, large errors in the measurement of the kinematic parameters may result from relatively small errors in the identification of the input spatial landmark co-ordinates [2]. Manual intervention, in particular, is regarded as a major contributor to error [3]. To overcome these drawbacks an automatic process of vertebra position recognition has been implemented, which also proved robust with respect to noise. The use of the cross-correlation index provides an effective estimation of the position combined with a certain independence from noise. Therefore the recognition procedure is based upon cross-correlation. Assuming that T and M are two matrices representing a template and a part of the subsequent image respectively, the cross-correlation function is given by:
$$\rho = \frac{\sum_{i,j} T(i,j)\, M(i,j)}{\sqrt{\sum_{i,j} T^2(i,j)\; \sum_{i,j} M^2(i,j)}} \qquad (1)$$
From a qualitative point of view the cross-correlation can be regarded as a "similarity" index. Hence, the cross-correlation maximum locates the part of the subsequent image most similar to the previous image (the template). This template constitutes all the information we use, without assuming any a priori knowledge. A manual landmark selection on the first frame of the sequence is, however, still carried out by the physician. A template including the entire body of the vertebra is consequently generated (Fig. 1).
Figure 1: Use of templates for the recognition procedure (landmarks, corner and vertebral-body templates shown on the current and subsequent images).
By evaluating the maximum value of the cross-correlation between the template and the subsequent image, the position of the vertebra centre is estimated. Subsequently, through successive rotations of the template, the cross-correlation approximately yields the angle of rotation and refines the centre location. This procedure leads to a rough vertebra location. Since we are interested in the identification of the four landmarks, further cross-correlations involving the four corner templates (Fig. 1) are used, in combination with the previously calculated parameters. As a result of these stages the landmark coordinates of each vertebra are automatically detected.
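The following sketch shows one way to evaluate the correlation surface of eq. (1) for a template against a search image; it is a direct (slow) implementation we wrote for illustration, not the authors' code.

```python
import numpy as np

def crosscorr(template, image):
    """Normalized cross-correlation (eq. 1) of a template T against every
    position of a search image M; returns the correlation surface."""
    th, tw = template.shape
    t = template.astype(float)
    tnorm = np.sqrt((t * t).sum())
    out = np.full((image.shape[0] - th + 1, image.shape[1] - tw + 1), -np.inf)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + th, x:x + tw].astype(float)
            den = tnorm * np.sqrt((patch * patch).sum()) + 1e-9
            out[y, x] = (t * patch).sum() / den
    return out

# rough vertebra centre estimate = argmax of the correlation surface:
# y, x = np.unravel_index(np.argmax(surf), surf.shape)
```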
Figure 2: Vertebral landmarks automatically identified in two subsequent images.

The detected landmarks do not respect the assumption of rigidity we hold for the vertebrae [4]. A restoration of rigidity is performed by minimizing the sum of the squared distances from the detected landmarks, which are considered a good estimate of the correct positions. Finally, once the absolute kinematic indices are calculated, the intervertebral angles and the instantaneous centres of rotation are computed for vertebra pairs. These parameters completely characterize a planar motion. Fluoroscopic image sequences of the bending of a calibration model (Fig. 3), consisting of the L3 and L4 lumbar vertebrae, have been used for validity assessment. The two vertebrae are linked at the disk level by a universal joint and their motion has been constrained to known values using a system of goniometers. The automatic detection procedure has been tested using these sequences to assess the accuracy and the precision of the measurements. The L4 vertebra of the calibration model was rotated, with respect to L3, in a series of 5-degree steps. Consequently, the expected kinematic parameters are successive intervertebral angles of 5 degrees and instantaneous centres of rotation all placed at the middle of the universal joint (see the results paragraph).
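The paper only states that rigidity is restored by a least-squares minimization; a standard closed-form solution for such a fit is the Kabsch/Procrustes method, sketched below under that assumption (names and the 2D specialization are ours).

```python
import numpy as np

def fit_rigid(ref, pts):
    """Least-squares rigid (rotation + translation) fit of a rigid model
    landmark set `ref` onto the detected landmarks `pts` (both (4, 2)
    arrays), minimizing the sum of squared distances (Kabsch solution)."""
    cr, cp = ref.mean(axis=0), pts.mean(axis=0)
    H = (ref - cr).T @ (pts - cp)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # exclude reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cp - R @ cr
    return (ref @ R.T) + t                 # rigidified landmark positions
```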
A repeated calculation of the kinematic parameters was also performed to assess the precision of the method, in terms of repeatability and sensitivity to the landmarks manually identified in the first frame. For each vertebra, not only the four landmark pixels selected by the operator, but also each possible combination of the 8 neighbouring pixels was considered as input. This produced 9^4 (6561) input sets for analysis. The kinematic parameters automatically extracted from these inputs were analysed statistically.

3. RESULTS

Six 512x512-pixel, 256-grey-level images, extracted from a coronal sequence of the bending of the calibration model, have been used to assess the accuracy of the measurements. The kinematic parameters computed are the Intervertebral Angle, reported in Table 1, and the Instantaneous Centre of Rotation, shown in Fig. 4 as white dots.

file name   C1.tif   C2.tif    C3.tif    C4.tif    C5.tif    C6.tif
rotation    -10°     -5°       0°        5°        10°       15°
step                 1st       2nd       3rd       4th       5th
IVA                  5.15°     5.14°     5.11°     5.34°     4.96°

Table 1: Measured intervertebral angles (IVA) for the successive 5-degree calibration steps.
Figure 3" An image of the coronal videofluoroscopic sequence of the calibration model
Figure 4: Enlargement of the universal joint with the computed ICR throughout the sequence
Typical parameter distributions, computed using all the possible combinations of the landmark neighbouring pixels in two subsequent frames of the calibration model coronal sequence, are presented in Fig. 5 and Fig. 6. Table 2 shows the means and standard deviations of the parameter distributions. They refer to the six possible landmark couples on which the rigid kinematics can be computed.

couple of    Vertebral Angle      X coordinate of      Y coordinate of
landmarks    [degree]             ICR [pixel]          ICR [pixel]
             Mean      STD        Mean      STD        Mean      STD
1            -5.69     0.05       242.35    0.59       302.62    2.00
2            -5.69     0.06       242.35    0.95       302.61    2.35
3            -5.69     0.06       242.47    2.20       302.60    2.02
4            -5.69     0.08       242.33    1.42       302.62    1.80
5            -5.69     0.10       242.37    1.35       302.65    2.70
6            -5.69     0.05       242.35    0.87       302.61    1.71

(1 pixel corresponds to 0.25 mm)

Table 2: Means and standard deviations of the kinematic parameter distributions for the six landmark couples.
Figure 5: A vertebral angle distribution [degrees].
Figure 6: An ICR distribution [pixels].
4. CONCLUSION

The described method automates the tedious and imprecise manual landmark selection, providing the computation and visualization of the kinematic parameters. Error analysis suggests that this method improves the accuracy of the intervertebral kinematics calculation. The use of tracking algorithms and an extensive error analysis could improve the processing speed and the feature extraction capability.
The procedure has been applied to image sequences of healthy and pathological subjects and the extracted kinematic parameters are under study. Future work will be based on a priori knowledge of the three-dimensional anatomic structure of the vertebrae, attempting to provide a full three-dimensional motion analysis.
Acknowledgement

We would like to thank the University of Naples "Federico II" (Programme of International Exchange) for generously funding part of the research carried out at the University of Southampton by Italian researchers.
REFERENCES
1. Breen, A. C., Brydges, R., Nunn, H., Kause, J. and Allen, R.: Quantitative Analysis of Lumbar Spine Intersegmental Motion. European Journal of Physical Medicine and Rehabilitation, vol. 3, n. 5, Dec. 1993.
2. Panjabi, M.: Centers and Angles of Rotation of Body Joints: A Study of Errors and Optimization. Journal of Biomechanics, 1979, 12, 911-920.
3. Panjabi, M., Chang, D., Dvorak, J.: An Analysis of Errors in Kinematics Parameters Associated with in vivo Functional Radiographs. Spine, 1992, 2, 200-205.
4. Simonis, C., Allen, R., Breen, A.: Rigid Model Fitting Technique: an Alternative in the Selection of Landmarks on Spinal Images. Proceedings of the V Symposium on Biomedical Engineering, Santiago de Compostela, 1994, 2, pp. 103-104.
An evaluation of the auditory cortex response to simple non-speech stimuli through functional MRI

A. Pepino a, E. Formisano a, F. Di Salle b, C. Saulino c, M. Bracale a

a Dept. of Electronic Engineering, University of Naples "Federico II", Via Claudio 21, 80125 Napoli, Italy
b Radiological Clinic, University of Naples "Federico II", Via Pansini 5, 80100 Napoli, Italy
c Audiological Unit, Dept. of Neurosciences and of Human Communication, University of Naples "Federico II", Via Pansini 5, 80100 Napoli, Italy
1. INTRODUCTION
Functional Magnetic Resonance Imaging (fMRI) is a new tool for the exploration of brain functions: images of activated brain areas are formed by detecting the indirect effects of neural activity on local blood volume, flow and oxygen saturation. Oxygen delivery, cerebral blood flow and cerebral blood volume all increase with local activation, induced by a sensorimotor or cognitive stimulus. This leads to a decreased deoxyhemoglobin concentration and, because of the paramagnetic nature of deoxyhemoglobin, to a lower local magnetic susceptibility, compared to a resting state (i.e. a state of no local activation) [1]. This effect may affect the signal intensity of Magnetic Resonance images if T2*-weighted sequences are used for imaging: indeed, for these images the grey level of a voxel depends on the intravoxel magnetic field homogeneity. Specifically, after a delay of 4-8 seconds from the onset of the stimulus, an increase of the signal intensity is observed for the voxels involved in the locally induced cerebral activity. Experimental results showed that the amplitude of the signal changes between the resting and the activation states increases with the intensity of the longitudinal magnetic field. Thus, fMRI results are improved by a high static field intensity (1.5 Tesla or more) [2],[3],[4]. This study attempted to identify the specific brain areas activated by a simple non-speech acoustic stimulus. The importance of the localization of these specialized cortical regions is related to: 1) understanding the different roles of the auditory cortex in the processing of simple as well as more complex sounds (like words, etc.);
2) the possibility of a correct intrahemispheric distinction between primary and secondary auditory cortex and a precise mapping of their involvement; 3) the lateralization of primary cortical perception; 4) the study of the tonotopical organization of the human auditory cortex. fMRI serves these aims with very interesting features, namely a higher spatial and temporal resolution compared to Positron Emission Tomography (PET), and a much higher spatial resolution compared to neuro-electrical brain mapping methods. Furthermore, using fast MRI techniques, such as Echo Planar Imaging (EPI), the collection of data for the reconstruction of a planar image is faster. Therefore, it is possible to follow, with a sufficient temporal resolution, the time-course of the slow hemodynamic response over a multislice region of interest [4].
2. METHOD

2.1 Subjects and acoustic stimulus

The present study has been conducted on ten healthy young volunteers, who had previously undergone audiometric examination with normal results, by means of a high-field (1.5 Tesla) MR superconducting unit (MAGNETOM Vision, Siemens Medical System, Erlangen, Germany) equipped with an actively shielded gradient coil (25 mT/m), a standard head coil and an echo-planar device which allowed the complete acquisition of a 64 x 64 pixel image (interpolated to 128 x 128) in 123 msec. Auditory stimuli were played at precise intervals using a waveform generator (EM 2 Audiostimulator Mercury) and delivered to the subject via air conduction through a rubber tube. The sound conducting tube was 10 meters long, and at the subject end a Y-connector split the tube for binaural stimulation through a tightly fitting headset with occlusive earplugs to further reduce scanner noise effects. The reported results refer to auditory stimulation with a pulsed tone with a centre frequency of 1500 Hz (3 ms rise time, 34 ms plateau and 54 ms ISI). The sound pressure level at the end of the pipe is 100 dB, about 20 dB over the level of the background scanner noise.

2.2 Image Acquisition

The scanning procedure began with the acquisition of conventional 256 x 256 pixel T1-weighted images to be used as anatomical reference for the explored brain function. Ten slices, with 5 mm thickness and a field of view (FOV) of 180 x 180 millimeters, leading to voxel dimensions of 0.7 x 0.7 x 5 mm, were located along oblique planes, parallel to the plane crossing the anterior and posterior white commissures. A Gradient Echo (GRE) sequence, with TR = 500 msec, TE = 16 msec, FA = 90°, was used for the acquisition of the anatomical volume, in order to have a good cerebro-spinal fluid (CSF) / nervous tissue differentiation.
Subsequently, a series of 64 functional volumes of 10 slices, with the same geometrical parameters as the anatomical reference, was collected using a GRE echo-planar sequence, with TE = 66 msec, FA = 90° and an interscan temporal spacing (TR) of 4 seconds. The functional series began with 4 baseline images (a 16-second interval) allowing magnetic resonance signal equilibrium to be reached, followed by 60 images during which activation alternated with baseline every 5 acquisitions, i.e. 20 seconds (10 images per cycle, 40 seconds per cycle, six cycles).
2.3 Data Analysis

The cerebral regions responding to the stimulus were identified by an algorithm that correlates the data, obtained from the multiple alternating periods of baseline and activation, with a periodic reference vector based on the stimulation paradigm [5]. Activation maps were formed by giving every pixel the value of the cross-correlation coefficient between the pixel's intensity time-course and a "box-car" ideal vector, which is "OFF" during the resting period and "ON" during stimulation. The possible different delays of the induced signal enhancement from the stimulus application were taken into consideration by shifting the reference waveform by either one or two samples (4 or 8 seconds) and then taking, in the activation map, the maximum of the cross-correlation coefficients evaluated over all the considered shifts. Activated pixels were selected by imposing a threshold on the obtained coefficient and a minimum spatial extension for the regions. Only those pixels with an associated value greater than a threshold (|cc| > 0.45, corresponding to a significance level of P < .0005 for a time-series of 60 samples, assuming the noise to have a Gaussian distribution about the reference vector) and belonging to a cluster of at least 4 pixels verifying the same condition were considered as responding to the stimulus. The latter assumption allows us to consider only functionally homogeneous cortical regions with a minimum planar extension of 8 mm² and, while only slightly reducing the sensitivity of the method in detecting activated areas, leads to a higher robustness to physiological noise. The obtained maps were then converted to polychromatic images and, following interpolation to 256 x 256 matrices, were superimposed onto the anatomical reference images to give visual information on the relation between anatomy and function.
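A minimal sketch of this correlation analysis is given below. It assumes the activation blocks start immediately after the 4 baseline images (the exact phase of the paradigm is not fully specified above), and it omits the 4-pixel cluster criterion; all names are ours.

```python
import numpy as np

def activation_map(series, n_baseline=4, half_cycle=5, shifts=(0, 1, 2), thr=0.45):
    """Correlate each voxel time-course of `series` (T, H, W) with a
    'box-car' reference (ON during stimulation, OFF during rest), trying
    several delays and keeping the maximum |coefficient| per voxel."""
    T = series.shape[0]
    ref = np.zeros(T)
    for t in range(n_baseline, T):
        # assumption: activation first, alternating every 5 acquisitions
        ref[t] = 1.0 if ((t - n_baseline) // half_cycle) % 2 == 0 else 0.0
    x = series.reshape(T, -1).astype(float)
    x -= x.mean(axis=0)
    cc_best = np.zeros(x.shape[1])
    for s in shifts:                        # 0, 4 or 8 s hemodynamic delay
        r = np.roll(ref, s)
        r -= r.mean()
        cc = (r @ x) / (np.linalg.norm(r) * np.linalg.norm(x, axis=0) + 1e-9)
        cc_best = np.maximum(cc_best, np.abs(cc))
    return cc_best.reshape(series.shape[1:]) > thr   # thresholded map
```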
3. RESULTS

The EPI images showed a good signal-to-noise ratio (SNR), without evident head motion or noticeable artifacts. All subjects showed significant activation in broad areas of the primary auditory cortex, precisely located on the surface of the Superior Temporal Gyrus (Fig. 1). Maximum values of the coefficient were reached, in almost every region, when the reference waveform was shifted by one sample, in accordance with the expected latency of the hemodynamic response of 4-8 seconds. The functional activation determined a signal increase of the selected pixels, during stimulation, ranging from 5% to 10% over the baseline signal value. In Fig. 2 the time-course of the average of the positively activated pixels is shown. Minor activation foci, characterized by a lower signal increase, a worse correlation index and a reduced spatial extension, were noticed at the level of the calcarine region, of the frontal cortex and of the basal ganglia. In some subjects, some negatively correlated pixels were observed in the immediate contiguity of the activated Heschl gyri or at the level of non-activated regions of the insular cortex and of the striate and extrastriate occipital cortex.
Figure 1. Activated areas of the primary auditory cortex for a modulated tone at 1500 Hz (in white).
Figure 2. Average time-course of "hot pixels" in an activated area (percent signal change vs. time in seconds).
4. DISCUSSION

The auditory cortex has been analysed by several methods, including cytoarchitectural mapping [6], behavioural-anatomical correlation in brain-injured persons [7], evoked potentials [8], and radionuclide blood flow studies [9]. The application of fMRI to the evaluation of the human auditory cortex could appear ideal, particularly because of its high spatial resolution potential, the absence of radiation hazard and the easy repeatability of the examination. The main drawback of this application is the need for a good suppression of the background noise, which has a very high amplitude and a broad frequency range, especially in the case of echo-planar acquisitions. The background noise can induce a constant stimulation of the auditory areas, thus affecting fMRI results via a double mechanism: the induction of a neurosensorial masking effect on the tone perception, and a vascular response impairment. The former is based on the lower responsiveness of either the receptors or the post-receptorial pathway if the frequencies of the noise overlap the stimulus. The latter might exist even in the absence of frequency overlapping, and can be traced back to the presence of a common vascular supply between the neurosensorial systems devoted to slightly different tones. On the basis of these considerations, background noise suppression appears to be a primary problem for the neuroacoustical applications of fMRI. Although the best solution to this problem is probably the development of active noise control, we obtained a good average attenuation (20 dB SPL) by using closely fitting headphones, as also suggested by others [10]. In this study we have verified the reliability of an fMRI study of the acoustic cortex, in agreement with a previous report focused on the brain processing functions induced by more complex acoustic stimuli [10]. Considering the background scanner noise as constant throughout all baseline and activation periods, the application of a non-speech stimulus induces an additional neuronal activity which can be easily localized by means of a simple correlation method and which can reach a good reproducibility, being effective even for clinical purposes. Although the reference vector used for the detection of activation is only a rough approximation of the induced signal enhancement, it is useful for a fast identification of the voxels whose time-course is characterized by a strong periodic component with the same period as the stimulus. Furthermore, the use of a minimum spatial extension, joined to the imposed intensity threshold, allows the formation of brain-function images that are almost artifact-free, at least when the Echo Planar Images are of good quality and are not affected by gross motion of the subject's head. Improved results will come from the use of a re-alignment algorithm to remove movement-related effects from the functional data [11] and from the use of a more precise model of the hemodynamic response [12]. Our data also suggest that a precise analysis of both the spatial and the temporal features of fMRI activation could be performed, leading to a better knowledge of the tonotopical organization of the human auditory cortex and to the extraction of a
temporal intra- and interhemispheric network, responsible for the further processing of the auditory information.
REFERENCES
1. Ogawa S, Lee T, Nayak A, Glynn P. Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high magnetic fields. Magn Reson Med 1990; 14: 68-78.
2. Belliveau J.W., Kennedy D.N. Jr, McKinstry R.C., Buchbinder B.R., Weisskoff R.M., Cohen M.S., Vevea J.M., Brady T.J., Rosen B.R. Functional mapping of the human visual cortex by Magnetic Resonance Imaging. Science 1991; 254: 716-719.
3. Kwong K.K., Belliveau J.W., Chesler D.A., Goldberg I.A., Weisskoff R.M., Poncelet B.P., Kennedy D.N., Hoppel B.E., Cohen M.S., Turner R., Cheng H.M., Brady T.J., Rosen B.R. Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation. Proc. Natl. Acad. Sci. USA 1992; 89: 5675-5679.
4. Kwong K.K. Functional Magnetic Resonance Imaging with Echo Planar Imaging. Magnetic Resonance Quarterly 1995; 11: 1-20.
5. Bandettini P.A., Jesmanowicz A., Wong E., Hyde J.S. Processing strategies for time-course data sets in fMRI of the human brain. Magn Reson Med 1993; 30: 161-173.
6. Galaburda A., Sanides F. Cytoarchitectonic organization of the human auditory cortex. J Comp Neurol 1980; 190: 597-610.
7. Tanaka Y., Yamadori A., et al. Pure word deafness following bilateral lesions: a psychophysical analysis. Brain 1987; 110: 381-403.
8. Romani G.L., Williamson S.J., et al. Characterization of the human auditory cortex by the neuromagnetic method. Exp Brain Res 1982; 47: 381-393.
9. Mazziotta J.C., Phelps M.E., Carson R.E., Kuhl D.E. Tomographic mapping of human cerebral metabolism: auditory stimulation. Neurology 1982; 32: 921-937.
10. Binder J.R., Rao S.M., Hammeke T.A., Yetkin F.Z., Jesmanowicz A., Bandettini P.A., Wong E.C., Estkowski L.D., Goldstein M.D., Haughton V.M., Hyde J.S. Functional Magnetic Resonance Imaging of Human Auditory Cortex. Ann Neurol 1994; 35: 662-672.
11. Friston K.J., Williams S., Howard R., Richard S., Frackowiack J., Turner R. Movement-related effects in fMRI time-series. Magnetic Resonance in Medicine 1996; 35: 346-355.
12. Friston K.J., Jezzard P., Turner R. Analysis of functional MRI time-series. Human Brain Mapping 1994; 30: 161-173.
G. MOTION ESTIMATION
Temporal Prediction of Video Sequences Using a Region-Based Image Warping Technique

N. Herodotou and A.N. Venetsanopoulos

Digital Signal & Image Processing Laboratory, Department of Electrical and Computer Engineering, University of Toronto, M5S 3G4, Toronto, CANADA, E-mail: [email protected], URL: http://www.comm.toronto.edu/~dsp/dsp.html

A region-based image warping technique is introduced for the temporal prediction of video sequences. At the encoder, a set of control points is selected from the previous frame and their corresponding best-matched points are determined from the current frame. The selection process is achieved by segmenting the previous frame into different regions using a colour segmentation and thresholding technique; the control points are subsequently chosen along region boundaries. The spatial offsets of these points between the previous and current frame are represented as motion vectors. At the decoder, the same control point selection algorithm is used with the motion vectors in order to find the region boundaries of the predicted frame. An affine transformation is finally used to determine the remaining points and form the predicted frame. This technique produces results that are free from the blocking artifacts of the conventional block matching method, and requires less overhead information to be transmitted, with only a moderate increase in computational complexity.

1. Introduction

Image compression schemes can significantly reduce the bandwidth and storage requirements of digital video by effectively taking advantage of the spatial and temporal redundancies in the data. Many of these techniques utilize motion compensation to remove the temporal correlation that exists between frames, while employing transform or waveform coding methods to reduce the spatial redundancies. As a result, several international standards such as H.261 and MPEG-1/MPEG-2 have been developed for video coding purposes. Recent advances in mobile communications and Internet-related technologies have led the way to newly emerging applications such as mobile video communications, video e-mail, and video databases. These applications in particular demand coding techniques at very low bit-rates (< 64 kbit/s), whereby frame rates on the order of 5-10 frames/sec are transmitted or stored. In this case, the decoder at the receiving end must adequately reconstruct the skipped frames from the available ones in order to yield satisfactory motion
rendition. An effective motion compensation scheme must be used to predict these missing frames. Conventional motion compensated prediction methods rely on standard block matching approaches, where displacement vectors are estimated over rectangular blocks of the image. This approach is favoured due to its simplicity; however, it fails to adequately model object motion which is non-translatory (i.e. object rotation, deformation, or change of scale). This scheme also suffers from annoying blocking artifacts when components of an image feature are assigned different motion vectors. In order to alleviate these problems, several approaches based on digital image warping have been introduced in the past [1-4]. In these schemes, the predicted frames (also referred to as the current frames) are formed by geometrically transforming or "warping" the previous frames. These methods have been shown to yield better prediction results than the conventional block-based approaches [1]. In this paper, we employ this warping concept along with the techniques of image analysis to form the predicted frame. This is achieved by segmenting the previous frame into arbitrarily shaped regions based on the colour information and selecting a set of control points to represent these regions. Each of these regions is subsequently transformed to form the predicted image. Thus, the motion compensated warping prediction scheme described here consists of three stages: i) segmentation and control point selection, ii) motion vector assignment of the selected control points, and iii) image warping of the previous frame.

2. Control Point Selection
The first step in the coding system described above is the selection of a suitable set of control points in the previous frame, so that they may be used to predict the current frame. Both the previous and current frames are available at the encoder, while at the decoder side only the former of the two is available. Thus, the function of the encoder is to select a set of control points so that the previous frame is partitioned into a set of regions that can be "warped" into the corresponding set of regions in the current frame. The spatial offsets of these control points (i.e. motion vectors) are determined by using the previous and current frames. The encoder then transmits the previous frame along with the computed motion vectors. The decoder finally receives the previous frame and uses the same control point selection algorithm as the encoder, along with the motion vectors, to extract these points and spatially shift them to their appropriate positions in the predicted frame. The overhead in the form of side information in this approach consists of only the transmitted motion vectors, as in the conventional block matching method. In [1], the above scheme is known as a forward matching technique and has the advantage that the control points can be selected based on the contents of the image. A simple way of selecting the control points is to form a rectangular mesh that partitions the previous frames into a uniform set of non-overlapping blocks. However, a uniform spacing of the selected control points can lead to inaccurate motion vectors, which can cause geometric distortions. In order to prevent this from happening, the selection algorithm chooses control points that reside on the edges of segmented regions. The techniques of colour segmentation and thresholding [5] are first used to segment the image into the appropriate regions. Here we focus our attention on a specific application of a
head-and-shoulders videoconferencing scene in a controlled environment (i.e. lighting, background), where an image database contains all the relevant colour information of the different persons/objects that are to appear in the scene (i.e. background, clothing, facial skin colour, etc.). Proper thresholding along with pre- and post-processing techniques can lead to quite acceptable segmentation results. The novel approach taken in [5] is also used here for facial feature extraction; that is, pixels within the facial region that do not fall within a specified region of chromaticity (i.e. corresponding to colours of skin) are classified as non-skin-coloured. This allows us to segment the eyes, mouth, and nostril areas. Once the image is segmented into the appropriate regions, a sufficient number of control points is then selected along the edges of these regions. In order to simplify the selection algorithm we choose every 5th pixel on each region border and also place a control point at the centre of mass (COM) of each region. In this way, each region can be broken up into triangular patches, formed by the edge points and the COM, which can be individually "warped" via transformation equations. Control points are also selected at all corners and midpoints of each side of the image frame. These points, however, are stationary and are not spatially offset in the predicted frame.
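A minimal sketch of this selection step is shown below; note that, as a simplification, the boundary pixels are taken in array order rather than traced along the contour, and the function name is ours.

```python
import numpy as np

def control_points(region_mask, step=5):
    """Select every `step`-th boundary pixel of a colour-segmented region,
    plus its centre of mass (COM), as warping control points.
    `region_mask` is a binary array marking one region."""
    m = region_mask.astype(bool)
    interior = np.zeros_like(m)
    interior[1:-1, 1:-1] = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                            & m[1:-1, :-2] & m[1:-1, 2:])
    boundary = np.argwhere(m & ~interior)   # 4-neighbour erosion residue
    com = np.argwhere(m).mean(axis=0)
    return boundary[::step], com
```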
3. Motion Vector Assignment

The assignment of motion vectors to each selected control point is once again determined at the encoder, where both the previous and the current frames are available. A block matching technique similar to the one in [1] is used here, with a 21 x 21 window in which the central pixels are more heavily weighted. Unlike the scheme in [1], however, the square block used in the matching process here is not subsampled. Further to this, the mean squared error (MSE) criterion using the Euclidean distance measure is utilized, due to the colour information. The best match of each selected control point is determined by finding the minimum MSE value within a search space of ±15 pixels. These motion vectors are transmitted to the decoder as overhead information.

4. Image Warping

When the decoder receives the previous frame along with the transmitted motion vectors, it must predict the current or skipped frames. This is accomplished by "warping" the triangles in each of the regions of the previous frame to the corresponding triangles in the predicted frame. The vertices of the triangles in the predicted frame are found by using the same control point selection algorithm as in the encoder and spatially shifting these appropriately using the motion vector information. Once these are found, the triangle-to-triangle mapping follows, using an affine transformation [6]. As a result, the points within each triangle are geometrically transformed to their corresponding positions. Bilinear interpolation is used when non-integer positions are found.
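The sketch below illustrates one way to realize this step: the 6-parameter affine map is solved from the three vertex correspondences of a triangle, and each predicted-frame pixel is backward-mapped into the previous frame and interpolated bilinearly. The names and the backward-mapping formulation are our assumptions.

```python
import numpy as np

def affine_from_triangles(dst_tri, src_tri):
    """Solve the 6-parameter affine transform mapping a triangle in the
    predicted frame (dst_tri) back to its counterpart in the previous
    frame (src_tri); both are (3, 2) arrays of (x, y) vertices."""
    A = np.hstack([dst_tri, np.ones((3, 1))])        # rows [x y 1]
    M, *_ = np.linalg.lstsq(A, src_tri, rcond=None)  # (3, 2) matrix
    return M

def warp_pixel(M, x, y, prev):
    """Backward-map one predicted-frame pixel into the previous frame and
    interpolate bilinearly at the (generally non-integer) position."""
    H, W = prev.shape[:2]
    sx, sy = np.array([x, y, 1.0]) @ M
    x0 = int(np.clip(np.floor(sx), 0, W - 2))
    y0 = int(np.clip(np.floor(sy), 0, H - 2))
    ax, ay = sx - x0, sy - y0
    p = prev.astype(float)
    return ((1 - ax) * (1 - ay) * p[y0, x0] + ax * (1 - ay) * p[y0, x0 + 1]
            + (1 - ax) * ay * p[y0 + 1, x0] + ax * ay * p[y0 + 1, x0 + 1])
```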
5. Results

The performance of this scheme was evaluated using frames 1 and 5 of the Claire QCIF sequence. A comparison of the warping prediction was made with the conventional block matching motion compensator. In Figures 1a) and 1b) we present frames 1 and 5 of the original sequence. In Figure 2a) the results of the segmentation and control point selection are shown, while Figure 2b) illustrates the performance of the point matching process. It is also worth noting that prior to the segmentation process, a pre-processing step was performed on frame 1 in which the vector median filter (VM) [7] was used to smooth the image. This was followed by a post-processing stage where the VM was applied again to eliminate any misclassified pixels. The standard block matching results are shown in Figure 3a), where 18 x 18 blocks were used. The blocking artifacts in this figure are clearly evident. The transmission overhead required for this 360 x 288 pixel image amounts to 320 motion vectors. The predicted image using the region-based warping approach is finally shown in Figure 3b) and does not suffer from the annoying artifacts of the conventional case. In addition to the improved subjective quality, only 236 motion vectors are required to be transmitted as side information. Future improvements to this technique can focus on predicting the motion of the eye and mouth areas more accurately.

6. Conclusions

A region-based warping technique was examined for the temporal prediction of video sequences. In this scheme, the encoder determines a set of control points that reside on the edges of different regions obtained by colour segmentation and thresholding. A block matching method is then employed at the encoder in order to assign the appropriate motion vectors to these points. The decoder uses the same control point selection algorithm to spatially offset these points to their proper positions in the predicted frame and "warp" the remaining pixels through an affine transformation. A significant subjective improvement is found in the predicted image using this technique when compared to the conventional block matching approach. Furthermore, the improved visual quality can be achieved at a smaller transmission overhead cost, with only a moderate increase in the computational complexity of the coding scheme.

REFERENCES

1. J. Nieweglowski, J. Campbell, P. Haavisto, 'A novel video coding scheme based on temporal prediction using digital image warping', IEEE Trans. on Consumer Electronics, vol. 39, no. 3, pp. 141-150, 1993.
2. G. Sullivan, 'Motion compensation for video compression using control grid interpolation', IEEE Int. Conf. on ASSP, pp. 2713-2716, 1991.
3. V. Seferidis, M. Ghanbari, 'Generalized block matching motion estimation', Visual Communications and Image Processing, SPIE vol. 1818, pp. 110-119, 1992.
4. J. Nieweglowski, T. Moisala, P. Haavisto, 'Motion compensated video sequence interpolation using digital image warping', IEEE Int. Conf. on ASSP, vol. 5, pp. 205-208, 1994.
5. T.C. Chang, T.S. Huang, C. Novak, 'Facial feature extraction from color images', Proceedings of the 12th International Conference on Pattern Recognition, vol. 3, pp. 39-43, 1994.
6. G. Wolberg, 'Digital image warping', IEEE Computer Society Press, Los Alamitos, California, 1990.
7. J. Astola, P. Haavisto, Y. Neuvo, 'Vector median filters', Proceedings of the IEEE, April 1990.
Figure 1. a) Original Claire 1 frame, b) Original Claire 5 frame.
Figure 2. a) Segmentation and control point selection, b) Matching of control points.
Figure 3. a) Conventional block matching prediction, b) Region-based warping prediction.
High Performance Gesture Recognition Using Probabilistic Neural Networks and Hidden Markov Models

G. Rigoll, A. Kosmala, M. Schuster
Gerhard-Mercator-University Duisburg
Faculty of Electrical Engineering, Dept. of Computer Science
Bismarckstr. 90, D-47057 Duisburg, Germany
e-mail: [email protected]

ABSTRACT

In this paper a fast method for image sequence recognition is presented. The method is based on a discrete statistical model consisting of a vector quantizer and a special probabilistic neural network, which allows the classification of image sequences without applying rules depending on the content of the sequence. The simple feature extraction also allows classification with discrete Hidden Markov Models, and therefore permits a direct comparison between neural network and HMM techniques for gesture recognition. As an application we present results from a test conducted for the classification of various gestures performed by human beings in front of a video camera, for both classification methods, which gave promising recognition results in real time. The system obtained a 90.0% recognition rate for the person-independent classification of 10 different gestures. We recently improved the system substantially by augmenting it for the classification of 15 gestures while keeping the recognition rate of 90%, due to the use of improved Hidden Markov Models. We consider this a surprisingly high rate for such a complex task and believe that our system is among the most powerful gesture recognition systems.
1. INTRODUCTION

Understanding moving image sequences is often done with a complex rule-based model specially designed for the given application [1]. The recognition of moving images can also be understood as a complex dynamic pattern recognition problem, which could therefore possibly be solved with methods used, for example, in speech and hand-written character recognition. This was attempted with the system proposed here. The data corpus of altogether 150 sequence samples was collected in a video recording room at our laboratory. We recorded 10 different gestures (Tab. 1), 3 samples each, performed by 5 different test persons, with a regular video camera and a frame grabber card.
Table 1: Table of used gestures

No.  Gesture        Comment
1.   HAND-WAVING    wave hand "bye"
2.   ROUND          "O" over head
3.   TO-RIGHT       point to right
4.   TO-LEFT        point to left
5.   STOP           both hands
6.   CLAPPING
7.   KOTOW
8.   NOD-YES        only head moving
9.   NOD-NO         only head moving
10.  COME           both hands
2. FEATURE EXTRACTION

The resolution of the raw movies was 192x144 pixels, 24-bit RGB colour space, recorded at 12 frames per second. Each sample was about 130 frames (10 s) long. First, every frame was spatially quantized by taking the mean of a square of 8x8 (16x16) pixel RGB values, for each of the RGB value planes separately. From the original 192x144 pixels this gave new resolutions of 24x18 pixels and 12x9 pixels respectively. Then the subsampled frames were cut into horizontal and vertical stripes. Concerning light conditions (contrast) and camera position, no special feature transformations were used. To capture just the movements in the sequence, we also calculated for every subsampled movie sequence a velocity movie by taking the difference of each of the RGB planes of neighbouring frames on the time scale. From those samples another set of feature vectors was extracted in the same way as for the regular subsampled sequences.
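A minimal sketch of this feature extraction, assuming frames stored as a (T, H, W, 3) array and horizontal stripes as the feature vectors (names and array layout are ours):

```python
import numpy as np

def features(frames, block=8):
    """Subsample each RGB frame by averaging block x block squares per
    colour plane, then form a 'velocity movie' as the temporal difference
    of neighbouring subsampled frames."""
    T, H, W, C = frames.shape
    h, w = H // block, W // block
    sub = frames[:, :h * block, :w * block].reshape(T, h, block, w, block, C)
    sub = sub.astype(float).mean(axis=(2, 4))   # e.g. 24x18 from 192x144
    velocity = np.diff(sub, axis=0)             # movement information only
    # one feature vector per horizontal stripe (row) of the frame
    return velocity.reshape(T - 1, h, -1)
```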
3. CLASSIFIER

The classifier is subdivided into two major parts: a vector quantizer (VQ) for the feature vectors, transforming the feature vectors into a label stream, and a discrete classifier, classifying this label stream.
3.1. Vector Quantizer

As the vector quantizer we used the k-means algorithm [2], with some improvements for high-speed performance, and the LBG algorithm [2], both with the regular Euclidean norm.

3.2. Discrete Classifier

As the discrete classifier we tested two different models: our own probabilistic neural network approach (PN) [3], and a regular discrete Hidden Markov Model (HMM) approach for verification.
3.2.1. PN Model

The probabilistic neural network [3] proposed here has a feed-forward structure with one layer, mapping the discrete labels (X) emerging from the VQ to the different classes (C), here the ten different gestures. The weights are calculated with a one-step algorithm giving an estimate of the a posteriori probabilities P(C|X) at the output neurons during classification. Because the training is done in one step, the training time for the PN classifier (about one second for all 100 (120) training sequences) can be neglected. The recognition procedure can be interpreted as calculating a weighted average of the a posteriori probabilities of the discrete labels determined during training.

3.2.2. HMM Model

Alternatively, we used discrete HMMs for decoding the image sequence VQ streams. HMMs offer superior sequence processing capabilities. For each sequence class there was one discrete HMM, with three states and no skips, trained and re-estimated with Baum-Welch re-estimation, as is widely used for phoneme models in speech recognition, although the chosen HMM structure may not necessarily be a proper structure for the underlying classification problem.
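The exact one-step estimator of [3] is not given above; a plausible reading is that P(C|X) is estimated from label/class co-occurrence counts, and recognition averages these posteriors over the label stream. The sketch below follows that hedged reading, with names of our own.

```python
import numpy as np

def train_pn(label_seqs, classes, n_labels, n_classes):
    """One-step 'training': estimate P(C|X) for every VQ label X by
    counting label occurrences per gesture class (our reading of the
    one-layer probabilistic net; the estimator in [3] may differ)."""
    counts = np.ones((n_labels, n_classes))     # Laplace smoothing
    for seq, c in zip(label_seqs, classes):
        for x in seq:
            counts[x, c] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def classify_pn(posteriors, seq):
    """Average the per-label posteriors over the sequence, pick argmax."""
    return int(np.argmax(posteriors[np.asarray(seq)].mean(axis=0)))
```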
4. EXPERIMENTS & RESULTS 4.1. Experiments We conducted two different tests for the data corpus with different partitions of the corpus taken as training and test data and different feature vector sets. 4.1.1. Test 1 (person dependent) For the first test we used 100 sequences out of the 150 sequences for training (two per person per gesture) and the remaining 50 for recognition, so every person was represented in the training and test data, making it a "person dependent" test. For this test we used the feature vectors extracted out of the regular subsampled movies. 4.1.2. Test 2 (person independent) For the second test we used all three samples of four (chn, jmr, keiko, stella) of the five test persons as training data (120 sequences) and the remaining samples of the last person (gustl) as test data (30 sequences), making it a "person independent" test. Here we used the feature vectors extracted out of the velocity movies, so just the movement information was used for classification.
4.2. Recognition Results

We conducted both tests with various codebook sizes for the k-means and the LBG algorithm, for vertical and horizontal feature vectors. Recognition results on training and test data for both tests, for the k-means VQ with the PN classifier, are shown in Tab. 2. Tests with the HMM classifier gave results similar to Tab. 2 and are not shown here.
Table 2: Recognition results in percent with k-means for 8x8 (16x16) pixel spatial quantization, black & white feature vectors, for the PN classifier (note that the feature extraction for the person-dependent and person-independent tests is different).

CODEBOOK-SIZE                   TEST              TRAIN
100  vert. person dep.          70.00 (80.00)     71.00 (86.00)
300  vert. person dep.          90.00 (92.00)     97.00 (96.00)
500  vert. person dep.          94.00 (92.00)     98.00 (99.00)
1000 vert. person dep.          96.00 (88.00)     100.00 (100.00)
2000 vert. person dep.          98.00 (90.00)     100.00 (100.00)
100  horz. person dep.          76.00 (86.00)     86.00 (85.00)
300  horz. person dep.          88.00 (86.00)     97.00 (98.00)
500  horz. person dep.          92.00 (88.00)     98.00 (98.00)
1000 horz. person dep.          90.00 (90.00)     99.00 (100.00)
2000 horz. person dep.          92.00 (90.00)     100.00 (100.00)
100  vert. person indep.        56.67 (53.33)     69.17 (65.83)
300  vert. person indep.        60.00 (56.67)     85.83 (85.00)
500  vert. person indep.        70.00 (56.67)     91.67 (90.83)
1000 vert. person indep.        70.00 (60.00)     96.67 (97.50)
2000 vert. person indep.        70.00 (60.00)     99.17 (100.00)
100  horz. person indep.        66.67 (66.67)     85.00 (84.17)
300  horz. person indep.        73.33 (90.00)     93.33 (92.50)
500  horz. person indep.        76.67 (90.00)     92.50 (95.83)
1000 horz. person indep.        86.67 (90.00)     98.33 (100.00)
2000 horz. person indep.        86.67 (90.00)     100.00 (100.00)
5. FURTHER IMPROVEMENTS We recently improved our system by increasing the number of gestures from 10 to 15, adding gestures as e.g. "turn around", "turn right", "turn left", etc.. Concentrating our activities on the person-independent recognition mode, the recognition rate for this more complex task dropped from 90% to 80% using the probabilistic neural net recognizer. By using improved HMM's, with a more complex structure and increased discriminative
capabilities, it has been possible to obtain a 90% recognition rate for this complex person-independent image sequence recognition task. This indicates the superior capabilities of HMMs for such a task, and we will therefore concentrate our future activities on this approach. Both the neural network and the HMM approach are implemented in a real-time demonstration system, which is believed to be among the most powerful gesture recognition systems available so far.
6. DISCUSSION
We presented a fast method to recognize video sequences with statistical methods, without implementing special rules depending on the content of the sequence. Surprisingly, the system could learn to classify with good results in the person-dependent and person-independent tests while ignoring large parts of the information included in the movie sequences (for example the ordering of the discrete labels), although classification of the given gestures does not seem to be trivial, especially for the gestures NOD-YES and NOD-NO, where only the head of the person is moving. In the case of a more complicated classification problem we believe it is no problem to add more features to the model to improve its performance. The results show that the feature extraction works quite well. We also observed a tendency towards superior performance for the Markov models compared to the neural network approach. This will be further investigated in the future. Also, the number of samples is to be increased to make the results statistically more reliable and the recognition more robust. With improved feature extraction in the preprocessor, the system proposed here could be a good basis for complex future applications, classifying maybe hundreds or thousands of sequences for person-independent tasks in real time.
REFERENCES
1. Brand, Essa: "Causal Analysis for Virtual Gesture Understanding", Proc. AAAI Fall '95, Symposium on Computational Models for Integrating Language and Vision.
2. Linde, Buzo, Gray: "An Algorithm for Vector Quantizer Design", IEEE Trans. Comm., Jan. 1980.
3. M. Schuster, G. Rigoll: "Fast Online Video Image Sequence Recognition with Statistical Methods", Proc. IEEE-ICASSP-96, May 7-10, Atlanta, GA.
Image segmentation using motion estimation

Klaus Illgner and Frank Müller

Institut für Elektrische Nachrichtentechnik, Aachen University of Technology (RWTH), 52056 Aachen, Germany

In this paper an approach is described for segmenting frames of image sequences into homogeneously moving regions. The segmentation criterion is based on the motion estimate only. Therefore, the model assumption of homogeneously moving regions is included in the motion estimation criterion. The resulting motion estimate is suitable to discriminate differently moving regions. By using a hierarchical approach a segmentation of increasing accuracy is obtained. Further advantages of the approach are robustness and low computational complexity.

1. INTRODUCTION

Segmenting frames of image sequences into differently moving regions is an important issue in many applications. On the one hand there is an increasing interest in region-oriented coding methods. These are especially useful e.g. for multimedia communications, database retrieval, and editing, which is one major topic in MPEG-4. On the other hand the problem is of general interest in the field of image analysis. The aim here is to identify regions related to real-world objects, which share the same motion, rather than to minimize prediction errors. The recording camera maps real-world 3D objects onto 2D regions of the image plane. Assuming rigid objects, which is almost always valid, the corresponding regions are characterized by homogeneous motion. The term homogeneous motion is specified later. Since the motion of different objects varies, the motion is suitable to discriminate differently moving regions. If neighbored objects share the same motion, it is useful for motion-oriented image analysis to merge the corresponding regions. The problem of estimating the segmentation of the image plane into regions with respect to motion is ill-posed. The segmentation depends on the motion description, which needs to be estimated as well and depends itself on the segmentation. The possible solutions depend on the type of motion description. Parametric motion descriptions with few parameters - typically 6 or 8 - are too restrictive to represent complex motion. Therefore, assigning simple motion parameters, e.g. a single displacement vector, to very small atom regions, e.g. a pixel, is more flexible. Homogeneity of the motion means that the motion parameters of neighbored regions are similar, not necessarily equal. The final segmentation is then a composition of these atom regions. The approach developed by Stiller [5] jointly estimates motion and segmentation on the basis of a dense motion vector field, which leads to accurate segmentations and motion
estimates, but is computationally expensive. The aim of the approach described in this paper is to achieve a suitable motion-based segmentation with low computational complexity. Therefore, blocks are used as atom regions and the motion is represented by a single displacement vector. A further simplification is to estimate motion and subsequently the segmentation, and iterate the two steps. However, in contrast to other techniques (e.g. [1]), in the proposed approach an iteration to refine the motion estimate is not necessary. The reason is that a homogeneity constraint for the motion vectors is included in the motion estimation criterion. Since the aim is to achieve a segmentation into homogeneously moving regions, the segmentation is restricted to using the motion vector field only. Therefore, there is no single parameter set to represent the motion of a region.

2. MOTION ESTIMATION

The motion estimation scheme relies on the well-established block matching technique [3]. Its attraction comes mainly from the simplicity and robustness of this technique. The current frame g_n is partitioned into a set of disjoint rectangular blocks {b(i)}, where i denotes a position on a 2D grid. The motion of each block b(i) is described by a single motion parameter, the motion vector g(i), modeling the motion within a block as translational motion. In a coding context the vectors address the most similar image area in the previous frame g_{n-1} according to a distance measure. Although the underlying assumption of translational motion of the image blocks is very restrictive, the technique is also appropriate for image analysis. It is assumed that the motion of the regions is characterized as homogeneous motion. Regarding the regions as a composition of atom areas, homogeneous motion means that the motion parameters of these small atom areas do not differ very much from the parameters of neighbored areas. The further assumption, that the complex motion of the regions can be locally approximated as translational, will hold if the atom areas are sufficiently small compared to the whole region to be described. Consequently, a motion vector field obtained for small rectangular blocks is suitable to approximate the motion of regions. Therefore, in contrast to the aim in a coding context, the motion estimation criterion must be designed such that the homogeneity of the motion vectors is considered. The statistically formulated approach developed by Stiller [4] in a coding context minimizes for each block b(i) the displaced frame difference, constrained by the similarity of the motion vector g(i) to the neighbored vectors. Hence, this criterion is also suitable for image analysis, but is interpreted slightly differently. The frames g_n, g_{n-1} are defined at the positions x = (x, y) of the lattice L = {(x, y) | x = 0,...,N_x - 1, y = 0,...,N_y - 1}. Furthermore, a partition {b(i) | i ∈ B} of the frames into blocks of size B_x x B_y is defined on the sublattice B ⊂ L. The vector field is denoted by {g(i)}. As a measure for the local similarity of the image contents the mean squared error

$$e(i) = \sum_{x \in b(i)} \mathrm{DFD}^2(x, g(i)) \qquad (1)$$

of the displaced frame difference

$$\mathrm{DFD}(x, g(i)) = \left| g_n(x) - g_{n-1}(x - g(i)) \right| \qquad (2)$$
is used. Since the assumption of pure translational motion within an image block is very simple, using overlapped block motion compensation [2] instead of (1),

e(i) = \sum_{x \in b(i)} \left( \sum_{j \in N_i} w_x(j) \, DFD(x, g(j)) \right)^2    (3)
leads to a more reliable criterion. Furthermore, the dependency on neighboring blocks strengthens the smoothness of the vector field. The similarity between neighboring motion vectors is formulated as the weighted sum of the vector differences

C(i) = \sum_{j \in N_i} c_j \, |g(i) - g(j)|    (4)

taking into account the vectors of a second-order neighborhood system N_i. The weight c_j equalizes the influence of horizontal/vertical and diagonal neighbors. Combining both aspects (1), (4) according to [4] results in the motion estimation criterion

g(i) = \arg\min_{g \in V} \{ \log(e(i, g)) + \lambda \cdot C(i, g) \}, \quad \lambda = const.    (5)
which is evaluated for each block b(i). Instead of a full search, the vectors are selected from a set of test vectors V. The weighting function log(·), which comes from the statistical formulation, decreases the influence of small differences caused, for instance, by noise. The factor \lambda controls the strength of the smoothing. Since the aim, according to the assumptions, is a high homogeneity of the motion vector field, while a minimized DFD is of secondary interest, a higher value of \lambda is chosen than in a coding environment. Due to the dependencies between neighboring vectors, (5) cannot be evaluated for each block independently. Therefore, to find the optimum solution, deterministic relaxation is used. The block partition of the frame is subdivided into 4 sublattices. When optimizing one sublattice, each block of that sublattice is regarded as statistically independent due to the model and can be optimized without considering other blocks. Constraining the motion estimate to the motion of the neighboring blocks results, within a few iterations, in a motion vector field which is smooth within regions and still preserves edges between differently moving regions. This property is employed in the following section to segment frames into differently moving regions; a sketch of the relaxation scheme follows.
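The following minimal Python sketch illustrates this scheme (our illustration, not the authors' implementation); the block size, search range, and the value of λ are assumptions chosen for readability, and a mean squared error stands in for the sum in (1):

```python
import numpy as np

def estimate_motion(prev, curr, block=16, search=7, lam=0.5):
    """Sketch of criterion (5) with deterministic relaxation: the block grid is
    split into four sublattices that are visited in turn, and each block picks
    the test vector minimizing log(e) + lambda * C.  prev and curr are 2-D
    grayscale frames whose sides are multiples of the block size."""
    H, W = curr.shape
    gh, gw = H // block, W // block
    mv = np.zeros((gh, gw, 2), dtype=int)                 # one vector per block
    tests = [(dy, dx) for dy in range(-search, search + 1)
                      for dx in range(-search, search + 1)]
    for _ in range(3):                                    # a few relaxation sweeps
        for oy in (0, 1):                                 # the 4 sublattices
            for ox in (0, 1):
                for by in range(oy, gh, 2):
                    for bx in range(ox, gw, 2):
                        cur = curr[by*block:(by+1)*block, bx*block:(bx+1)*block]
                        nbrs = [mv[y, x] for y, x in
                                ((by-1, bx), (by+1, bx), (by, bx-1), (by, bx+1))
                                if 0 <= y < gh and 0 <= x < gw]
                        best, best_cost = mv[by, bx], np.inf
                        for dy, dx in tests:
                            y0, x0 = by*block - dy, bx*block - dx
                            if not (0 <= y0 and y0+block <= H and 0 <= x0 and x0+block <= W):
                                continue
                            ref = prev[y0:y0+block, x0:x0+block]
                            e = np.mean((cur.astype(float) - ref) ** 2)        # Eq. (1)
                            C = sum(np.hypot(dy - n[0], dx - n[1]) for n in nbrs)  # Eq. (4)
                            cost = np.log(e + 1e-6) + lam * C                  # Eq. (5)
                            if cost < best_cost:
                                best_cost, best = cost, np.array((dy, dx))
                        mv[by, bx] = best
    return mv
```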
3. SEGMENTATION OF MOTION VECTOR FIELDS
Due to the basic assumption, the motion vectors are expected to be only locally similar. Motion vectors belonging to the same region can be very different, depending on the spatial distance. Therefore, segmentation criteria which cluster the data around an average parameter vector are not feasible. Furthermore, the task is not to find a single parameter describing the motion of the region. Instead, the local similarity between neighboring vectors, as used also in the motion estimation criterion, is employed. Two neighboring motion vectors are elements of the same region if the amplitude of their difference vector \Delta g(i, j) is less than a fixed threshold T_s:

|\Delta g(i, j)| = |g(i) - g(j)| < T_s, \quad j \in N_i    (6)
The decision is bound to vectors of the first-order neighborhood system N_i. Implementing this segmentation criterion requires only two scans of the motion vector field. In the first scan, motion vectors can only be labeled on the basis of their causal neighborhood. In the second scan, the segmentation is verified using the complete neighborhood; each relabeling obviously requires a relabeling of the entire segment. A sketch of this two-scan labeling is given below.
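In the following sketch (our illustration), a small union-find structure stands in for the explicit relabeling of entire segments described in the text:

```python
import numpy as np

def segment_vector_field(mv, T):
    """Sketch of the two-scan segmentation implied by Eq. (6): a causal first
    scan labels each vector from its already-visited upper/left neighbors and
    records label equivalences; a second scan resolves them to final labels.
    mv is an (H, W, 2) array of motion vectors, T the similarity threshold."""
    H, W, _ = mv.shape
    labels = np.zeros((H, W), dtype=int)
    parent = {}                                        # label -> representative

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]              # path halving
            a = parent[a]
        return a

    nxt = 1
    for y in range(H):                                 # first (causal) scan
        for x in range(W):
            cands = []
            for ny, nx_ in ((y - 1, x), (y, x - 1)):   # causal neighbors
                if ny >= 0 and nx_ >= 0:
                    if np.linalg.norm(mv[y, x] - mv[ny, nx_]) < T:   # Eq. (6)
                        cands.append(find(labels[ny, nx_]))
            if not cands:
                parent[nxt] = nxt
                labels[y, x] = nxt
                nxt += 1
            else:
                labels[y, x] = cands[0]
                for c in cands[1:]:                    # record equivalences
                    parent[find(c)] = find(cands[0])
    for y in range(H):                                 # second scan: final labels
        for x in range(W):
            labels[y, x] = find(labels[y, x])
    return labels
```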
3.1. Hierarchical Segmentation Refinement

The size of the blocks determines the accuracy of the motion estimate and hence also of the segmentation. Furthermore, the motion estimate needs to be refined at the boundaries of the segmentation, since the content of a block might belong to two differently moving regions. However, motion estimation using small block sizes is sensitive to noise. Therefore, the estimates are refined in a hierarchical fashion, which also reduces the computational load. First a coarse-resolution vector field is calculated using large block sizes. The image is segmented into regions of homogeneous motion according to this motion estimate. At the refinement levels the block sizes are halved. The motion estimates are initialized by the coarse motion vector field. Motion estimation needs to be refined only along the contours of the segmentation obtained at the next lower resolution level.

4. SIMULATION RESULTS

The feasibility of the approach was verified with several parameter sets on motion vector fields with varying smoothness and complexity. As an example of strong local motion the sequence salesman was used; in contrast, the sequence foreman contains strong global motion. On the top left of Fig. 1 the motion vector field of frame #6 from the sequence foreman in CIF is shown, obtained with a block size of 16 × 16 pel. The corresponding segmentation is depicted on the top right. The lower row shows the refinement of the central region with 8 × 8 blocks along the segmentation boundaries. In Fig. 2 the vector field and the segmentation for frame #16 of the sequence salesman are shown. Mapping the regions onto the original frame (Fig. 3) shows a good match with the contours of the object.

5. SUMMARY

The outlined algorithm segments frames of image sequences into homogeneously moving regions. Although the segmentation is based on a motion vector field only, the scheme leads to accurate segmentation results. Furthermore, due to the smoothing property of the motion estimation criterion, an iteration between motion estimation and segmentation is not necessary. Hence, the computational complexity is low, also because a block-based approach is used.
Figure 1: Estimated motion vector field (top left) and segmentation (top right) of frame #6 from the sequence foreman. Bottom row: coarse segmentation (left) refined (right) along the segmentation boundaries.
Figure 2: Estimated motion vector field (left) and segmentation (right) of frame #16 from the sequence salesman.
Figure 3: Mapping of all regions onto the original frame.

The accuracy of the regions depends on the block size, which can easily be increased using a hierarchy of block sizes. Since block-based motion estimation schemes are common in image sequence coding, this approach also allows region-oriented coding in standard coding environments (MPEG, H.263). The shape can be described efficiently, e.g. by a quadtree structure. Note that although the description is based on a block-oriented approach, the principle is not restricted to block structures.

REFERENCES
1. T. Ebrahimi, H. Chen, and B. G. Haskell. Joint motion estimation and segmentation for very low bitrate video coding. In Proceedings SPIE Conference on Visual Communications and Image Processing, vol. 2501, pp. 787-798, Taipei, Taiwan, May 1995.
2. K. Illgner and F. Müller. Motion estimation using overlapped block motion compensation and Gibbs-modeled vector fields. In Proc. 9th Workshop on Image and Multidimensional Signal Processing (IMDSP 96), pp. 126-127, Belize City, Belize, March 1996.
3. J. R. Jain and A. K. Jain. Displacement measurement and its application in interframe image coding. IEEE Trans. Commun., 29(12):1799-1808, December 1981.
4. C. Stiller. Motion estimation for coding of moving video at 8 kbit/s with Gibbs modeled vectorfield smoothing. In Proceedings SPIE Conference on Visual Communications and Image Processing, pp. 468-476, Lausanne, Switzerland, October 1990.
5. C. Stiller. Object-oriented video coding employing dense motion fields. In Proceedings IEEE Intern. Conf. on Acoustics, Speech and Signal Processing ICASSP'94, Adelaide, Australia, April 1994.
A Phase Correlation Technique for Estimating Planar Rotations*

L. Lucchese, G.M. Cortelazzo and M. Rizzato

Department of Electronics and Informatics, University of Padova, Italy

*This work was supported by European Community Project MAVI-CHRX-CT94-0625.

Phase correlation is a frequency domain method for estimating planar translations. The frequency domain approach allows global methods exploiting the whole image information and leading to unsupervised techniques. This work extends phase correlation, whose robustness for estimating planar translations is well known, to the case of planar rotations. The resulting algorithm proves to be very effective and robust, as the presented examples show.

1. INTRODUCTION

Phase correlation is one of the most robust methods for estimating planar translations: it is a global approach because it operates in the frequency domain using the whole image information and not just a selected subset of the image, as feature-based methods do. The good characteristics of phase correlation for estimating planar translations were a strong motivation for its extension to the estimation of planar rotations. An original algorithm for estimating planar rotations inspired by phase correlation is presented in this work. Practical experimentation on real imagery confirms the expected robustness of the method.

This paper has four sections. Section 2 adapts and extends the theory motivating the original phase correlation for estimating translations to the case of rotations. Section 3 presents an algorithm implementing the angular phase correlation derived in the previous section. Section 4 draws the conclusions.

2. ESTIMATION OF PLANAR ROTATIONS BY MEANS OF PHASE CORRELATION

Phase correlation has been developed in order to estimate planar translations between image pairs and has been shown to be one of the most robust methods for this task [1-3]. The method relies upon the following concept: let g_1(x), x \in R^2, be an image and g_2(x) = g_1(x - t), t \in R^2, a translated version of g_1(x). Denote by G_i(k) = F[g_i(x) | k], k = [k_x k_y]^T, the 2-D cartesian Fourier transform of g_i(x), i = 1, 2; then

G_2(k) = G_1(k) e^{-j 2\pi k^T t}    (1)
differs from G_1(k) only in a phase shift. Therefore, from

Q(k) = \frac{G_1(k)^* G_2(k)}{|G_1(k) G_2(k)|} = e^{-j 2\pi k^T t}    (2)

and

q(x) := F^{-1}[Q(k) | x] = \delta(x - t)    (3)
one can estimate the translational vector t as the coordinates of the peak of the impulsive signal q(x).

This work extends this idea to the case of planar rotations; in this case let f_1(x) and f_2(x), x \in R^2, respectively denote an image and its version rotated by an angle \varphi, i.e., in cartesian coordinates

f_2(x) = f_1(R(\varphi)^{-1} x)    (4)

where

R(\varphi) = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix} \in SO(2)    (5)

and SO(2) is the group of 2 × 2 special orthogonal matrices. Some examples of image pairs f_1(x) and f_2(x) are shown in Fig. 1. Equation (4) can be conveniently rewritten in the polar coordinates

x = \rho \cos\theta, \quad y = \rho \sin\theta, \quad \rho \ge 0, \; 0 \le \theta < 2\pi    (6)

as

f_2(\rho, \theta) = f_1(\rho, \theta - \varphi)    (7)

where

f_i(\rho, \theta) := f_i(\rho \cos\theta, \rho \sin\theta), \quad i = 1, 2.    (8)
Let F_i(k) = F[f_i(x) | k] be the 2-D cartesian Fourier transform of f_i(x), i = 1, 2; denote polar coordinates in the frequency domain as

k_x = k_\rho \cos k_\theta, \quad k_y = k_\rho \sin k_\theta, \quad k_\rho \ge 0, \; 0 \le k_\theta < 2\pi    (9)

and define the magnitudes of F_i(k) in this reference system as

M_i(k_\rho, k_\theta) := |F_i(k_\rho \cos k_\theta, k_\rho \sin k_\theta)|.    (10)

It can be easily proved [5] that the functions M_i(k_\rho, k_\theta) are related as the images f_i(x), i.e.,

M_2(k_\rho, k_\theta) = M_1(k_\rho, k_\theta - \varphi).    (11)
However, there is a difference between relationships (7) and (11) concerning the periodicity of the functions: the functions f_i(\rho, \theta) have period 2\pi, whereas the functions M_i(k_\rho, k_\theta) have period \pi, owing to the hermitian symmetry of the Fourier transform.
Figure 1. a) Image f_1(x); b), c), d) images f_2(x), versions of f_1(x) rotated respectively by \varphi = 1°, \varphi = 20°, \varphi = 40°.
Let m_i(k_\rho, \alpha), i = 1, 2, be two auxiliary functions, defined along a generic circumference of radius k_\rho as

m_i(k_\rho, \alpha) = \int_{-\pi/2}^{\pi/2} M_i(k_\rho, k_\theta) \, e^{-j 2\pi k_\theta \alpha} \, dk_\theta.    (12)
Notice that the functions m_i(k_\rho, \alpha) are the partial angular Fourier transforms of M_i(k_\rho, k_\theta), i.e., m_i(k_\rho, \alpha) = F[M_i(\cdot, k_\theta) | \alpha]. From the translational theorem of the Fourier transform we obtain

m_2(k_\rho, \alpha) = m_1(k_\rho, \alpha) \, e^{-j 2\pi \alpha \varphi}.    (13)
In order to be independent of the specific circumference, one can build two further auxiliary functions

\mu_i(\alpha) := \int_0^R k_\rho \, m_i(k_\rho, \alpha) \, dk_\rho, \quad i = 1, 2    (14)

where R is a fixed radius. The functions \mu_i(\alpha) are reciprocally related similarly to the functions m_i(k_\rho, \alpha). In fact

\mu_2(\alpha) = \int_0^R k_\rho \, m_2(k_\rho, \alpha) \, dk_\rho = \int_0^R k_\rho \, m_1(k_\rho, \alpha) \, e^{-j 2\pi \alpha \varphi} \, dk_\rho = e^{-j 2\pi \alpha \varphi} \int_0^R k_\rho \, m_1(k_\rho, \alpha) \, dk_\rho = \mu_1(\alpha) \, e^{-j 2\pi \alpha \varphi}.    (15)
The rotational angle \varphi (in the exponent of equation (15)) can be computed from the normalized product

Q(\alpha) = \frac{\mu_1(\alpha)^* \mu_2(\alpha)}{|\mu_1(\alpha) \mu_2(\alpha)|} = e^{-j 2\pi \alpha \varphi}.    (16)

Indeed, the inverse Fourier transform of Q(\alpha) gives

F^{-1}[Q(\alpha) | a] = \delta(a - \varphi) =: q(a, \varphi)    (17)
which, for reference convenience, is called angular phase correlation. The above-mentioned hermitian symmetry of the Fourier transform makes it possible to estimate the angle \varphi only within [-\pi/2, \pi/2). A simple way to extend the estimate beyond this limit is to back-rotate the second image by \varphi and by \varphi + \pi, and to subsequently use translational phase correlation in order to disambiguate the correct back-rotation [4].

3. AN ANGULAR PHASE CORRELATION ALGORITHM

An algorithm for estimating the rotational displacement \varphi can be structured as follows:

i) compute the magnitudes M_i(k_x, k_y), i = 1, 2, of the Fourier transforms in cartesian coordinates of the two images f_i(x);
ii) interpolate the magnitudes M_i(k_x, k_y) on a polar grid in order to obtain M_i(k_\rho, k_\theta);
iii) compute m_i(k_\rho, \alpha) by means of equation (12);
iv) evaluate the functions \mu_i(\alpha) and the normalized product Q(\alpha);
v) compute the angular phase correlation q(a, \varphi);
vi) estimate the rotational angle \varphi as the location of the highest peak in q(a, \varphi).
In practice, image noise and border effects will not lead to a perfectly impulsive function q(a, \varphi). However, extensive experimentation has shown that the angular phase correlation functions q(a, \varphi) exhibit a distinctive peak in correspondence of the true rotational angle \varphi. Fig. 2 shows the functions q(a, \varphi) relative to the images of Fig. 1: the estimates are very accurate. This kind of performance is typical of the proposed algorithm.
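The following Python sketch (our illustration, not the authors' code) implements steps i)-vi) with nearest-neighbour polar resampling; the grid sizes and the maximum radius are assumed parameters:

```python
import numpy as np

def angular_phase_correlation(f1, f2, n_rho=128, n_theta=180, R=0.45):
    """Sketch of the angular phase correlation algorithm; returns the rotation
    estimate in degrees, within the [-90, 90) ambiguity discussed above."""
    H, W = f1.shape
    # i) cartesian Fourier magnitudes, centered with fftshift
    M = [np.abs(np.fft.fftshift(np.fft.fft2(f))) for f in (f1, f2)]
    # ii) resample the magnitudes on a polar grid (nearest neighbour)
    rho = np.linspace(0, R * min(H, W), n_rho)
    theta = np.linspace(-np.pi / 2, np.pi / 2, n_theta, endpoint=False)
    ky = (H // 2 + np.outer(rho, np.sin(theta))).round().astype(int) % H
    kx = (W // 2 + np.outer(rho, np.cos(theta))).round().astype(int) % W
    Mp = [m[ky, kx] for m in M]                    # shape (n_rho, n_theta)
    # iii)-iv) partial angular Fourier transform (12), radial integration (14)
    mu = [np.sum(rho[:, None] * np.fft.fft(mp, axis=1), axis=0) for mp in Mp]
    # iv)-v) normalized product (16) and its inverse transform (17)
    Q = np.conj(mu[0]) * mu[1]
    Q /= np.abs(Q) + 1e-12
    q = np.abs(np.fft.ifft(Q))
    # vi) peak location; each angular bin spans 180/n_theta degrees
    shift = int(np.argmax(q))
    if shift > n_theta // 2:
        shift -= n_theta                           # wrap to negative rotations
    return shift * 180.0 / n_theta
```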
Figure 2. Angular phase correlations q(a, \varphi) relative to the pairs a)-b), a)-c), a)-d) of Fig. 1; the true rotational angles and their estimates are respectively (from top to bottom): \varphi = 1° and \hat{\varphi} = 1°, \varphi = 20° and \hat{\varphi} = 20°, \varphi = 40° and \hat{\varphi} = 40°.
4. CONCLUSIONS

This work presents an extension of the well-known translational phase correlation to the rotational case. The robustness and accuracy of the proposed algorithm have been tested, and experimental evidence of these characteristics is reported. The proposed algorithm retains the good properties typical of translational phase correlation. The only drawback of the method is that it requires a cartesian-to-polar coordinate conversion, which is a numerically delicate and computationally intensive operation. This drawback can be bypassed by resorting to an alternative algorithm for estimating rotations proposed in [5].

REFERENCES
1. C.D. Kuglin and D.C. Hines, "The phase correlation image alignment method", Proc. IEEE 1975 Int. Conf. Cybern. Soc., 1975, pp. 163-165.
2. S. Alliney, "Digital Analysis of Rotated Images", IEEE Trans. Pattern Anal. Machine Intell., Vol. 15, No. 5, pp. 499-504, May 1993.
3. E. De Castro, C. Morandi, "Registration of Translated and Rotated Images Using Finite Fourier Transforms", IEEE Trans. Pattern Anal. Machine Intell., Vol. PAMI-9, pp. 700-703, 1987.
4. L. Lucchese, G.M. Cortelazzo, C. Monti, "Estimation of Affine Transformations between Image Pairs via Fourier Transform", Proc. of ICIP'96, Lausanne, Switzerland, Sept. 1996, Vol. III, pp. 715-718.
5. L. Lucchese, G.M. Cortelazzo, C. Monti, "A Frequency Domain Technique for Estimating Rigid Planar Rotations", Proc. of ISCAS'96, Atlanta, Georgia, May 1996, Vol. 2, pp. 774-777.
Tracking by Cooccurrence Matrix

Lorenzo Favalli (a), Paolo Gamba (a), Andrea Marazzi (a), and Alessandro Mecocci (b)

(a) Dipartimento di Elettronica, Università di Pavia, Via Ferrata, 1, I-27100 Pavia
(b) Facoltà di Ingegneria, Università di Siena, Via Roma, 77, I-53100 Siena
Tracking many targets in complex environments is a problem usually solved by means of the correspondence method, by finding a suitable set of features for the identification of the targets in motion. However, establishing a correspondence between the different representations of the same object in successive frames of a sequence requires a very accurate feature matching step, which is the most problematic part of the whole procedure. In this paper we propose a different solution to the tracking problem for a completely general tracking system, without any need of feature matching, employing the so-called 'cooccurrence matrix'. The system presented is able to localise and track a generic moving target in a real scene without any prior information about it.

1. INTRODUCTION

The target tracking problem solved with the correspondence-based method can be partitioned into: (a) finding an appropriate set of features for the identification of the targets in motion; (b) establishing a correspondence between the representations of the same objects at neighbouring times by analysis of the features; (c) finally, analyzing the motion of the different tracked targets.

The first problem is well known in the literature because of its relationship with pattern recognition. However, when the sequence of images represents a real scene, with objects in movement (like vehicles or persons), many techniques used in pattern recognition do not work efficiently for our purposes. The main reason is that a moving object appears with such large pictorial and morphological differences in two consecutive frames that its features cannot be used in the correspondence algorithm. If there are several objects moving in the scene, another big problem arises: two or more objects could hide each other (occlusion problem) or come so close as to be observed as one (collision problem). The second problem strictly depends on the set of features used and is usually solved by means of cost functions [1]-[4].

In this article a new method is proposed, exploiting the so-called 'cooccurrence matrix'. Once the moving objects are extracted, their trajectory is found, and a comparison is made between the actual image and its prediction, based on the previous motion of the objects. The cooccurrence matrix makes explicit how the actual targets and their predictions match, and gives us important (and, by the way, very simple) hints for solving the problems of occlusion and splitting between objects.
2. EXTRACTION OF THE FEATURES

In our method, the very first feature of a moving target is its shape, as obtained by observing the frames and extracting the differences between the image with the moving objects and the same image without the objects in motion. Therefore, we need a reference image, built by means of the Discrete Gray Level Follower (DGLF) algorithm shown in the following; the aim of the procedure is to obtain a binary image with black regions (the objects in motion) on a white background.

2.1. The Discrete Gray Level Follower algorithm

The DGLF algorithm takes as input a sequence of original images {O_k}. It processes this sequence, extracting the moving objects and giving as output a single reference image R(i,j) representing the environment without the objects in motion. The algorithm works by initially setting R(i,j), for all i,j, to an intermediate gray value (for instance, 128 for an 8-bit image); then, iteratively for every new O_k, R(i,j) is compared with the corresponding pixel value O_k(i,j): if R(i,j) > O_k(i,j) then R(i,j) = R(i,j) - m, else R(i,j) = R(i,j) + m, where m is an integer on the order of unity. If this operation is done for each frame in the sequence, R(i,j) approaches the gray value of an image without moving objects in a finite number of steps, depending upon the value of the constant m. With m = 2, we found that R(i,j) could be considered an actual reference image after 50-60 frames.
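A minimal Python sketch of the DGLF update follows (our illustration; only the step m = 2 is taken from the text):

```python
import numpy as np

def dglf_reference(frames, m=2):
    """Sketch of the Discrete Gray Level Follower: the reference image starts
    at an intermediate gray value and each pixel steps by +/- m toward the
    observed value in every new frame, so it converges to the static
    background.  frames is an iterable of 2-D grayscale arrays."""
    ref = np.full(frames[0].shape, 128, dtype=np.int16)   # intermediate gray
    for frame in frames:
        ref = np.where(ref > frame, ref - m, ref + m)     # follow each pixel
    return np.clip(ref, 0, 255).astype(np.uint8)
```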
2.2. The detection of moving regions

Given R and {O_k}, the following step is to find the objects in motion and to show them as black blobs on a white background. This is done by: low-pass filtering both R and O_k, to improve the result of the following step; computing the difference between R and O_k, to create a sequence of images with high gray values only in the zones of the moving objects; thresholding the sequence D_k(i,j) = |R(i,j) - O_k(i,j)| to obtain a binary image, with black moving blobs on a still white background; and morphologically filtering the resulting binary image in order to eliminate spurious pixels. In Fig. 1 a sequence frame and the resulting binary image are shown.
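The detection chain just described can be sketched as follows (our illustration; the filter size and threshold are assumptions, and SciPy's binary opening stands in for the unspecified morphological filter):

```python
import numpy as np
from scipy import ndimage

def detect_moving_blobs(ref, frame, thresh=25):
    """Sketch of the detection step: low-pass filter both images, take the
    absolute difference D_k, threshold it, and clean the binary mask with a
    morphological opening to remove spurious pixels."""
    ref_f = ndimage.uniform_filter(ref.astype(float), size=3)     # low-pass
    frm_f = ndimage.uniform_filter(frame.astype(float), size=3)
    diff = np.abs(ref_f - frm_f)                                  # D_k(i, j)
    mask = diff > thresh                                          # binary image
    mask = ndimage.binary_opening(mask, iterations=1)             # remove specks
    return mask
```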
3. MOTION ANALYSIS

The binary image gives information on the regions of motion: each blob represents a whole target or a part of it, and can be identified by some simple features, like the position of its centroid, its area, its centroidal profile, and so on (see [5], [6]). As already observed, each blob is generated by a moving object, but there is sometimes no one-to-one correspondence between blobs and objects. In fact, we may have fragmentation: a single object can produce two or more blobs, due to its pictorial conformation (for instance, a person with a dark hat, dark trousers and a white-gray sweater who walks on a white-gray ground generates two blobs: one in correspondence of the hat and one in correspondence of the trousers). Moreover, there are the collision problem (two or more objects can come so close as to generate a single blob), the occlusion problem (an object can hide or partially overlap another object) and the disappearance problem (an object so similar to the background that it does not generate any blob). Because of all these problems it is not
Figure 1. Left: a frame representing a street pavement with many people walking, seen by a camera positioned over it; right: the blobs extracted (representing moving targets or parts of them).
possible to consider only simple correspondence algorithms like [1],[2],[4], since they do not take into account the complexity of the different situations that can occur. The algorithm proposed here is based on the assumption that all the motion phenomena have a temporal correlation; in fact, due to inertia, the movement of a physical entity cannot change instantaneously, and if the frame rate is high enough to avoid dramatic changes between two consecutive frames, a very reasonable assumption is that the objects' motion does not change very much. With these premises, if the motion of a blob is well known in two consecutive frames, it is easy to make a prediction of its position in the following one. The simplest prediction is to set the expected motion vector equal in direction and intensity to the previous one; if ambiguities occur, we can solve them by exploiting the history of the blobs, to discriminate, for instance, the fragmentation of a blob from the separation of two or more previously collided ones. This kind of problem is efficiently faced with the aid of the so-called 'cooccurrence matrix' and the minimisation of a suitable, simple cost function. In fact, with the assumption of temporal correlation it is possible to build an expected image, by finding the expected position of each blob through a shift equal to the displacement of the previous movement. Then we build the intersection between the expected blob set {xb_1, xb_2, ..., xb_M} and the actual image blobs {ab_1, ab_2, ..., ab_{M'}}, forming the M' × M cooccurrence matrix C(i,j), whose rows represent the blobs in the new image, while its columns represent the blobs in the expected image: the matrix element C(i,j) = (ab_i · xb_j)/N², where N is the maximum gray level value in the image, represents the intersection between the new blob i and the expected blob j. The study of this matrix allows finding the correspondence between blobs, the fusion of two or more expected blobs into a new one, and the separation of an expected blob into several new blobs. The complete algorithm is described in the following.
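A minimal sketch of the construction of C follows (our illustration, assuming each blob is available as a binary mask on the same image grid):

```python
import numpy as np

def cooccurrence_matrix(actual_blobs, expected_blobs, N=255):
    """Sketch of the cooccurrence matrix: actual_blobs and expected_blobs are
    lists of binary masks, one per blob.  With blobs stored at gray level N,
    the element (ab_i . xb_j)/N^2 reduces to the number of pixels in which
    new blob i and expected blob j overlap."""
    C = np.zeros((len(actual_blobs), len(expected_blobs)))
    for i, ab in enumerate(actual_blobs):
        for j, xb in enumerate(expected_blobs):
            C[i, j] = np.sum((N * ab) * (N * xb)) / float(N ** 2)
    return C
```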
4. COOCCURRENCE ANALYSIS

Given the cooccurrence matrix and the position of the centroid and area of each blob, the algorithm is able to classify the blobs of the new image into four classes: (1) blobs corresponding to a blob of the expected image; (2) blobs split from a blob of the expected image; (3) blobs generated from the fusion of two or more blobs of the expected image; (4) blobs entering the vision field for the first time. The algorithm is composed of three different levels of analysis. At each level, classified elements are removed from the matrix C and, if more non-zero elements are found, the procedure moves to the following step.

4.1. First level

The search initially covers the following cases:

• Rule A1: simple correspondence between an expected blob i and a new blob i':

C(i', i) ≠ 0 and C(j, i) = 0 ∀ j ≠ i'    (1)

• Rule A2: fusion of two expected blobs i, j into a new blob i':

C(i', i), C(i', j) ≠ 0 and C(k, i), C(k, j) = 0 ∀ k ≠ i'    (2)

• Rule A3: splitting of two new blobs i', j' from an expected blob i:

C(i', i), C(j', i) ≠ 0 and C(i', k), C(j', k) = 0 ∀ k ≠ i    (3)

• Rule A4: appearance of the blob i' near the border of the image:

C(i', k) = 0 ∀ k    (4)

• Rule A5: disappearance of the blob i, previously near the border of the image:

C(k, i) = 0 ∀ k    (5)
• Rule A6: correspondence between expected blobs and new blobs not overlapping any other one: this correspondence is established if an expected blob and a new blob that does not overlap other blobs, and that does not satisfy conditions A4 and A5, minimise the cost function F = w_1 Δθ + w_2 Δi, where F is applied to the centers of the blobs, Δθ is the difference in direction, and Δi is the difference in intensity between the last displacement vector of the trajectory and the displacement vector existing between the expected blob and the new one.

4.2. Second level

The next step is to look for the following cases:

• B1) Fusion of second level: if an expected blob does not overlap anything and does not satisfy rules A4 and A5, we consider it fused into the new blob minimizing F.

• B2) Splitting of second level: if a new blob does not overlap anything and does not satisfy the cases A4 and A5, we consider it split from the expected blob minimizing F.

A sketch of the first-level analysis is given below.
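The first-level scan of the matrix can be sketched as follows (our illustration; it covers only rules A1, A4 and A5, leaving fusion, splitting and the cost function F aside):

```python
import numpy as np

def first_level_analysis(C, tol=1e-9):
    """Sketch of rules A1, A4 and A5 applied to a cooccurrence matrix C
    (rows: new blobs, columns: expected blobs)."""
    nz = np.abs(C) > tol                       # overlap indicator
    events = []
    for j in range(C.shape[1]):                # expected blobs (columns)
        rows = np.flatnonzero(nz[:, j])
        if len(rows) == 0:
            events.append(("A5: disappeared", j))          # Eq. (5)
        elif len(rows) == 1:
            events.append(("A1: matched", int(rows[0]), j))  # Eq. (1)
    for i in range(C.shape[0]):                # new blobs (rows)
        if nz[i].sum() == 0:
            events.append(("A4: appeared", i))             # Eq. (4)
    return events
```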
4.3. Third level

The non-zero elements that remain in the matrix at this step probably refer to blobs with multiple overlapping. For these blobs the correspondences can be found using the minimisation of cost functions like [1],[4]. A different method is proposed here:

• Rule C1: search for rows of C with only one non-zero element. Let this element be C(i,j). If the j-th column has other non-zero elements, they are cancelled if they are not alone in their row:

C(k,j) := 0 unless C(k,j) ≠ 0 and C(k,l) = 0 ∀ l ≠ j    (6)

• Rule C2: search for columns of C with only one non-zero element. Let this element be C(i',j'). If the i'-th row has other non-zero elements, they are cancelled if they are not alone in their column:

C(i',k') := 0 unless C(i',k') ≠ 0 and C(l',k') = 0 ∀ l' ≠ i'    (7)
• Rule C3: search for the rules A1, A2 and A3 with a first-level analysis.

5. TARGET TRACKING AND EXPERIMENTAL RESULTS

After the cooccurrence analysis, the next step is to analyze the correspondences found and give a label to each blob of the image. Then we can track the trajectory of each target during its motion along the frame sequence. Here we define a strategy to deal with the many different situations that can be found:
- a single blob generated by the fusion of multiple expected blobs: all the labels of the expected blobs are written in the new blob's label vector, putting in the first position the label of the biggest expected blob;
- an expected blob whose label vector has only one element (i.e. that corresponds to a blob that was not generated by fusion in the past) splitting into multiple new blobs: the biggest new blob takes the label of the expected blob from which it is generated; the other ones receive new labels;
- the splitting of multiple new blobs from an expected blob whose label vector has more than one element (i.e. a blob that was generated by fusion of other ones): each new split blob is assigned one of the labels of the expected blob. The matching is based on the minimisation of the function F, defined between each pair from the set of the fused blobs and of the split ones.

The complete procedure has been applied to image sequences taken from a camera at a rate of three frames per second. In particular, in Fig. 2, we present four frames of a road traffic scene, where different vehicles can be observed and distinguished. The blobs extracted through the first step of the algorithm are labelled following the above detailed procedure, and continuously tracked along the sequence. In the sub-sequence presented here, two blobs fuse into a single one and, successively, it splits again into two parts. This effect, due to two cars moving in opposite directions and overlapping in the image, is correctly acknowledged. The results of the algorithm, shown in Fig. 3, are not easy to observe in these very few frames, but they allow one to verify that all the moving objects in the scene are correctly tracked. The CPU time required by the analysis is very short, and this recommends the implementation of the method for the real-time analysis of sequences with small frame rates. It allows one to consider its use in traffic control systems for many applications, like, for instance, automatic road traffic controllers.
Figure 2. Four frames of a road sequence showing cars moving in opposite directions.
Figure 3. The blobs extracted from the frames of Fig. 2: they are correctly labelled following the proposed correspondence algorithm.
REFERENCES
1. I.K. Sethi, R. Jain, "Finding trajectories of feature points in a monocular image sequence," IEEE Trans. on Pattern Anal. Mach. Intell., Vol. PAMI-9, No. 1, pp. 56-73, Jan. 1987.
2. V. Salari, I.K. Sethi, "Feature point correspondence in the presence of occlusion," IEEE Trans. on Pattern Anal. Mach. Intell., Vol. PAMI-12, No. 1, pp. 87-91, Jan. 1990.
3. F.P. Ferrie, M.D. Levine, S.W. Zucker, "Cell tracking: a modelling and minimisation approach," IEEE Trans. on Pattern Anal. Mach. Intell., Vol. PAMI-4, No. 3, pp. 277-291, Mar. 1982.
4. J.P. Gambotto, "Segmentation and interpretation of infrared image sequences," Advances in Computer Vision and Image Processing, JAI Press Inc., 1988.
5. Davies, Machine Vision, Academic Press, London, 1991.
6. Davies, Three-dimensional Computer Vision.
Robust pose estimation by marker identification in image sequences*

L. Alparone (a), S. Baronti (b), A. Barzanti (a), A. Casini (b), A. Del Bimbo (c), and F. Lotti (b)

(a) Dipartimento di Ingegneria Elettronica, University of Florence, via di S. Marta, 3, I-50139 Firenze, Italy.
(b) Istituto di Ricerca sulle Onde Elettromagnetiche "Nello Carrara" - CNR, Via Panciatichi, 64, I-50127 Firenze, Italy.
(c) Dipartimento di Sistemi e Informatica, University of Florence, via di S. Marta, 3, I-50139 Firenze, Italy.

*Work partially supported by a grant of the National Research Council (CNR) of Italy within the framework of a Special Project on Sensors and Image Processing for Robot Navigation.
Digital image processing techniques are regarded as the most effective and flexible tools in applications aimed at self-locating mobile robots for indoor navigation. A viable approach consists of equipping a robot with a single TV camera and placing some known objects, or markers, into the environment. Once one or more markers have been recognized, an estimate of the position of the camera is derived by finding the spatial transformation which links each of the imaged markers with a reference model stored in a database. This work presents a complete procedure which has proven robust in recognizing polygonal markers and simultaneously estimating the location of the imaging device. The method relies on an inverse perspective transformation, and on a low-level procedure capable of extracting significant vertices from polygons, effective also in conditions of low SNR and poor acquisition.
1. PROBLEM STATEMENT

Among the various tasks of a moving robot, it is generally agreed that position estimation aimed at self-referencing to correct drifts of trajectory, as well as target identification, should be accomplished by vision tools [1]-[3]. The complexity of the environment in which the robot moves plays an important role, and tentative solutions usually need a number of simplifying assumptions. Nevertheless, it is important to investigate feasible solutions as general as possible, but still consistent when simplifications are progressively removed [4][5]. In this work we report on a procedure capable of finding the location parameters of a robot, not only when it moves within a structured environment, in which light conditions are under control and the occurrence of possible obstacles is somewhat limited, but also under more general work conditions. The imaging device is a standard B/W TV camera rigidly secured to the moving robot; frames of the video signal are digitized at a constant rate, and
processed in order to recognize one or more objects within the scene, whose locations and geometry are known (markers), from which a consistent estimate of the position of the imaging device may be derived [6]-[8]. Markers should be designed so as to expedite detection, recognition, and georeferencing; this implies that the choice of the most suitable shape and size, especially for a single marker [6], can be made only after the algorithms for segmentation and location have been assessed, or at least defined. To ensure easy detection, markers should appear as evident objects; this suggests maximizing the contrast between a marker and the surrounding background, e.g., black markers pasted on a white background. The recognition task actually determines the shape of a marker. Since the object vertices are fundamental features of most of the algorithms employed for referencing [6],[9], we adopted polygonal markers. Depending on the marker's shape, this choice also allows some simplifications to be introduced, as well as some constraints to be imposed, whose objective is to enable a check of fitness of the results. In the case of a diamond-shaped marker, for example, symmetry can be used to check the positions of the vertices; if such a marker is placed within the scene (e.g., on a wall) with diagonals aligned horizontally and vertically, the apparent difference in length of the diagonals leads to a direct estimate of the rotation angle by which the camera is viewing the scene. However, we would like to stress that the algorithm adopted for recognition [7] is able to work for any planar object represented by means of the coordinates of its vertices.
2. MARKER RECOGNITION
Many algorithms are effective for segmentation when light conditions are kept constant and the contrast between objects and the background is high. In the presence of a poor signal-to-noise ratio (SNR) (i.e., either low contrast or high noise, or both), an approach based on Laplacian-of-Gaussian (LoG) filtering, followed by detection and validation first of zero-crossing (ZC) points and then of the whole contours, proved to be an effective and reliable procedure for vision tasks in general, and for the present application in particular [10]. In order to find significant vertices of the extracted objects, which are needed by the subsequent procedure, the Teh and Chin algorithm [9] has been adopted to produce polygonal descriptions of the contours. The high-curvature points (the so-called dominant points) which delimit the piecewise linear curve are taken as candidate vertices. Redundant dominant points are pruned by a recursive procedure which is based on the following steps:
• dominant points are selected by thresholding their absolute curvatures to about 30% of the absolute maximum, to reduce the computational effort of the following steps while retaining all the real vertices;
• the minimum allowable distance between consecutive dominant points is established; if two or more points occur within this interval, only the one having the highest curvature survives; the threshold should be related to the shape of the marker, being also proportional to the square root of the embraced area, to account for variable distance;
• the alignment of dominant points is checked: if more than two points are likely to belong to a unique straight line, only the extremes are retained; the confidence interval depends on the accuracy of the segmentation step, and eventually on the SNR work conditions, as pointed out in [10].

Once a suitable set of likely vertices has been obtained, it is compared with the vertices of each of the reference markers in the database. This is accomplished through the
Inverse Perspective Transformation (IPT) algorithm [7].

With reference to Figure 1, let us define two coordinate systems. The first, (X, Y, Z), is centered on the camera, the Z axis being coincident with its optical axis in such a way that the imaging plane of the camera lies on the plane z = f. The second, (X_o, Y_o, Z_o), is centered on the object, with the X_o and Y_o axes lying within the object plane and Z_o normal to the plane itself. We shall identify the slant angle σ as the angle between the Z and Z_o axes, and the tilt angle τ as the angle between the X axis and the projection of Z_o onto the image plane. If C is the distance between the camera and the point defined by the intersection of the Z axis with the object plane, the object-camera distance d is a function of the view angle and of C. Introducing the above notation, the IPT is stated as:

Figure 1 - Reference system: slant (σ) and tilt (τ).

X_o = C \cdot \frac{(x \cos τ + y \sin τ)/\cos σ}{f - x \tan σ \cos τ - y \tan σ \sin τ}, \quad Y_o = C \cdot \frac{-x \sin τ + y \cos τ}{f - x \tan σ \cos τ - y \tan σ \sin τ}    (1)
We can note that C, and consequently d, modify the coordinates of the transformed object points in a linear manner: hence, the dimensions of the object are also linearly affected, while its shape is preserved and not warped, thus making recognition depend only on the tilt and slant angles; C will be assumed equal to f without loss of generality. Eq. (1) can be simplified by observing that in standard work conditions the size of the object is much smaller than its distance from the camera (x and y are very small when compared to C and f), thus becoming

X_o = \frac{x \cos τ + y \sin τ}{\cos σ}, \quad Y_o = -x \sin τ + y \cos τ.    (2)
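To make the simplified transform concrete, here is a small Python sketch (our illustration) applying Eq. (2) to a set of image-plane vertices:

```python
import numpy as np

def inverse_perspective(vertices, slant_deg, tilt_deg):
    """Sketch of the simplified IPT of Eq. (2): back-projects image-plane
    vertices (an array of (x, y) pairs) onto the object plane for given
    slant and tilt angles.  Uses the small-object approximation, under
    which the camera constant C = f drops out."""
    s, t = np.radians(slant_deg), np.radians(tilt_deg)
    x, y = vertices[:, 0], vertices[:, 1]
    Xo = (x * np.cos(t) + y * np.sin(t)) / np.cos(s)   # Eq. (2), first term
    Yo = -x * np.sin(t) + y * np.cos(t)                # Eq. (2), second term
    return np.column_stack((Xo, Yo))
```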
Now, let us consider the problem of recognizing an object. The object will be described by means of the Ψ(q) function of the contour of the shape [4][9], where q is the distance of a generic point of the contour from a reference starting point, measured along the contour itself. By adopting these descriptors the recognition algorithm is independent of possible rotations and translations of the markers in the object plane. The recognition problem can be stated as follows: given the shape of an object A in the image plane, verify whether its shape can be related to a reference shape B by means of an IPT. To do this, a new object C is obtained through an IPT with known slant and tilt angles
applied to object A; a set of equations relates variations of the segment lengths and of the polygon angles of objects A, B, and C to the slant angles σ_B and σ_C. The angle σ_B is the unknown slant between objects B and A; σ_C (usually taken equal to 65° [7]) is the slant by which C is obtained when an IPT is applied to A. For a large range of σ and τ, variations of segment length and angular amplitude in the object and in the image plane can be approximated by quadratic functions of σ [7]. Thus, for a given tilt τ, for two nonzero slant angles σ_B, σ_C, and for each vertex k of the polygon, it holds that

\frac{Δq_k(σ_B)}{σ_B^2} = \frac{Δq_k(σ_C)}{σ_C^2} ; \quad \frac{ΔΨ(q_k(σ_B))}{σ_B^2} = \frac{ΔΨ(q_k(σ_C))}{σ_C^2}    (3)
Each of the pair of equations (3) is solved for σ_B. If B is the object which originated object A, all the values found for σ_B are theoretically the same. Actually, the values of σ_B are distributed around a mean with a small standard deviation. For objects with different shapes, the estimated standard deviation is much greater (at least two orders of magnitude). This fact suggests adopting the standard deviation as the parameter which drives the recognition. Up to now, we have supposed the tilt angle to be known; this is not completely true. Actually, the algorithm is iterated for a number of values of τ (every 5°); if recognition is established, the value of τ which gives the minimum S is assumed as the tilt. The georeferencing process ends by inverse perspective transforming the object A with the σ and τ angles previously found; its apparent dimensions provide an estimate of its distance from the imaging point.
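As an illustration of how Eq. (3) drives recognition, the following sketch (our own; the variation arrays dq_B and dq_C are hypothetical inputs produced by the contour analysis) derives a per-vertex slant estimate and the standard deviation S used for the decision:

```python
import numpy as np

def slant_from_variations(dq_B, dq_C, sigma_C_deg=65.0):
    """Sketch of the slant estimation implied by Eq. (3): each vertex k gives
    sigma_B = sigma_C * sqrt(dq_B[k] / dq_C[k]); the spread of the per-vertex
    estimates (their standard deviation S) drives the recognition decision."""
    ratios = np.asarray(dq_B, float) / np.asarray(dq_C, float)
    sigma_B = sigma_C_deg * np.sqrt(np.clip(ratios, 0, None))  # per-vertex
    return sigma_B.mean(), sigma_B.std()   # slant estimate and S
```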
4. RESULTS AND DISCUSSION

In order to assess the robustness and the accuracy of the recognition procedure, many scenes of known geometry have been imaged. Results are specifically presented for a diamond-shaped marker (the most favorable along with the star shape) whose images have been recorded at distances ranging from 50 to 290 cm, for angles between 0° and 75°.
Figure 2 - Imaged sample marker (SNR = 20 dB), its LoG zero-crossings, and detected object (left to right).
Figure 2 displays a sample image of a star-shaped marker taken with a 20 dB SNR, its LoG zero-crossings obtained with a LoG sigma equal to 2, and the contour of the object detected after applying the procedure reported in [10].
260 a function of the distance, for several angles. It is apparent that S < 0 . 8 always, and in most cases S < 0.1. When different objects are compared with the reference marker, S > 3 always. Therefore, the shape recognition stage is not critical: objects that are not markers are discarded by simply thresholding S. Further experiments confirmed that the position / of the marker with respect to the optical axis does ._~~ \,--~-~m\ o /.I.."//, not affect the recognition performance. This fact expedites the task of verifying the results achieved ~o~ F" " by the algorithm concerning the estimation of slant ~= .... "-. ./ ./ / and tilt angles and distance, since the marker can be t ,,. ,,.," / / 0.1 l ' " . . . . . . . . . "" " - " " Y"" placed at the same height as the optical axis of the camera, without losing generality. With this ol assumption the tilt angle is 0 ~ and the slant angle 50 60 70 90 110 1 0 1 0 2 0 290 can be directly obtained by comparing the apparent marker-camera distance (cm) length in the image plane of horizontal and vertical segments of known real length. The apparent length Figure 3 - Standard deviation in slant angle estimation as a function of camera-marker of vertical segments yields a prompt estimate of the distance for several acquisition angles. distance of the objects from the imaging point. Comparison of the results obtained with the procedure described in [7], and with those deriving from the simplifying assumptions of Eq.(2) allows the performance of the proposed recognition procedure to be evaluated with respect to two possible error sources: the first due to the discrete nature of the images used to recover geometry information; the second to be charged to misfunctioning of the recognition procedure. Throughout the case study, the two different estimations have always been in accordance with a maximum difference of 3 ~ for slant. The recognition algorithm always under-estimates the slant; this is a systematic effect which occurs in all tests, probably due to the quadratic approximation of the variations of angles and lengths with the slant cy. Also the tilt angle was always correctly estimated as 0 ~ In Figure 4 the slant estimated for various ~.t (dq.) 67.5 . . . . . ,, , ~ acquisition angles is plotted versus the distance | ~ . . . . .;,. . . . . ~. . . . . _ ~ ~ . . ~ .......... between camera and marker. Estimation errors ~.5' .....:......:..... ......- .....: .....i.......... -L-...., .....:......:..... are independent of the distance, while there is a certain dependency on the acquisition angle, , 0 1 ~ . ~ . . . . ~ . . . . . . . . . . ~ . . . . . . . . . . , . . . . . ~ . . . . . ~. . . . . . ~. . . . which is roughly the same for each distance. A further remark concerns the value of the 7.s ..... "..... slant used to construct the object C necessary for the application of the method. Pizlo and 50 70 90 110 130 150 170 190 210 230 2SO 270 290 d ~ (cm) Rosenfeld [7] suggest using a value of r e f e r e n c e slant (Jc = 65~ We investigated this point by examining the performance of the algorithm Figure 4 - Estimated slant angle as a function of the when this value is varied from 3 7 . 5 ~ to 70 ~. camera-marker distance for several acquisition angles. While marker recognition is still correct, the error affecting the estimation of slant and tilt angles becomes larger. Tilt angle results to be in some cases x = -5 ~ The trend of slant error to the reference angle is shown in Figure 5. 
Indeed, crc = 6 5 ~ is the best value, except for an acquisition angle of 75 ~ representing the limit case [7], due to perspective bindings. ,.-
[
03
._o
N
O0
45
30
15
'/":/
~ l .
~.5:
.....
: ...........
: .....
: .....
22.5
.....
:. . . . . .
:.....
" .....
" ...........
" .....
i .....
! .....
9
..... i .... ---.---4-
: ...........
i .....
~. . . . . .
: .....
? .....
i......
i....
:. . . . . . . . . . .
" .....
:" . . . . .
:......
:....
i .....
i .....
- - i
" .....
For what concerns the accuracy in distance estimation, we have found that the maximum error ranges from 0.5 to 2.5 cm at distances of 50 and 290 cm, respectively, and is lower than 1% in most cases, as shown in Figure 6. Since the percentage error is somewhat independent of the distance, the absolute error is approximately proportional to the distance. Note that at the distance of 170 cm, and for angles ranging between 15° and 45°, the error is zero, since these measures have been taken as references to calibrate the camera.
F i g u r e 5 - Slant error versus acquisition angle, at a distance D=110 cm for different reference slant angles 6c-
F i g u r e 6 - Percentage error in distance estimation varying with distance D (cm) and acquisition angle (degrees).
REFERENCES
1. J.W. Courtney, J.K. Aggarwal, "Robot guidance using computer vision", Pattern Recognition, Vol. 17, pp. 585-592 (1984).
2. S.Y. Chen, W.H. Tsai, "Determination of Robot Locations by Common Object Shapes", IEEE Trans. Robotics Automat., Vol. 7(1), pp. 149-156 (1991).
3. M.R. Kabuka, A.E. Arenas, "Position Verification of a Mobile Robot Using Standard Patterns", IEEE Trans. Robotics Automat., Vol. 3(6), pp. 505-516 (1987).
4. D.H. Ballard, C.M. Brown, Computer Vision, Englewood Cliffs, NJ: Prentice Hall (1982).
5. R.M. Haralick, L.G. Shapiro, Computer and Robot Vision, Vol. I, Reading, MA: Addison-Wesley (1992).
6. M.F. Augusteijn, C.R. Dyer, "Recognition and recovery of the three-dimensional orientation of planar point patterns", CVGIP, Vol. 36, pp. 76-99 (1986).
7. Z. Pizlo, A. Rosenfeld, "Recognition of planar shapes from perspective images using contour-based invariants", CVGIP: Image Understanding, Vol. 56(3), pp. 330-350 (1992).
8. T.S. Huang, A.N. Netravali, "Motion and structure from feature correspondences", Proc. IEEE, Vol. 82(2), pp. 252-268 (1994).
9. C.H. Teh, R.T. Chin, "On the detection of dominant points on digital curves", IEEE Trans. Pattern Anal. Machine Intell., Vol. 11(8), pp. 859-872 (1989).
10. L. Alparone, S. Baronti, A. Casini, "A novel approach to the suppression of false contours originated from Laplacian-of-Gaussian zero-crossings", Proc. ICIP, I, pp. 825-828 (1996).
Markov random field image motion estimation using mean field theory

A. Chimienti, R. Picco, M. Vivalda

Television Study Centre of the National Research Council, Strada delle Cacce 91, 10135 Torino, Italy
Abstract

The estimation of a dense displacement field from image sequences is an ill-posed problem because the data supply insufficient information, so constraints are needed to obtain a unique solution. The main advantages of Markov random field modelling of the displacement field are its capacity to regularize the motion vector field, smoothing it while preserving motion discontinuities, and its power to easily integrate information derived from gradient-based and feature-based motion constraints, obtained by the introduction of other fields in the model. The configurations of the fields are computed by a deterministic iterative scheme derived from mean field theory and the saddle point approximation. The algorithm is well suited for a multigrid approach, to obtain more regular results and to speed up the convergence of the iterations.
1. Markov random fields

A random field is a family of random variables defined on a lattice, a regular set of sites with a neighbourhood system. In this work the nearest-neighbour system is considered, so the neighbourhood of the position (i,j) is the set η(i,j) = {(i,j-1), (i,j+1), (i-1,j), (i+1,j)}. A subset of sites formed either by only one site or by several sites, each of which is a neighbour of all the others, is called a clique. A random field is said to be a Markov random field [1] if each of its variables defined over a site is influenced directly only by variables defined on neighbouring sites, that is

p(x_n = a_n | x_m = a_m, ∀ m ≠ n) = p(x_n = a_n | x_m = a_m, m ∈ η_n).

The Markov property causes the probability distribution of the field to be a Gibbs distribution,

p = \frac{1}{Z} e^{-βU}

where β is a constant, Z, called the partition function, is the normalization constant of the probability distribution, and U, called the potential function, is a sum of basic potential functions defined on cliques.
2. Mean field approximation of the optimal configuration of variables

In the problems expressed by Markov random fields, the solution lies in computing the configuration of variables which maximizes the probability distribution. The classical algorithm for this kind of problem is simulated annealing [1], a stochastic procedure which converges statistically to the optimal solution but presents a great computational complexity. To overcome this problem other techniques have been developed; one of these is the iterated conditional modes [2], a deterministic procedure that converges more quickly but can also get stuck in a suboptimal solution, a local minimum of the potential U. Another technique, known as the mean field approximation [3, 4], solves the problem in a different way: instead of computing the optimal solution, which is the configuration x̂ that minimizes the potential U, the mean field is computed,

x̄ = \sum_{\{x\}} x \, \frac{1}{Z} e^{-βU(x)}

where the sum is over all the values taken by each variable defined on each site. The mean field x̄ is an approximation of x̂; in fact lim_{β → +∞} x̄ = x̂, and in practice x̄ is a good approximation of x̂ for sufficiently high values of β.
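As a toy numerical illustration of this limit (ours, not from the paper), the following sketch evaluates the mean field of a tiny binary field by exhaustive enumeration; as β grows, x̄ approaches the minimizing configuration:

```python
import numpy as np
from itertools import product

def mean_field(U, beta, n_sites=4):
    """Exhaustive mean field for a tiny binary field: x_bar is the Boltzmann-
    weighted average of all configurations, which tends to the configuration
    minimizing U as beta grows.  Only feasible for very small lattices."""
    configs = np.array(list(product([0, 1], repeat=n_sites)), dtype=float)
    w = np.exp(-beta * np.array([U(c) for c in configs]))
    return (configs * (w / w.sum())[:, None]).sum(axis=0)

# Example potential: fidelity to noisy data d plus a 1-D smoothness term,
# a toy analogue of the data/smoothness trade-off used in the model below.
d = np.array([0.0, 0.0, 1.0, 1.0])
U = lambda x: np.sum((x - d) ** 2) + np.sum((x[1:] - x[:-1]) ** 2)

for beta in (1.0, 10.0, 100.0):
    print(beta, mean_field(U, beta))   # tends toward the minimizer (0, 0, 1, 1)
```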
3. Model for the motion vector field determination The computation of the motion vector field from image sequences is an ill-posed problem because images supply insufficient information. Some constraints are so needed to regularize the solution; the classical constraint [5] is the smoothing of the displacement field, but it produces poor results at objects boundaries. With the Markov random field framework [2, 3] it is possible to handle motion discontinuities, smoothing the displacement field inside the objects while preserving motion boundaries, by the introduction of the motion discontinuity fields in the model; furthermore this model can easily integrate information derived from gradient based motion constraints, valid in the regular regions of the images, with other knowledge brought by feature based constraints, valid in the edges of the images. The motion vectors and the other variables introduced in the model to handle motion discontinuities and the uncovered areas of the image are considered Markov random fields. The displacement field and the uncovered background field are defined on the lattice of the positions of the pixels, while the motion discontinuities are defined on lattices formed by intermediate positions. The fields of unknown variables are: d = (a,b), the displacement field; h and v, the so called line processes, pointing out the motion discontinuities of the d field in horizontal and vertical directions respectively; they are binary fields, but for reasons that will be explained later they are considered real fields taking values in the interval [0, 1]; s, a field that shows the uncovered areas of the image, the points for which there is not a correspondence with the previous image; it is a binary field too, but is considered real as h and v. The fields of known variables, called observations, are: x t- 1 and x t, the two images at time t - 1 and t ; f t - 1 = (V__~t-l, V_ty-1),the gradientof the i m a g e a t t i m e t _ 1; d, a field defined only on the edges of the image x t, formed by the projection of the displacement d along the direction normal to the edge, computed in an image preprocessing phase. The probability distribution of all the fields is a Gibbs distribution, p ( a , b , h , v,s) = 21e -flU(a,b,h,v,s) with the potential U is so defined
U = 2{~f.. tj
fij
(1-sij)(1-~ij)+ +
(l -
+
The potential
.. __ (xt
F j
(1-hij)+D vij (1-vij)]+
2hrelij Hij ~,ij+ l~d[D h
ij
X(_I
t-aijj-bij
-
)2
causes the minimization of the displaced difference between the two images x t and x t/
1in the
regular areas and creates un uncovered area indicator (s ij = 1 ] in a site where the displaced dif\ /
264 ference is greater then the threshold T, a parameter that balances Fij, the price paid to set sq = 1. This potential is not applied in those sites of the image, the luminance edges, identified by the parameter binary field Yij = 1. In these sites the second potential is enabled, Hij = (aijnxij+bijnYij-
-dij) 2
which causes the minimization of the difference between the projection of the motion vector
(aij•bij)•n••the•nit•e••••(nxij•n••j)•••ma•t•theedgeandthep•e••••s••••••••a•edp••je•tion d-ij ; this potential is weighted by relij, a parameter taking value in the interval [0, 1] that indicates the reliability of the estimated dij. The potential
D^h_ij = 1 - 2 e^(-β_d ‖d_ij - d_i-1,j‖²),

where β_d is a scalar, and its vertical companion D^v_ij smooth the d field in the absence of motion discontinuities (h_ij = 0 and v_ij = 0 respectively) and set motion discontinuities where the potential values are high. These potentials take values in the interval [-1, 1], so they are self-balanced. The horizontal potential

E^h_ij = 1 / [ (x^t_ij - x^t_i-1,j)² + ε ],

where ε is a small value that prevents E^h_ij from going to infinity, and the vertical one E^v_ij inhibit the creation of motion discontinuities between pixels of almost equal intensity: motion discontinuities occur in fact at object boundaries, i.e. at the edges of an image. The horizontal potential G^h_ij is defined as

G^h_ij = 1 - 2 e^(-β_s (s_ij - s_i-1,j)²);

for this potential and the vertical one G^v_ij the same considerations made for D^h_ij and D^v_ij hold, with the only difference that they smooth the s field. The weights μ^h_ij and μ^v_ij inhibit (if positive) or excite (if negative) the creation of motion discontinuities everywhere; in this work they are set to zero, but they are kept in the model for their usefulness in the analytic approximation that follows. The scalars λ_f, λ_h, λ_d, λ_e and λ_s are weights for the related potentials. The introduction of the potential H_ij, which offers information complementary to
F_ij, is derived from [2]; the other potentials are taken from [3], but G^h_ij and G^v_ij have been modified because s is now considered a real field, and other potentials that cause the self-interaction of h and v have been eliminated to make possible the analytic computations present in the mean field approximation.
4. Edge matching

The optimal configuration of the motion vector field is the one which minimizes the displaced difference between the two images x^t and x^t-1. The gradient ∇x^t-1 is used to compute the motion field. Given an estimate of the vector d_ij, the new estimate is reached by moving in the direction of the gradient at the position (i, j) - d_ij, or in the opposite one, depending on the sign of
the displaced image difference. This procedure is generally correct, but it suffers from two drawbacks. The first one, known as the aperture problem, is that updating d_ij in the direction of the gradient determines only the projection of the displacement vector along this direction instead of the full vector. Furthermore, choosing the direction on the basis of the sign of the displaced difference may cause moving in the direction opposite to the actual one at the luminance peaks or dips of the images. The first disadvantage is an intrinsic limit of displacement field computation, which is an ill-posed problem, and is overcome, at least partly, by the regularization constraint that considers the co-operation of adjacent positions; the second one, however, does not get sufficient benefit from this procedure. Since this problem occurs at the luminance edges of the images, at points that can be put in correspondence in two consecutive images because they are sufficiently insensitive to noise, it is necessary to find these correspondences. Two discontinuity maps, one for each image, are created, formed by the points whose gradient norm is higher than a suitable threshold. The positions of these points are intermediate with respect to those of the image pixels. For each of these points the direction of the gradient is quantized and stored; eight directions have been chosen, the four of the two Cartesian axes and the four of the two bisectors of the main quadrants. For each point (i, j) of the map of x^t, the corresponding one is searched for in an area of x^t-1 centred on (i, j), among the points that belong to the map of x^t-1 and are characterized by the same gradient direction. At this point the reliability of the correspondence is evaluated by computing the energy of the difference between a block of the image x^t near (i, j) and the block of x^t-1 positioned in the same way with respect to the corresponding point of (i, j). To handle motion discontinuities, for each point of the map of x^t two blocks are considered, placed on opposite sides of (i, j) in the direction of the gradient, because areas on opposite sides of an edge may have different motion. In this way the image points near (i, j) are characterized by two motion vectors, possibly different. Each vector is the most reliable one among all the candidates. If the reliability is higher than a suitable threshold, the edge match is considered valid, and for the image points near (i, j) the projection d̄_ij of the motion vector along the gradient direction is stored. At these points (n, m) the potential H_nm is enabled, as said before, instead of F_nm, which is unreliable there. To improve the precision with which d̄_ij is known, it is possible to consider displacements taking real values, measured in distances between pixels.
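A minimal sketch of the direction quantization and of the correspondence search follows. It simplifies the procedure above (one comparison block per point instead of the two opposite-side blocks, and points assumed far enough from the image border); the function names and parameters are ours, not the paper's:

```python
import numpy as np

def quantize_direction(gy, gx):
    """Quantize a gradient direction into the eight directions used for
    edge matching (the two axes and the two diagonals)."""
    angle = np.arctan2(gy, gx)                 # in [-pi, pi]
    return int(np.round(angle / (np.pi / 4))) % 8

def match_edge_point(i, j, direction, map_prev, dir_prev, img, img_prev,
                     search=4, block=5):
    """Search the discontinuity map of the previous image, in a window
    centred on (i, j), for a point with the same quantized gradient
    direction; reliability is the negated energy of the block difference."""
    half = block // 2
    best, best_rel = None, -np.inf
    b_cur = img[i - half:i + half + 1, j - half:j + half + 1].astype(float)
    for di in range(-search, search + 1):
        for dj in range(-search, search + 1):
            p, q = i + di, j + dj
            if not map_prev[p, q] or dir_prev[p, q] != direction:
                continue
            b_prev = img_prev[p - half:p + half + 1, q - half:q + half + 1]
            rel = -np.sum((b_cur - b_prev) ** 2)
            if rel > best_rel:
                best, best_rel = (p, q), rel
    return best, best_rel
```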
5. Optimal configuration computation

For the line processes h and v, the mean field approximation is used. The partition function Z is the normalization constant of the probability distribution, so the following equation holds:

Z = Σ_{a} Σ_{b} Σ_{s} Σ_{h} Σ_{v} e^(-β U(a,b,s,h,v));

then the mean value h̄_ij can be expressed as

h̄_ij = -(1/β) ∂ln Z / ∂μ^h_ij;
the equation above shows the analytic usefulness of the weight μ^h_ij even if it is set to zero, as said before. The partition function must therefore be calculated; the two sums Σ_{h} Σ_{v} can be computed because h and v are non-interacting discrete fields, and so Z becomes

Z = Σ_{a} Σ_{b} Σ_{s} e^(-β V(a,b,s)),

where V is a suitable potential; the three sums Σ_{a} Σ_{b} Σ_{s} are not computable analytically, so an approximation is made, called the saddle point approximation: the sums are replaced by the product of a constant and the contribution brought by the saddle point (possibly the minimum point) of the potential V:

Z = Σ_{a} Σ_{b} Σ_{s} e^(-β V(a,b,s)) ≈ k e^(-β V(â,b̂,ŝ)),

where k is a scalar and â, b̂ and ŝ satisfy the equations

∂V/∂â_ij = 0,  ∂V/∂b̂_ij = 0,  ∂V/∂ŝ_ij = 0.
Then the mean value h̄_ij becomes

h̄_ij = 1 / ( 1 + e^(β ΔU^h_ij) ),

where ΔU^h_ij is the increase of the potential U when h_ij switches from 0 to 1;
this equation shows the reason why h and v are considered real fields. The saddle point is assumed as an approximation of the optimal configuration of the a, b and s fields. It leads to the iterative scheme

a^(n+1)_ij = a^(n)_ij - (1/k_a) ∂V/∂a_ij,  b^(n+1)_ij = b^(n)_ij - (1/k_b) ∂V/∂b_ij,  s^(n+1)_ij = s^(n)_ij - (1/k_s) ∂V/∂s_ij,

where k_a, k_b and k_s are suitable scalars. The values h̄_ij and v̄_ij are updated after each iteration of the previous equation system. A detailed description of the algorithm can be found in [6].
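A minimal sketch of this outer iteration follows. The callables grad_V, mean_field_h and mean_field_v are assumptions of ours standing in for the derivative and mean-field expressions above; the step sizes and iteration count are illustrative:

```python
import numpy as np

def minimize_fields(a, b, s, grad_V, mean_field_h, mean_field_v,
                    k_a=10.0, k_b=10.0, k_s=10.0, n_iter=200):
    """Gradient descent on the saddle-point potential V, with the
    line-process means recomputed after every sweep (mean field update).

    grad_V(a, b, s, h, v) -> (dV/da, dV/db, dV/ds)."""
    h = np.zeros_like(a)
    v = np.zeros_like(a)
    for _ in range(n_iter):
        dVa, dVb, dVs = grad_V(a, b, s, h, v)
        a = a - dVa / k_a
        b = b - dVb / k_b
        s = np.clip(s - dVs / k_s, 0.0, 1.0)   # s is a [0, 1] field
        h = mean_field_h(a, b, s)              # update line-process means
        v = mean_field_v(a, b, s)
    return a, b, s, h, v
```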
6. Implementation improvements and results

In the derivatives ∂V/∂a_ij, ∂V/∂b_ij and ∂V/∂s_ij the displaced difference between x^t and x^t-1 and the elements of the gradient ∇x^t-1 appear. At the edges of the images these variables may take values greater by one or two orders of magnitude than the values assumed in the regular areas, and this causes poor motion estimates and the creation of wrong motion discontinuities. So, to limit their dynamic range, the components of the gradient are replaced by their signs and the displaced difference of the images is clipped by an exponential-like compression function (a sketch of this step is given after the result tables below). This procedure greatly improves the computation of the motion vectors and avoids the creation of wrong motion discontinuities and uncovered area indicators. The algorithm is well suited to a multigrid approach, useful to obtain a more regular displacement field in images where wide motion is present. The image is split into levels: level 0 corresponds to the original image, and each successive level corresponds to an image obtained from the one at the previous level by low-pass filtering and subsampling. The algorithm is applied first to the images at the coarsest level; then the displacement field obtained, appropriately adapted, constitutes the starting configuration for the iterations at the finer level. This strategy gives more regular results and speeds up the convergence of the algorithm. To improve the handling of uncovered areas it is possible to make a small change in the edge matching procedure, useful especially in synthetic images. The edge matching algorithm gives two motion vectors for each edge luminance point, each with its own reliability; if one is greater than
the other, the lower one is no longer considered valid, and uncovered area indicators with motion discontinuities are set in a suitable region near the point, because this region could belong to an uncovered area. If this hypothesis is wrong, the uncovered area indicator and the motion discontinuities will disappear during the iterations without consequences for the final configuration. The algorithm was tested on a subsampled version of the sequence Mobile & Calendar, rich in detail and complex motion, on a subsampled version of the sequence Brazil, of medium complexity, on the sequence Miss America, of low complexity, and on a synthetic sequence formed by a square moving with known uniform motion over a steady background. Two different criteria have been used to evaluate the correctness of the computed motion fields. For the real sequences, since the true motion field is unknown, the difference between the actual image and the displaced previous one has been considered. The results are compared with those obtained by the standard technique for image difference energy minimization, block matching, with 8 × 8 pixel blocks and a precision of 0.5 pixel in the motion vector components. For the synthetic images, on the contrary, the mean squared error of each component of the motion field with respect to the true motion field is considered; it is computed over the whole image with the exception of the uncovered areas. The results obtained are shown in the following tables. In both cases they show the great improvement given by the dynamic compression mentioned above.

Displaced difference (real sequences):

                        Mobile & Calendar    Brazil    Miss America
  block matching              30.76          10.34         4.19
  MRF                         14.42           5.00         2.16
  MRF, no compression         33.15           8.85         5.27

Mean squared error (synthetic images):

                        m.s.e. dx    m.s.e. dy    m.s.e. tot
  MRF                      0.11         0.05          0.16
  MRF, no compression      0.24         0.17          0.41
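The paper does not give the exact form of the compression function; the sketch below, promised above, uses a tanh-style soft clipping as one plausible "exponential-like" choice, with an illustrative scale constant:

```python
import numpy as np

def compress_displaced_difference(dfd, scale=16.0):
    """Soft-clip the displaced frame difference to the range (-scale, scale),
    limiting its dynamic range at luminance edges."""
    return scale * np.tanh(dfd / scale)

def limited_gradient(grad):
    """Replace each gradient component by its sign, as described above."""
    return np.sign(grad)
```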
References
[1] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell., vol. 6, no. 6, pp. 721-741, Nov. 1984.
[2] F. Heitz, P. Bouthemy, Multimodal estimation of discontinuous optical flow using Markov random fields, IEEE Trans. Pattern Anal. Machine Intell., vol. 15, no. 12, pp. 1217-1232, Dec. 1993.
[3] J. Zhang, G. G. Hanauer, The application of mean field theory to image motion estimation, IEEE Trans. Image Processing, vol. 4, no. 1, pp. 19-33, Jan. 1995.
[4] D. Geiger, F. Girosi, Parallel and deterministic algorithms from MRF's: surface reconstruction, IEEE Trans. Pattern Anal. Machine Intell., vol. 13, no. 5, pp. 401-412, May 1991.
[5] B. K. P. Horn, B. G. Schunck, Determining optical flow, Artif. Intell., vol. 17, pp. 185-203, Aug. 1981.
[6] M. Vivalda, Calcolo del moto tra immagini mediante modellizzazione con campi stocastici di Markov, technical report CSTV-CNR, 96.04, Aug. 1996.
Moving Object Detection in Image Sequences Using Texture Features

Frank Müller^a, Michael Hötter^b and Rudolf Mester^c

^a Institut für Elektrische Nachrichtentechnik, Aachen University of Technology (RWTH), D-52056 Aachen, Germany
^b Robert Bosch GmbH, Research Institute for Communications (FV/SLH), P.O. Box 777777, D-31132 Hildesheim, Germany
^c Institute for Applied Physics, Johann Wolfgang Goethe-University Frankfurt, Robert-Mayer-Str. 2-4, D-60054 Frankfurt am Main, Germany

In this paper, an algorithm for the detection of moving objects in image sequences is presented. The proposed method calculates so-called clique texture features for each block (detector cell) of a still image and detects significant changes between corresponding features of successive frames. When a moving object enters a detection cell, the texture feature will change its value, and this change can be detected. The method is insensitive to small movements of strongly textured areas such as trees moving in the wind. This property is advantageous compared to methods which calculate the pixelwise difference of successive frames or fit the image contents blockwise with low-order polynomials. The computational load to calculate the texture features is low. For detection, i.e. the decision whether a significant change has occurred, the features are considered as samples of a random process in time. The statistics of these processes (one for each detector cell) are estimated online, and statistical hypothesis testing is employed for the decision.

1. INTRODUCTION

Detection of moving objects in video sequences is an important task in a variety of applications. The employed methods vary depending on the specific requirements of the application. The method for object detection presented in this paper is characterized by low computational complexity (allowing inexpensive real-time application) and robustness against false alarms caused by the diffuse motion that is typical of outdoor scenes. The presented algorithm is therefore well suited for surveillance applications. While automatic visual object detection in buildings can often be performed by simple change detection algorithms, the situation is more difficult in the case of outdoor scenes. There, the illumination cannot be controlled; wind may cause camera vibrations and diffuse motion of non-significant objects. Conventional change detection algorithms will often yield false alarms in these situations, for instance in areas which contain leaves moving in the wind. In a typical outdoor environment, a moving object detection algorithm must deal with varying weather conditions such as snow, rain and sudden illumination changes, as well as
with influences from wind (moving trees and leaves, camera vibrations). Hence, the desired algorithm must have the ability to discriminate between significant changes in the scene caused by moving objects and normal scene changes which can be explained by the aforementioned environment. In other words, a classification of presumptive detection events into significant and non-significant ones is required in order to suppress false alarms. Previous approaches to object detection can be categorized into mainly two classes. First, there are temporal difference based algorithms, which evaluate differences between local luminance values of corresponding areas in successive frames. These approaches differ in the way the features employed for detection are obtained. They can be obtained pixelwise by low-pass filtering [1], or blockwise by simple averaging. Another method is least squares fitting of polynomials (usually second order) to the contents of a block [2]. In the latter case, the parameters obtained from the fitting are regarded as local luminance features. Subsequently, we will denote these types of object detection as change detection (ChD). Secondly, there are motion based algorithms, in which a motion flow field between successive frames is estimated and subsequently segmented into regions [3]. Regions containing motion are then regarded as belonging to moving objects; the other regions are classified as background. Previous approaches to introduce more robustness into ChD methods have achieved this at the cost of higher computational complexity. These improved algorithms either need extensive pixelwise computations or postprocessing of the change mask [1][2][4]. The key idea of our approach is that the proposed algorithm uses texture features which are obtained blockwise from the frames of the image sequence. The object detection itself is essentially a temporal change detection algorithm, detecting changes of corresponding texture features between successive frames. It suffices to use simple features, which reflect the local covariance structure of the scene. The subsequent change detection algorithm operates on a small number of these features computed per block, i.e. on a significantly reduced amount of data. This approach results in an efficient object detection scheme with low computational complexity. Since the features are related to the texture of a block, the detection scheme is insensitive to pixelwise luminance changes as they might be caused e.g. by slow motion. For example, in the case of vegetation in front of the sky, the texture (and the related feature values) will essentially remain unchanged even if wind moves the branches and leaves. An object entering or leaving a block (detector cell), however, will cause a change of the feature in almost all cases. Thus the reliability of the object detection can be increased by using suitable texture features.

2. TEXTURE FEATURES
Besides being easy to compute, the features should be shift invariant to a certain extent. Then the value of the feature will change only slightly in the case of small movements of strongly textured objects such as bushes or trees. The features we use are closely related to the autocovariance function of the image signal inside a block.
2.1. Description of Texture Features
We use features which are computed from the gray value signal at pixel pairs whose position relative to each other is fixed. Such a pixel pair is usually called a clique; therefore we denote the features clique texture features. Let us denote by i and j two sites with the coordinates (x_i, y_i) and (x_j, y_j) respectively. A clique set is defined as a set of sites for which x_i - x_j = d_x and y_i - y_j = d_y holds for a fixed displacement d = (d_x, d_y), under the condition that both sites are located in the regarded block. The vector d determines the clique type. For instance, if d_x = 0, the clique is denoted as a vertical clique; if d_y = 0 it is a horizontal clique. If the sites i and j belong to a clique, we call the difference of the signal values s_i - s_j at the sites i, j a clique differential, and the squared term

z_ij = (s_i - s_j)²     (1)

forms a squared clique differential. The summation of all squared clique differentials of a given clique type inside a given image block yields a texture feature

Z = Σ_(i,j) (s_i - s_j)²     (2)

that characterizes (partially) the texture content of the regarded block. Regarding the image signal as the realization of a temporal random vector, the feature Z becomes a random variable, whose statistical distribution can be computed from the distribution of the original gray value vector:

E[Z] = Σ_(i,j) E[(s_i - s_j)²],     (3)

where E[.] denotes expectation.
Assuming stationarity of the image signal in the regarded area, E[s_i] becomes m_s and E[s_i²] becomes σ_s² + m_s² for all i. The covariance cov(s_i, s_j) between two pixels belonging to a clique depends only on the clique type:

cov(s_i, s_j) = c_ij,     (4)

where c_ij denotes the (i, j)-th element of the covariance matrix. For the expectation of the squared clique differential we obtain:

E[(s_i - s_j)²] = 2 (σ_s² - c_ij).     (5)
271
Figure 1. Test image and texture features. Left: test image. Middle: horizontal clique feature. Right: vertical clique feature.
3. DETECTION

A natural scene usually contains areas of differing image content. Therefore the features of different detection cells are processed individually, allowing for spatially varying texture characteristics of the scene. The time sequences of feature values (one sequence per cell) are regarded as realizations of a (vector valued) stochastic process indexed by time. Consequently, the decision rule that tells whether the scene content inside a particular detection cell has significantly changed at a particular time is based on statistical analysis of the corresponding stochastic process. For each detector cell, the feature statistics are approximated by a mean value and a variance for each feature. Both parameters (mean and variance) are estimated recursively from the past. An IIR lowpass filter accomplishes this task with low computational and memory requirements. Using such an IIR filter results in a time recursive estimator for the mean and variance, which handles slowly varying scene conditions automatically. The estimated values m̂, σ̂² are continuously updated as long as no object detection occurs. If the current feature value belongs to the interval [m̂ - a σ̂², m̂ + a σ̂²] with a predefined constant a, the detector will decide that no object is present; otherwise a detection event will occur. For a symmetric unimodal distribution of the regarded feature this decision rule is equivalent to significance testing using the estimated values of mean and variance. A more detailed description of the time recursive estimators used can be found in [5].
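A minimal sketch of one such detector cell follows. The exponential-forgetting IIR update, the forgetting factor alpha and the initialisation are our own simplifications; the exact filter of [5] may differ. The interval test mirrors the decision rule as written above:

```python
class CellDetector:
    """Time-recursive mean/variance estimation with a significance test."""

    def __init__(self, alpha=0.05, a=4.0):
        self.alpha = alpha          # IIR forgetting factor (assumed value)
        self.a = a                  # detection threshold constant
        self.mean = None
        self.var = None

    def update(self, z):
        """Feed one feature sample; return True if a detection event occurs."""
        if self.mean is None:       # initialise from the first sample
            self.mean, self.var = z, 1.0
            return False
        detected = abs(z - self.mean) > self.a * self.var
        if not detected:            # update statistics only while no detection
            self.mean += self.alpha * (z - self.mean)
            self.var += self.alpha * ((z - self.mean) ** 2 - self.var)
        return detected
```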
4. SIMULATION RESULTS

In figure 2, two images from a test sequence are shown together with a corresponding difference image. The sequence shows an outdoor scene with a road and trees in the foreground and in the background. In the second image a person can be seen who entered the scene during the time interval between the two exposures. A human observer easily detects the person walking along the road. However, the images also differ from each other in regions where tree branches move due to wind. A human observer can detect these changes only by close examination of the pictures.
Figure 2. Two images from the test sequence "walk" and corresponding difference image
An object detection algorithm based on the evaluation of temporal pixel differences will almost always "see" these differences and output a false alarm.
Figure 3. Absolute temporal texture feature differences for sequence "walk" with varying detector cell size.

Whereas in such regions particular pixel values may change drastically between frames, texture features show much less variation. At the same time, the temporal variations of the features are high in the detector cells which the person enters. To show this, figure 3 depicts absolute texture feature value differences for varying detector cell sizes. Even if cell sizes of 32 × 32 pixels are used, the difference between the features is much higher at the person's location than in the other areas. The difference of the feature values between successive frames just gives an idea of the performance of the algorithm. In the "tree" areas, the time recursive estimation procedure estimates much higher values of the variance than in the area where the person walks. As a result, the detection algorithm in all cases detected the person without any false alarms. The presented method operates with low computational complexity. Calculation of
the texture features involves computations of the same order as the calculation of a squared difference criterion (as is used in plain ChD algorithms). The time recursive estimation of the mean and variance of the features takes even less computation, due to the data reduction in the feature extraction step. Sharing its low complexity with earlier detection methods, the presented algorithm can additionally deal with complex textured scenes and temporally varying image signal statistics. It is therefore very well suited for outdoor applications, where weather conditions and illumination change continuously and the characteristics of the observed scene cannot be controlled. The robustness and efficiency of the proposed method have been extensively tested in offline simulations as well as in (real-time) online processing of numerous scenes; for further information regarding these evaluations, the reader is referred to [5].

REFERENCES
1. T. Aach, A. Kaup, R. Mester: Statistical model-based change detection in moving video. Signal Processing 31 (1993) 165-180.
2. Y.Z. Hsu, H.-H. Nagel and G. Rekers: New likelihood test methods for change detection in image sequences. Computer Vision, Graphics and Image Processing 26 (1984) 73-106.
3. J.H. Duncan and T.-C. Chou: On the detection of motion and the computation of optical flow. IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (1992) 346-352.
4. T. Aach, A. Kaup, R. Mester: Change detection in image sequences using Gibbs random fields: A Bayesian approach. Proceedings of International Workshop on Intelligent Signal Processing and Communication Systems, Sendai, Japan (1993) 56-61.
5. M. Hötter, R. Mester and F. Müller: Detection and description of moving objects by stochastic modelling and analysis of complex scenes. Signal Processing: Image Communication 8 (1996) 281-293.
Determining velocity vector fields from sequential images representing a salt-water oscillator

A. Nomura^a and H. Miike^b

^a Department of Cultural and International Studies, Yamaguchi Prefectural University, Sakurabatake 3-2-1, Yamaguchi, 753 Japan
^b Department of KANSEI Design and Engineering (Department of Perceptual Sciences and Design Engineering), Yamaguchi University, Tokiwadai 2557, Ube, 755 Japan

With the gradient-based method we can determine two-dimensional velocity vector fields from an image sequence. The method employs a basic constraint equation and additional constraint equations. In this paper, the periodic behavior of salt-water oscillation is taken into account as an additional constraint equation to determine its flow fields. The proposed method is applied to an artificially synthesized image sequence and to a real one representing the salt-water oscillation. Through analysis of the image sequences the usefulness of the proposed method is confirmed.

1. INTRODUCTION

An image sequence represents the projection of a time-varying 3 dimensional (3D) scene onto an image plane. Since various kinds of visualized phenomena are captured by a TV camera and an image acquisition board, digital image processing has broad applications in the fields of scientific measurement, computer vision (the study of realizing an artificial vision system), medical image analysis, industrial measurement and so on. For example, the determination of an optical flow (an apparent velocity vector field of a brightness pattern) brings meaningful information on a 3D scene as follows. In computer vision, the shape and structure of a 3D object in a static scene are recovered from an optical flow by the theory of motion stereo. Horn and Schunck proposed a famous gradient-based method to determine an instantaneous optical flow [1], since their goal was to realize an artificial or robotic vision system. In fluid measurement, Imaichi and Ohmi applied conventional image processing techniques to an image sequence representing fluid dynamics visualized by small particles and slit light illumination [2]. Because of the slit light illumination, depth information is known in advance. Thus, with these techniques two-dimensional distributions of physical variables of a fluid flow are obtained. Our interest in this paper is focused on determining 2D fluid flow fields. The methods to determine the velocity fields are divided into two categories: the matching-based method and the gradient-based one. The former determines a velocity vector by tracing a brightness pattern between two sequential images. On the other hand, the latter determines a velocity vector field by minimizing an error function
employing a basic constraint equation and additional ones. The basic constraint equation represents a relationship between the spatial and temporal partial derivatives of the brightness distribution of the sequence and the two unknown components of a velocity vector. Several additional constraint equations, such as local constancy of a velocity vector field [3] and smoothness of the field [1], have been proposed to determine the unknown variables. While the basic constraint equation can be used in most situations, except for special ones such as non-uniform illumination, the additional constraint equation should be selected appropriately for each situation. In this paper, our goal is to determine a 2D velocity vector field with high accuracy. In particular, we focus on the determination of a 2D slice of a three-dimensionally distributed velocity vector field observed in a salt-water oscillator. We have already confirmed that it has a periodic nature and that its oscillation period is almost constant. Since we can utilize this periodic characteristic as an additional constraint equation in the gradient-based method, we propose a new method to determine an optical flow field having a rigid periodic behavior. We confirmed the usefulness of the proposed method through the analysis of an artificially synthesized image sequence and a real one.

2. ADDITIONAL CONSTRAINT EQUATION AND ITS APPLICATION TO THE GRADIENT-BASED METHOD
Horn and Schunck derived the following basic constraint equation by tracing a brightness pattern [1],
∂f/∂t + u ∂f/∂x + v ∂f/∂y = 0,     (1)
where f(x, y, t) is the spatio-temporal brightness distribution of an image sequence and (u, v) are the two components of a velocity vector. Since the brightness distribution is measured by a TV camera, f(x, y, t) is a known variable. On the other hand, u and v are unknown variables to be estimated. In spite of the two unknown variables, only one constraint equation is available at a pixel in principle. Therefore, additional constraint equations are necessary for obtaining a full solution for the velocity vector components. Fukinuki proposed an additional constraint equation assuming spatial local constancy of a velocity vector field [3], and Horn and Schunck a smoothness constraint equation [1]. Nomura et al. proposed one assuming temporal constancy of velocity vector fields (stationary velocity vector fields) [4] as an additional constraint equation. Consequently, the basic constraint equations obtained at a fixed spatial point along the time coordinate are assumed to have the same velocity vector (the temporal optimization method). Since several additional constraint equations have been proposed, we have to select from among them the proper one representing the characteristics of an image sequence. Let us focus on the image sequence representing a salt-water oscillator. Its velocity vector field is oscillating at a constant period. Temporal changes of the velocity vector components observed in simple oscillating and translating velocity vector fields are shown in Figure 1. From the figure, the same velocity values are observed at the constant interval L (frames) as follows:
u(t) = u(t + L) = u(t + 2L) = u(t + 3L) = ⋯ = u(t + N·L),
v(t) = v(t + L) = v(t + 2L) = v(t + 3L) = ⋯ = v(t + N·L),     (2)
where the oscillation period L is assumed to be constant over N cycles and the time variable is denoted by an integer t defined in the range 0 ≤ t ≤ (L - 1). (In this case, t corresponds to phase.) Equation (2) can be used as an additional constraint equation. Since eq. (2) does not spatially constrain the field, the field determined by the proposed method employing eq. (2) can be expected to have high spatial resolution. The period is assumed to be constant and to be known as prior knowledge; hence we need a method to estimate the period. This additional constraint equation is developed from the temporal constancy constraint we proposed earlier [4]. We now propose a new gradient-based method employing the basic constraint equation eq. (1) and the additional one eq. (2). If an observation point is fixed, the basic constraint equations obtained at that point at time intervals of L frames share the same velocity components. Consequently, we can estimate the two velocity components by minimizing the set of the obtained basic constraint equations as follows:

E = Σ_(n=0..N) [ (∂f/∂t)|_(x,y,t+nL) + u (∂f/∂x)|_(x,y,t+nL) + v (∂f/∂y)|_(x,y,t+nL) ]²,     (3)
where the partial derivatives are evaluated by the method described in the literature [1] and the least squares method is utilized for the minimization. In the matching-based method the additional constraint equation eq. (2) is difficult to take into account; this is the reason why we use eq. (2) as an additional constraint of the gradient-based method.
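Minimizing eq. (3) is an ordinary least squares problem in (u, v): stacking one basic constraint equation per cycle gives an overdetermined linear system in two unknowns. A minimal sketch, assuming the derivative stacks fx, fy, ft have been precomputed (the function name is ours):

```python
import numpy as np

def periodic_flow(fx, fy, ft, t, L, N):
    """Estimate (u, v) at every pixel by least squares over the N + 1 frames
    t, t + L, ..., t + N*L that share the same velocity (eqs. (2) and (3)).
    fx, fy, ft: arrays of shape (frames, height, width)."""
    idx = [t + n * L for n in range(N + 1)]
    A11 = sum(fx[k] * fx[k] for k in idx)      # normal equations of eq. (3)
    A12 = sum(fx[k] * fy[k] for k in idx)
    A22 = sum(fy[k] * fy[k] for k in idx)
    b1 = -sum(fx[k] * ft[k] for k in idx)
    b2 = -sum(fy[k] * ft[k] for k in idx)
    det = A11 * A22 - A12 * A12
    det = np.where(np.abs(det) < 1e-9, np.nan, det)  # flag ill-posed pixels
    u = (A22 * b1 - A12 * b2) / det
    v = (A11 * b2 - A12 * b1) / det
    return u, v
```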
Figure 1. Temporal changes of velocity vector components u(t) and v(t). They are oscillating at a period of L (frames). (Axes: velocity [pixel/frame] vs. time [frame].)
3. ANALYSIS OF AN ARTIFICIALLY SYNTHESIZED IMAGE SEQUENCE

To confirm the usefulness of the proposed method, it is applied to an image sequence. The image sequence is synthesized artificially by translating a brightness pattern on an image plane. The pattern is synthesized by
f_0(x, y) = A { 1 + sin(2πx/λ) · sin(2πy/λ) },     (4)
where 2A = 254 is the maximum brightness value of the pattern and λ = 30 (pixels) is its wavelength. The two components of the translating velocity vector are modulated with time t as

u(t) = u_0 + u_1 sin(2πt/T) + u_2 cos(2πt/T),
v(t) = v_0 + v_1 sin(2πt/T) + v_2 cos(2πt/T),     (5)
where T = 10 (frames) is the period of the temporal sinusoidal changes of the velocity components and the other parameters are as follows: u_0 = 0.4, u_1 = 0.6, u_2 = 0.0, v_0 = 0.3, v_1 = 0.1, v_2 = 0.1 (pixel/frame). The image sequence, having 100 frames (10 cycles) with a spatial resolution of 100 × 100 (pixels), is analyzed by the proposed method and by the ordinary temporal optimization method [4] assuming constancy of the velocity vector fields during 10 frames. The velocity vector field determined at every frame is spatially averaged. Temporal changes of the two components of the averaged velocity vector are shown in Figures 2 and 3. From the figures, the two components determined by the proposed method lie almost on the true lines. Consequently, the usefulness of the method is basically confirmed through this analysis.
Figure 2. Temporal changes of two velocity components. The solid lines represent the true velocity. Symbols represent velocity components determined by the ordinary temporal optimization method from the artificially synthesized image sequence. (Axes: velocity [pixel/frame] vs. time [frame].)
Figure 3. Temporal changes of two velocity components. The solid lines represent the true velocity. Symbols represent velocity components determined by the proposed method from the artificially synthesized image sequence. (Axes: velocity [pixel/frame] vs. time [frame].)
4. ANALYSIS OF A REAL IMAGE SEQUENCE REPRESENTING A SALT-WATER OSCILLATOR

The proposed method is applied to the real image sequence representing a salt-water oscillator. The experimental setup is shown in Figure 4. Around the small hole in the bottom of the inner vessel an oscillating fluid flow is observed [5,6]. The fluid flow is visualized by small particles and laser slit light illumination. The visualized 2D slice of the flow is captured by a TV camera system and an image acquisition system. The sampling frequency is 30 Hz, the spatial resolution of the image plane is 40 × 40 (pixels) and the brightness is quantized into 256 levels. The acquired image sequence consists of 7923 frames (19 cycles). The image sequence is analyzed by the proposed method and by the spatio-temporal optimization one; the latter assumes constancy of the velocity vector field during 19 frames. In addition, both methods assume local constancy of the velocity vector field, with a local size of 7 × 7 (pixels). When more grid points (more basic constraint equations) can be used in the optimization, noise reduction can be expected. Temporal changes of the two velocity components determined at the centre (x = 20, y = 20) by the proposed method are shown in Figure 5. The u(t) component is almost zero, while the v(t) component shows a downward flow. The peak of v(t) is observed at around t = 100 (frames). After the passage of the peak, the flow velocity decreases with time. These changes are consistent with what we observe in the real experiments. On the other hand, such characteristics are not clear in the temporal changes of the velocity vector fields determined by the spatio-temporal optimization.

5. CONCLUSIONS

In this paper, an additional constraint equation representing the characteristics of periodic velocity vector fields was proposed. The gradient-based method employing the basic constraint equation eq. (1) and the proposed one eq. (2) was applied to an artificially synthesized image sequence and to a real one representing salt-water oscillation. Through the analysis of the two kinds of image sequences the usefulness of the proposed additional constraint equation was confirmed. The proposed method focused on the analysis of salt-water oscillation. However, oscillatory flows are often observed in several other situations; for instance, the Karman vortex [2] is one of the typical examples of an oscillatory flow field. Consequently, we can expect the proposed method to be effective for fluid flow field analysis in general. We have previously proposed the generalized gradient-based method [7], which introduced a generalized basic constraint equation representing the effect of non-uniform illumination. Since the additional constraint equation proposed in this paper can also be applied to the generalized gradient-based method, we can determine oscillatory velocity vector fields under the non-uniform illumination frequently observed in practice.

ACKNOWLEDGEMENTS

The authors thank Prof. K. Yoshikawa (Nagoya University) and his student Miss M. Okamura for their experimental help. This work was partly supported by the Grant-in-Aid of the Ministry of Education, Science and Culture of Japan.
Figure 4. Experimental setup of a salt-water oscillator.
Figure 5. Temporal changes of two velocity components determined at the point (x, y) = (20, 20) from the real image sequence (see Figure 4). They are determined by the proposed method employing local constancy of the velocity vector field in addition to the assumption of eq.(2), where the local size is 7 x 7 pixels.
REFERENCES

1. B.K.P. Horn and B.G. Schunck, Artificial Intelligence 17 (1981) 185.
2. K. Imaichi and K. Ohmi, J. Fluid Mechanics 129 (1983) 283.
3. T. Fukinuki, Technical Report of IECE, IE78-67 (1978) 35 (in Japanese).
4. A. Nomura, H. Miike and K. Koga, Pattern Recognition Letters 12 (1991) 183.
5. S. Martin, Geophysical Fluid Dynamics 1 (1970) 143.
6. K. Yoshikawa, S. Nakata, M. Yamanaka and T. Waki, J. Chemical Education 66 (1989) 205.
7. A. Nomura, H. Miike and K. Koga, Pattern Recognition Letters 16 (1995) 285.
H. TRACKING AND RECOGNITION OF MOVING OBJECTS
"LONG-MEMORY" MATCHING OF INTERACTING COMPLEX OBJECTS FROM REAL IMAGE SEQUENCES A. Tesei*, A. Teschioni*, C.S. Regazzoni*, and G. Vemazza**
*Department of Biophysical and Electronic Engineering (DIBE), University of Genova, Via all'Opera Pia 11A, Genova, Italy **Department of Electrical and Electronic Engineering (DIEE), University of Cagliari, Piazza d'Amfi 1, Cagliari, Italy
1.
INTRODUCTION
Computer-assisted surveillance of complex environments is becoming more and more interesting, thanks to recent significant improvements in real-time signal processing. The main role of automatic computation in such systems is to support the human operator in performing tasks such as detecting, interpreting, logging or giving alarms. In the surveillance research field applied to public areas, crowding monitoring is very useful but presents particularly complex problems. Recognizing objects and persons and tracking their movements in complex real scenes by using a sequence of images are among the most difficult tasks in computer vision [1][2]. Object and human motion tracking in 3D real scenes can be achieved by means of Kalman filtering [3][4]; a suitable mathematical model for describing objects and persons and a refined dynamic model for tracking them while moving and reciprocally interacting are needed. Such approaches can provide accurate and robust results even in uncontrolled real-life working conditions. In [4] a method for tracking only a single moving person was presented.*

* The work has been partially supported by the European Communities under Contract ESPRIT-P8433 PASSWORDS.
In this work, this limiting assumption is dropped, and more general and more complex situations are considered, as several objects (persons and others) moving and interacting in real scenes are treated. The paper mainly addresses the two main phases at the basis of object recognition and tracking:

- the selection of a set of image features characterizing each detected mobile object or group of objects, consequently allowing the system to distinguish one object from another;
- the matching procedure, which allows one to recognize a certain object even after various frames in which it disappeared completely or partially ("long-memory" matching).

Thanks to its real-time functioning, accuracy and robustness, the method can be used in real-life surveillance systems.
2. SELECTION AND EXTRACTION OF BLOB INTERNAL CHARACTERISTICS
From each image of the sequence to be analyzed, the mobile areas of the image (i.e., the blobs) are detected by a frame-background difference [4] and analyzed by extracting numerical characteristics (e.g., geometrical and shape properties). Blob analysis is performed by the following modules:

1. Change detection (Fig. 1b): by using statistical morphological operators [5], it identifies the mobile blobs present in the original b/w image (Fig. 1a) that exhibit remarkable differences with respect to the background.

2. Focus of attention (Fig. 1c): by means of a fast image-segmentation algorithm [6], it detects the minimum rectangle bounding each blob in the image (corresponding to single or multiple objects, or parts of an object).
Figure 1. (a) Original image, (b) Change detection image, (c) Focus of attention image (surveillance of a metro station).
3. Measure extractor: it extracts from each blob its perimeter, area, bounding box area, height and width, mean grey level value, and bounding box centre 2D position on the image plane.
3. MOBILE BLOB MATCHING

The module labels with the same number each blob corresponding to the same object, object part or group present in the sequence over time. On the basis of the matching result, each blob can be tracked over time, hence providing a further blob characterization by means of kinematic parameters. Matching is performed in two steps:

A. A first rough result is reached by comparing the lists of blob characteristics referring to the current (time step k) and previous (time step k-1) frames. Blob correspondences are organized as a graph: the nodes of each level are the blobs detected in each frame, and the relationships among blobs belonging to adjacent levels are represented as arcs between the nodes. Arcs are inserted on the basis of the superposition of blob areas on the image plane: if a blob at step (k-1) overlaps a blob at step k, then a link between them is created, so that the blob at step (k-1) is called the "father" of the blob at time step k (its "son"). Different events can occur:

1) If a blob has only one "father", its type is set to "one-overlapping" (type o), and the father's label is assigned to it.
2) If a blob has more than one "father", its type is set to "merge" (type m), and a new label is assigned.

3) If a blob is not the only "son" of its father, its type is set to "split" (type s), and a new label is assigned.

4) If a blob has no "father", its type is set to "new" (type n), and a new label is assigned.
B. Blob matching is refined by substituting, if possible, the new labels with the labels of some blob either belonging to a time step earlier than (k-1), or belonging to the (k-1) frame and erroneously labelled in phase A. This processing phase is based on the comparison of each current blob not labelled "o" with the set of recent previous blobs whose label was inherited by no successive blob, collected in a "long-memory" blob graph. This approach is useful for recovering situations of temporarily wrong splitting of a blob (corresponding to a single object) into more blobs, because of image noise or static occlusions, or of temporary merging of two overlapping objects. The comparison is performed on the basis of those blob shape characteristics which have been tested to be approximately time/scale-invariant. Blob matching provides as output the final graph of blob correspondences over time, in which matched blobs have the same label over time, and each blob is classified with one of the described types.
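As an illustration of phase A, the sketch below builds the father-son links from blob overlaps and assigns the four event types. The blob records, the overlap test and the label counter are our own simplifications of the modules described above:

```python
def label_blobs(prev_blobs, cur_blobs, overlaps, next_label):
    """Phase A of blob matching: assign types o/m/s/n from father-son links.

    prev_blobs, cur_blobs: lists of dicts with a 'label' key.
    overlaps(a, b): True if the two blobs' areas overlap on the image plane.
    next_label: first unused label number; the updated counter is returned."""
    sons = {id(p): 0 for p in prev_blobs}          # sons per father
    fathers = {id(c): [] for c in cur_blobs}
    for p in prev_blobs:
        for c in cur_blobs:
            if overlaps(p, c):
                fathers[id(c)].append(p)
                sons[id(p)] += 1
    for c in cur_blobs:
        f = fathers[id(c)]
        if len(f) == 0:                            # no father: new blob
            c['type'], c['label'] = 'n', next_label
            next_label += 1
        elif len(f) > 1:                           # several fathers: merge
            c['type'], c['label'] = 'm', next_label
            next_label += 1
        elif sons[id(f[0])] > 1:                   # father split into sons
            c['type'], c['label'] = 's', next_label
            next_label += 1
        else:                                      # one-overlapping
            c['type'], c['label'] = 'o', f[0]['label']
    return next_label
```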
4. EXPERIMENTAL RESULTS

Extensive tests on real image sequences were performed in the context of the CEC-ESPRIT Project PASSWORDS, in which two main surveillance application sites were selected: a supermarket and an underground station. Figure 2 shows the result of the blob matching algorithm on a test sub-sequence: each image contains the detected blobs (resulting from blob detection) with their numerical label and type (obtained from blob matching). This example points out in particular the capability of the module to:

1. assign the same label to two blobs before and after their temporary overlapping on the image plane (hence to consider them as the same mobile object) even after several frames (see frames 2 and 5);
2. assign the correct label to a blob which was erroneously classified as new during matching phase A (see frames 2 and 3).
Figure 2. A sequence of images showing critical cases of blob splitting, merging and displacement.
REFERENCES
1. W. Kinzel and E.D. Dickmanns, "Moving humans recognition using spatio-temporal models", XVII-th Congress Int. Soc. Photogrammetry and Remote Sensing, 1992.
2. F. Cravino, M. Delucca and A. Tesei, "DEKF system for crowding estimation by a multiple-model approach", Electronics Letters, vol. 30, no. 5, 1994, pp. 390-391.
3. Y. Bar-Shalom and T.E. Fortmann, "Tracking and Data Association", Academic Press, New York, 1988.
4. A. Tesei, G.L. Foresti and C.S. Regazzoni, "Human Body Modelling for People Localization and Tracking from Real Image Sequences", IPA'95, Heriot-Watt University, UK, July 1995, pp. 806-809.
5. J. Serra, "Image Analysis and Mathematical Morphology. 2: Theoretical Advances", Academic Press, London, 1988.
6. D.H. Ballard and C.M. Brown, "Computer Vision", Prentice Hall, New York, 1982.
Spatial and temporal grouping for obstacle detection in a sequence of road images

Sandra Denasi and Giorgio Quaglia^a

^a Istituto Elettrotecnico Nazionale "Galileo Ferraris", Strada delle Cacce 91, 10135 Torino, Italy. E-mail: [email protected]

Computer vision systems devoted to the driving assistance of vehicles moving on structured roads must fulfil two essential tasks: the localization of the road boundaries and the detection of obstacles on the road. The present paper proposes an algorithm for the early detection and tracking of vehicles, based on active perception of the most significant structures pointed out in correspondence with the road in the image sequence. Perceptual persistence of some structures is used to start up object hypotheses; then a model based grouping process integrates these vague hypotheses along the sequence until a real obstacle is recognized by the description of its main structures.

1. INTRODUCTION
Among the different contributions that computer vision equipment can provide for increasing safety in car driving, lane keeping is the most feasible, and it is helpful mainly on long highway trips. However, this aid could cause a lack of attention, so lane keeping equipment becomes more complete if coupled with obstacle detection devices. Regarding as "obstacles", in the first instance, vehicles moving in the same lane, two kinds of obstacle can be met: faster overtaking vehicles that will disappear on the horizon, and slower or still vehicles that will be reached after a lapse of time. The first ones can be pointed out by analyzing their salient structures, while the latter are hardly perceivable when they appear on the horizon and become more evident as they approach, as shown in figure 1. In the present paper we face the problem of detecting obstacles as soon as possible. However, early detection conflicts with reliable detection, because the structures of vehicles far away from the camera mingle with the background. So we propose an approach based on the integration of edge segmented images along the sequence and on perceptual and geometrical grouping of segments to form meaningful structures for obstacle recognition. Perceptual persistence of peculiar structures is used to start up object hypotheses; then a model based grouping process searches for meaningful groups of segments related to the outline and to parts of a vehicle, in order to distinguish an approaching obstacle from other patterns such as road signs, patches or shadows.
2. THE PROPOSED APPROACH
The obstacle detection process focuses its attention on the area of the image that corresponds to the road. Details about the road boundary detection algorithm can be found in [1]. Since only the rear or front side of cars can be seen from a vehicle moving on the same road, parts such as the windscreen or the rear window, the number plate, the bumper, the lights, the wheels, and also the shadow under the vehicle between the wheels, can be looked for to detect and recognize the vehicles. These parts are usually pointed out by the analysis of horizontal and vertical contours [2,3]. However, edge segmentations of road scenes are strongly cluttered, particularly in those areas where an early detection of obstacles is important: that is, far away, near the horizon. Moreover, because of segmentation noise, edges corresponding to parts of the vehicle appear and disappear along the sequence, and obstacles are nearly imperceptible. Different strategies must then be used to analyze different situations. While approaching a vehicle, three phases can be distinguished: the attention phase monitors the end of the road and detects far away objects that appear; the tracking phase then follows these object hypotheses in order to verify their persistence in successive frames; finally, when the objects are closer and their shape is better visible, the recognition phase validates the hypotheses by searching for structures of the objects that can be matched with known models of vehicles, estimates their position and warns about obstacles.
Figure 1. Images of far and close vehicles and their segmented lines.
3. DETECTION OF OBSTACLE HYPOTHESES
Analyzing segmented images frame by frame, as soon as a likely vehicle appears, does not allow its recognition. Structures of vehicles are too small and too confused with other structures to give reliable indications. Instead, something that suggests an object appears when we observe the entire sequence of frames. Therefore, an object cannot be identified by considering a snapshot of its structures, but by taking into account the persistence of these structures along the sequence of images. That is because, when a vehicle is far away, at the end of the visible road, its position and dimension do not change significantly from one frame to the next. An initial area of attention (AOI) is centered around the end of the road, at the intersection of the left and right road borders previously localized (figure 2a). Because a "loose" model of a vehicle is sufficient to detect it, a far away vehicle is described simply by a group of short horizontal lines, as shown in figure 2b. Persistence of structures is computed using a grid mapped to the image and considering a sequence of N_att frames. All the horizontal lines in the AOI of each image are rasterized, and their pixels contribute with a vote to the total count on the grid. A circular frame buffer is used to update the sum along the sequence, removing the votes of the oldest image and adding the votes of the current frame. When one or more pixels in the grid reach a sufficient number of votes MIN_att, a good probability exists that an object has appeared in that area, and a first hypothesis of obstacle is instantiated. At present, the following values are considered suitable thresholds for detecting reliable persistence of structures: N_att = 10 and MIN_att = 8. Figure 2c shows the resulting integration.
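The persistence grid with its circular buffer can be written compactly. In this sketch the array shapes and the rasterized vote image are our assumptions; each frame contributes one binary image of its horizontal lines inside the AOI:

```python
import numpy as np

class PersistenceGrid:
    """Votes of the last N_att frames, maintained with a circular buffer."""

    def __init__(self, shape, n_att=10, min_att=8):
        self.buffer = np.zeros((n_att,) + shape, dtype=np.uint8)
        self.total = np.zeros(shape, dtype=np.int32)
        self.min_att = min_att
        self.head = 0

    def update(self, votes):
        """votes: binary image of rasterized horizontal lines in the AOI.
        Returns the pixels whose persistence reaches MIN_att."""
        self.total -= self.buffer[self.head]        # drop the oldest votes
        self.buffer[self.head] = votes
        self.total += votes                          # add the current votes
        self.head = (self.head + 1) % len(self.buffer)
        return np.argwhere(self.total >= self.min_att)
```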
Figure 2. (a) Road borders and AOI, (b) H and V lines, (c) line persistence.
4. TRACKING OF THE HYPOTHESES
The detected structures must be tracked for a while, to verify that they reliably belong to a single object and are not part of the background. Very noise-sensitive features, such as corners, and small elements, such as vertical line segments, can hardly give reliable indications. So once again a rough model is used in this phase. In particular, an obstacle is modeled as a parallelepiped, whose projection on the image plane is a rectangle, which must include enough structures to characterize a vehicle. For each frame, firstly the position and size of the AOI are updated according to the detected obstacle position and the known road boundaries. Since the lines belonging to a vehicle are peculiar for their symmetry, a likely vehicle can be pointed out by a peak in their projection profile. Therefore, the nearly horizontal lines in the AOI are projected and accumulated into a horizontal buffer, and the maximum value of the resulting profile is computed. Let x1 and x2 be the endpoints of a line, and let its horizontal projection be defined as

P(x) = 1 if x1 ≤ x ≤ x2, 0 otherwise;

then the projection accumulation is computed as

Acc(x) = Σ_(i=1..n_lines) P_i(x),
Figure 3. (a) Image, (b) horizontal lines, (c) projection accumulator, (d) clusters.
as shown in figure 3c. Finally, by analyzing the distribution of the lines that intersect the abscissa of the peak, clusters of close lines are localized and tracked along the sequence of frames. If a line falls inside the bounding box of an existing cluster, the line joins that cluster and updates it; otherwise it starts a new cluster. Clusters that have not been updated with new lines in the latest frames are removed, while clusters that have reached a sufficient consistency in the sequence are considered to correspond to a reliable obstacle hypothesis and are passed to the last phase for validation (figure 3d).
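The projection profile and the cluster update can be sketched as follows. The line and cluster representations are our assumptions, and the bounding boxes are kept fixed after creation for brevity, whereas the text above lets a cluster be updated by its new lines:

```python
import numpy as np

def projection_profile(lines, width):
    """Acc(x): how many nearly horizontal lines cover each column x."""
    acc = np.zeros(width, dtype=int)
    for x1, x2, y in lines:                 # line endpoints and row
        acc[x1:x2 + 1] += 1
    return acc

def update_clusters(clusters, lines, x_peak, max_idle=3):
    """Assign to each existing cluster the lines that fall inside its
    bounding box; lines intersecting the peak abscissa start new clusters."""
    for cl in clusters:
        cl['idle'] += 1
    for x1, x2, y in lines:
        if not (x1 <= x_peak <= x2):
            continue
        for cl in clusters:
            bx1, bx2, by1, by2 = cl['bbox']
            if bx1 <= x1 and x2 <= bx2 and by1 <= y <= by2:
                cl['lines'].append((x1, x2, y))
                cl['idle'] = 0
                break
        else:                               # no enclosing cluster: new one
            clusters.append({'bbox': (x1, x2, y - 2, y + 2),
                             'lines': [(x1, x2, y)], 'idle': 0})
    return [cl for cl in clusters if cl['idle'] <= max_idle]
```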
5. OBSTACLE RECOGNITION
Clusters of segments can be originated by different patterns, not always corresponding to vehicles; therefore their recognition is fundamental for the validation of the initial hypotheses. Here, recognition means matching an approximate symbolic description of the structures found in the AOI with a vehicle model, emphasizing those structures that can discriminate vehicles from other patterns. Far away obstacle hypotheses have been formulated on a statistical basis. As soon as objects become closer and their structure is more defined, a recognition process can be started, if structures having a high probability of belonging to parts of the vehicle (windows, bumpers, wheels, number plate) are detected. However, uncertainties and segmentation errors hinder a reliable recognition, and again time integration must be exploited. Because both the vehicle carrying the camera and the likely obstacle move, integration becomes hard, and the individual segments must be tracked along the sequence. The proposed strategy is inspired by the human visual system and tries to be simple enough for this real time application. This integration can be outlined in two main steps: the extraction of the main features from the segments belonging to a cluster, and the integration and tracking of these structures until a match with the vehicle model is possible.
5.1. Most salient structure detection
As outlined in previous works [4,5], the most salient features are those that either are characterized by length and contrast above a prefixed threshold, or are arranged in such a way as to be perceived as a single group of segments forming more complex structures that are unlikely to be put together by chance, such as corners, open C-shaped structures and closed, almost rectangular chains of segments. These complex structures are built by algorithms that use rules of collinearity, proximity and parallelism between segments. The segments that fall within the AOI are analyzed to point out salient structures S_i. Each one is stored in an accumulator image I_res and is awarded a vote P_res(S_i) that depends on its saliency. Single segments are granted lower votes, proportional to their length and contrast strength, while the votes rise for corners, C-structures and rectangles. The resulting votes are regarded as directly proportional to the probability of belonging to a real obstacle.
5.2. Perceptual integration and tracking
Each structure of I_res becomes the root of an integration process and is compared with the salient structures of the current image I_curr. A search area is defined around the bounding rectangle of each structure. In order to compensate for the perspective changes of the image due to motion, its position and size are linked to the motion parameters and to the AOI position within the image. Each structure of I_res is searched for a correspondence in the I_curr image, while the new segments of I_curr are aggregated to existing structures of I_res if they comply with the grouping rules.
Figure 4. Images and AOIs (a), most salient edges of the current images (b) and of the integrated images (c).
Simple rules are used to update the votes assigned to each segment stored in the resulting image:
- if a correspondence is found, the segments of I_res are replaced with the corresponding ones of I_curr and their vote is updated to P_res(S_i) = max(P_res(S_i), P_curr(S_i)) + G_SS', where G_SS' is a correspondence grant;
- all segments of the current image that have not been involved in a correspondence are placed in the resulting image at the lowest vote;
- segments with a low vote are raised to a higher one if they contribute to form salient structures;
- segments that cannot be updated lower their vote;
- segments whose vote is lower than a minimum value are rejected from the resulting image.
The segments with the highest votes are taken into account to evaluate new motion parameters and to formulate obstacle hypotheses to be matched with the vehicle model. Incomplete hypotheses are allowed, since the recognition of the most characteristic parts (windows, number plate, bumper) is sufficient to validate the obstacle hypotheses.
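A minimal sketch of these vote-update rules is given below, assuming segments are kept in dictionaries keyed by an identifier; the grant, decay and minimum-vote constants are illustrative, not taken from the paper.

```python
def update_votes(p_res, p_curr, matches, grant=0.1, decay=0.05, v_min=0.05):
    """One integration step over segment votes (all constants illustrative).
    p_res, p_curr: dicts segment id -> vote; matches: (old_id, new_id) pairs."""
    out = {}
    matched_old = {o for o, _ in matches}
    matched_new = {n for _, n in matches}
    for old_id, new_id in matches:
        # corresponding segments: keep the larger vote plus a grant G_SS'
        out[new_id] = max(p_res[old_id], p_curr[new_id]) + grant
    for new_id in p_curr:
        if new_id not in matched_new:
            out[new_id] = v_min            # unmatched new segments enter low
    for old_id, vote in p_res.items():
        if old_id not in matched_old and vote - decay >= v_min:
            out[old_id] = vote - decay     # unmatched old segments decay;
                                           # below v_min they are rejected
    return out
```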
6. RESULTS
Figure 4 shows some preliminary results of the proposed strategy for an approaching vehicle. Three different distances have been taken into account: 26, 17 and 11 meters. The images in column (b) show the most salient segments detected in the current image, while the images in column (c) represent the resulting integrated images. An improvement has been achieved, mainly at far and medium distances, while at the shortest distance the resulting image is confused because of the large displacements between succeeding images, due to the proximity of the obstacle. However, further improvements will be achieved by using hypotheses formulated at medium distance to guide the correspondence search and the recognition of the obstacle.

REFERENCES
1. S. Denasi, G. Quaglia, et al., "Real-time system for road following and obstacle detection", SPIE Machine Vision Applications, Architectures, and Systems Integration III, Boston, pp. 70-79, 1994.
2. A. Meygret and M. Thonnat, "Objects detection in road scenes using stereo data", Proc. Prometheus Pro-Art Workshop on Vision, Sophia Antipolis, pp. 119-130, 1990.
3. M. Xie, L. Trassoudaine, J. Alizon and J. Gallice, "Road obstacle detection and tracking by an active and intelligent sensing strategy", Machine Vision and Applications, 7:165-177, 1994.
4. S. Denasi, G. Quaglia and D. Rinaudi, "The use of perceptual organization in the prediction of geometric structures", Pattern Recognition Letters, vol. 13, n. 7, pp. 529-539, 1992.
5. S. Denasi, P. Magistris and G. Quaglia, "Saliency-based line grouping for structure detection", SPIE Intelligent Robots and Computer Vision XIII: Algorithms and Computer Vision, Boston, pp. 246-257, 1994.
Attitude of a vehicle moving on a structured road
Antonio Guiducci and Giorgio Quaglia
Istituto Elettrotecnico Nazionale "Galileo Ferraris", Strada delle Cacce 91, 10135 Torino, Italy
E-mail: [email protected]

The evaluation of the lateral position of a vehicle moving within a lane (lane keeping) is one of the practical computer vision techniques developed to assist safe driving on structured roads, and present technology is able to solve it reliably in real time. This paper presents a fast algorithm, suitable for detecting and tracking lane boundaries, based on a region growing process driven by a priori knowledge of the scene. An accurate algorithm to calibrate the vision sensor and evaluate the vehicle position within the lane is also presented.

1. INTRODUCTION
Computer vision techniques are raising the interest of car manufacturers, which are planning the development of effective tools for active safe driving. Among the applications presently under development we can mention lane keeping and obstacle avoidance. Lane boundary detection is a fundamental function for the fulfillment of the above tasks, because it makes possible the evaluation of the attitude of the vehicle and restricts the area where other vehicles must be detected to avoid collisions. This paper proposes a real-time implementation of an algorithm for detecting the lane boundaries and controlling the trajectory of a vehicle that moves on a structured road (such as a highway). The process has been split into two independent tasks: (1) accurate localization of the lane; (2) determination of the position and the heading direction of the vehicle within the lane. Output data can be used either to warn the driver with acoustic signals or to control the steering gear in order to keep the vehicle on the proper trajectory. A novel approach, which makes use of a priori knowledge about the environment, has been developed. It arises from the remark that a driver perceives the road as the main gray region in front of the vehicle and automatically bounds the manoeuvre space at the boundaries of this region. A knowledge-based region growing algorithm thus proved to be very effective and gives reliable lane boundaries even in the presence of signs, writings, patches and shadows. The displacement of the lane boundaries is used both to calibrate the camera and to determine the position and heading of the vehicle with respect to the road. The proposed technique has been implemented in real time on the mobile laboratory MOBLAB [1], built with the financial support of the National Project on Transportation of C.N.R.
Figure 1. (a) Search areas; (b) detected lane boundaries.
2. BOUNDARY DETECTION
Since the scenarios of a highway are restricted and simple (road surrounded by unstructured background), it is possible to draw the following models to drive the boundary detection algorithm:
Scene model: the scene displays in its central lower part the region "road", and in the upper part the horizon and the vanishing point of the road boundaries.
Road model: the road and the lane have a triangular shape, their luminance is quite homogeneous, their boundaries are smooth, they do not change significantly between two succeeding images, and they are straight in their lower part.
According to the previous remarks, the boundary detection algorithm has been organized in three modules: the first performs the region growing process and detects the boundaries in the first image, while the second tracks these boundaries along the sequence. A boundary consistency check module completes the process and controls the correctness of the detected borders, starting a new search if wrong results are detected. The lane boundaries coincide with the white lane marks painted on the road, and the strong luminance changes can be used to evaluate their position: positive changes point out the left boundaries, while negative changes point out the right ones. Exploiting the a priori knowledge about the horizontal road signs and the road pattern, the image is analyzed from the bottom (since the bottom lines usually correspond to the road area). Each line is scanned from its center towards the extrema looking for strong changes in luminance, as sketched below. Once a point is detected, the search continues in the previous lines, reducing the search area to a short interval centered on the previous point. Only segments whose length is above a prefixed threshold are considered. Continuous lines are tracked to their end. In the presence of a divided line, the slope of the lowest detected segment is used both to extrapolate the border to the image frame and to start a blind search inside a triangular area for the next segment. The search is repeated for all the succeeding segments. Figure 1a shows the displacement of these search areas for a typical section of highway. Because of temporal continuity along the sequence, the boundaries cannot change from image to image except for a small displacement due to the lateral or heading shift of the vehicle. The boundaries detected in the first image are then used to reduce the search area in the following images to a narrow strip.
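The row-scanning step can be sketched as follows for a single image row, keeping the paper's sign convention (positive luminance change for a left boundary, negative for a right one); the gradient threshold is an arbitrary value chosen for the example.

```python
import numpy as np

def scan_row(row, center=None, thresh=30):
    """Scan one image row from its center towards both extrema and return
    the positions of the first strong luminance changes (left, right)."""
    if center is None:
        center = len(row) // 2
    grad = np.diff(row.astype(int))          # luminance change along x
    left = next((x for x in range(center - 1, 0, -1) if grad[x] >= thresh), None)
    right = next((x for x in range(center, len(grad)) if grad[x] <= -thresh), None)
    return left, right
```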
Some checks are carried out in order to overcome possible troubles due, for example, to arrow marks, lane marks near exit ramps, and writings painted on the road surface. Boundaries are always checked in order to evaluate their correspondence with the road model: the left and right boundaries cannot cross, the lane width on the image plane can only decrease from the bottom to the top of the image, and the boundary displacement cannot significantly differ from the previous ones. If any of the previous checks fails, the boundary is validated only up to the last correct point. Too short boundaries, or boundaries that do not follow the lane model, are rejected and the process restarts looking for an image with proper boundaries. In the meantime, the last proper boundary is used to control the vehicle, assuming that no abrupt changes are possible. Figure 1b shows the detected boundaries in a complex scenario.

3. ATTITUDE EVALUATION
The determination of the vehicle attitude requires the determination of the position and orientation of the camera reference frame OXYZ (Figure 2a) with respect to the road. This can be accomplished by first linking the camera reference to the vehicle reference of unit vectors u, v, w (with w in the vehicle heading direction and u parallel to the ground plane), and then giving the position and orientation of the vehicle reference with respect to the road reference u_R, v_R, w_R (Figure 2b), with w_R along the road direction and v_R orthogonal to the road plane. (The intrinsic calibration of the camera, which maps each pixel to a unit vector in the camera reference, is assumed known by any of the standard procedures such as Tsai's [2].)
Figure 2. Camera, vehicle and road references.
The vehicle reference u, v, w is fixed with respect to the camera and coincides with the road reference u_R, v_R, w_R when the vehicle is still and with its heading direction along the road. When the vehicle is running, its attitude changes and can be specified by giving the rotation matrix between the world and vehicle references, and the position of the latter with respect to the road (that is, the height H of the camera above the ground and its distances W_l and W_r from the left and right borders of the road). The determination of the attitude is performed in two steps. The first step is performed off-line, that is once and for all, and consists in the determination of both the orientation of the vehicle triad with respect to the camera reference and the height H of the camera. The second step consists in the determination, frame by frame, while the vehicle is running, of the orientation of the vehicle triad with respect to the world triad and of the distances W_l and W_r of the camera from the borders of the road (and then also of the road width W = W_l + W_r).

3.1. Vehicle reference and camera height
When the vehicle is horizontal, the unit vector v is perpendicular to the horizontal plane and thus to all the planes that have the horizon line on the image plane as their vanishing line. Then, if the equation of the horizon line on the image plane is ax + by + c = 0,

$$\hat{v} = (a, b, c)^T / \sqrt{a^2 + b^2 + c^2}.$$

The horizon line on the image plane can be determined by framing a sequence of images of a straight horizontal road while the vehicle performs a slow turn. The horizon line is then the straight line through the vanishing points (extrapolated intersections) of the road boundaries in the sequence. As for the unit vector w (heading direction of the vehicle), it is determined by the focus of expansion (FOE) as computed in a straight motion, hence

$$\hat{w} = (x_{FOE}, y_{FOE}, 1)^T / \sqrt{x_{FOE}^2 + y_{FOE}^2 + 1}.$$

The position (x_FOE, y_FOE) of the FOE on the image plane is given by the average of the vanishing points of the road boundaries in a sequence of images taken while the vehicle is travelling along a straight horizontal road.
Figure 3. Determination of camera height.
Finally, the height of the camera can be determined knowing the position on the image plane of the horizon and of the left and right boundaries of a straight horizontal road of known width W_0. Indeed, let a_l x + b_l y + c_l = 0 and a_r x + b_r y + c_r = 0 (a_l > 0, a_r > 0) be the equations of the left and right boundaries on the image plane and R = (x_R, y_R) their (extrapolated) intersection, that is, the vanishing point of the road.
Then

$$\hat{w}_R = (x_R, y_R, 1)^T / \sqrt{x_R^2 + y_R^2 + 1}$$

is the unit vector in the road direction, and $\hat{u}_R = \hat{v} \times \hat{w}_R$ is the unit vector in the direction of a road cross segment. The unit vectors orthogonal to the planes through the focal point O and the two road boundaries are

$$\hat{n}_l = (a_l, b_l, c_l)^T / \sqrt{a_l^2 + b_l^2 + c_l^2}, \qquad \hat{n}_r = (a_r, b_r, c_r)^T / \sqrt{a_r^2 + b_r^2 + c_r^2}.$$

With reference to Figure 3 the following relations can be easily obtained:
$$\vec{PN} = -H \, \frac{\hat{u}_R \cdot (\hat{w}_R \times \hat{n}_l)}{\hat{v} \cdot (\hat{w}_R \times \hat{n}_l)} \, \hat{u}_R, \qquad \vec{NQ} = H \, \frac{\hat{u}_R \cdot (\hat{w}_R \times \hat{n}_r)}{\hat{v} \cdot (\hat{w}_R \times \hat{n}_r)} \, \hat{u}_R, \qquad W_0 = \left| \vec{PN} + \vec{NQ} \right|. \tag{1}$$
From (1), knowing W_0, H can be determined.

3.2. Instantaneous vehicle position, heading direction and road width
The determination of the attitude of the vehicle requires the computation, frame by frame, while the vehicle is running, of the world reference triad u_R, v_R, w_R (that is, of the heading of the vehicle), of the road width W, and of the position of the vehicle inside the road (distance W_l from the left border). We neglect here the changes of the height H of the camera above the ground.
Figure 4. Attitude evaluation.
While running, the vehicle shakes on its suspensions and the instantaneous horizon (the vanishing line of the ground plane) does not coincide with the calibrated horizon (the plane perpendicular to v). To compute the instantaneous horizon (that is, v_R) we use the vanishing point R of the portion of visible road nearest to the camera (see Figure 4). If the radius of curvature and the change of slope of the road are not too great, the portion of road not framed by the camera can be considered planar and straight, and R lies on the instantaneous horizon, in the heading direction (see Figure 4a). For small oscillations we have $\hat{u}_R = \hat{v}_R \times \hat{w}_R \simeq \hat{v} \times \hat{w}_R$, where w_R is the unit vector in the direction of the intersection of the two road boundaries, and v is known from the calibration phase. Hence the instantaneous horizon can be approximated by:

$$\hat{v}_R = \hat{w}_R \times \hat{u}_R \simeq \hat{w}_R \times (\hat{v} \times \hat{w}_R) = \hat{v} - (\hat{v} \cdot \hat{w}_R)\, \hat{w}_R. \tag{2}$$
Recalling equations (1), the position of the vehicle inside the road and the instantaneous road width W (Figure 4b) can be written as:

$$W_l = -H \, \frac{\hat{u}_R \cdot (\hat{w}_R \times \hat{n}_l)}{\hat{v}_R \cdot (\hat{w}_R \times \hat{n}_l)}, \qquad W_r = H \, \frac{\hat{u}_R \cdot (\hat{w}_R \times \hat{n}_r)}{\hat{v}_R \cdot (\hat{w}_R \times \hat{n}_r)}, \qquad W = W_l + W_r. \tag{3}$$
In these equations W_l (distance from the left road border) is less than zero if the vehicle is to the left of the left border, and W_r (distance from the right border) is less than zero if the vehicle is to the right of the right border. Hence, as written, W is always positive. Finally, the angle θ of the heading direction w with respect to the road direction w_R is given by:

$$\cos\theta = \hat{w}_R \cdot \hat{w}, \qquad \sin\theta = (\hat{v}_R \times \hat{w}_R) \cdot \hat{w}.$$
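Equations (2) and (3) translate directly into a few vector operations. The sketch below assumes the unit vectors are given as NumPy arrays; it only illustrates the formulas and is not the MOBLAB implementation.

```python
import numpy as np

def attitude(v, w_R, n_l, n_r, w, H):
    """v: calibrated vertical; w_R: unit vector towards the vanishing point R;
    n_l, n_r: unit normals of the planes through O and the two boundaries;
    w: calibrated heading; H: camera height."""
    v_R = v - np.dot(v, w_R) * w_R            # instantaneous horizon, eq. (2)
    v_R /= np.linalg.norm(v_R)
    u_R = np.cross(v_R, w_R)                  # road cross direction
    W_l = -H * np.dot(u_R, np.cross(w_R, n_l)) / np.dot(v_R, np.cross(w_R, n_l))
    W_r =  H * np.dot(u_R, np.cross(w_R, n_r)) / np.dot(v_R, np.cross(w_R, n_r))
    theta = np.arctan2(np.dot(u_R, w), np.dot(w_R, w))   # heading angle
    return W_l, W_r, W_l + W_r, theta         # eq. (3) and heading
```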
Figure 5. Plot of the position of the vehicle within the lane in a test sequence of 2000 images.
4. RESULTS

The proposed technique has been implemented on a real-time image processor IMAGING TECHNOLOGY 15040 installed on the mobile laboratory MOBLAB. The best performance is obtained processing 12 images per second. Figure 5 shows a plot of the vehicle position within the lane in a test sequence of about 2000 images.

REFERENCES
1. A. Guiducci, G. Quaglia, et al., "MOBLAB: a Mobile Laboratory for Testing Real-Time Vision-Based Systems in Path Monitoring", Proc. SPIE Int. Conf. Photonics for Industrial Applications, Mobile Robots IX, pp. 228-238, Boston, 1994.
2. R.Y. Tsai, "An Efficient and Accurate Camera Calibration Technique for 3D Machine Vision", Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 364-374, Miami Beach, 1986.
An algorithm for tracking pedestrians at road crossing
M. Loria and A. Machi
Istituto di Fisica Cosmica ed Applicazioni dell'Informatica, Via Mariano Stabile 172, 90139 Palermo, Italy
Email: [email protected], cnr.it

This paper presents an algorithm used to detect, track and label moving objects in order to optimise car flow at a pedestrian zebra crossing. The algorithm detects blobs moving on the scene, labels them as pedestrians, cars or unidentified objects, and tracks them along the scene. Pedestrians turn on the car traffic light, and the vision system switches it off as soon as neither pedestrians nor unidentified objects are present in a predefined area of interest of the scene. In the first case study, two sets of sample image series were recorded from a fixed point of view at different times during day and night hours; the algorithm was able to safely manage apparently moving noisy components, and also to correctly recognise and track well-behaving pedestrians and cars. Work is ongoing to improve its efficiency in correctly tracking objects moving on convergent trajectories.

1. INTRODUCTION

Several studies using image processing methodologies to support research in traffic theory and in traffic control systems have been published in the past years [1-3]; some of them are devoted to real-time tracking of moving objects in severe outdoor conditions [4]. This paper presents an algorithm using a conservative approach to optimise the active time of a traffic light at a pedestrian road crossing, while maintaining safety. The problem consists in monitoring a section of a road lane near a zebra crossing, detecting the movement of pedestrians, and maintaining the traffic light in the stop status (for cars) for just the time required by pedestrians to cross the road. Moving cars typically cross the scene in a few seconds, pedestrians in tens of seconds, while shadows due to clouds or trees appear and disappear in hundreds of milliseconds. Normally, but not always, pedestrians come into the scene from the sidewalks and cross the lane orthogonally to the direction of the cars waiting outside the zebra crossing area. Moreover, severe outdoor conditions are sometimes expected and pedestrian safety is the main concern, so algorithm robustness is preferred over efficiency in the minimisation of car stop time.

2. ALGORITHM FLOW

The tracking algorithm uses a reconstructed background picture as a reference image to detect blobs which change their position in subsequent image frames, acquired at regular time intervals (see fig. 1). Background subtraction [4] is a computationally light technique and gives less noisy fields (blobs) than optical flow [5-6] when a stable background image is evaluated, taking advantage of the fixed point of view.
During car flow, the background image is continuously updated to take care of smooth changes in luminosity, of parking cars, and of other slowly time-varying events. If a scene area is suddenly lighted or shadowed during a pedestrian crossing, a ghost blob is detected. For each acquired image frame, a feature extractor evaluates a list containing a symbolic description of each detected moving blob; the description includes measurements of blob features such as position, area, mean gray level and elongation. The newly obtained list is compared with the one inherited from previous frames, and correspondences among new and old blobs are evaluated using a mixed parametric and fuzzy logic approach. The best pairwise associations are selected and a confidence index is assigned to each association: a strong association index is interpreted as evidence of a successfully tracked blob. Successfully tracked blobs are labelled as pedestrians or cars according to the magnitude and direction of their speeds. Newly appearing blobs, and old blobs associated to new ones by a poor confidence index, are labelled as unidentified traversing objects (UTOs); the algorithm keeps memory of UTOs until the next frame, then discards their descriptors. The procedure stops tracking when neither successfully tracked pedestrians nor UTOs are present in the ROI.
Figure 1. Flow chart of the algorithm (background evaluation; grab current frame; moving blobs detection; blob feature evaluation; blob matching; pedestrians identification; test for UTOs in the ROI; turn off traffic light).
3. IMPLEMENTATION DETAILS
3.1. Background evaluation
The background is an image representation of the scene including static or quite slowly moving components (objects) not occluded by fast-moving ones. An object which stops and remains stationary for a meaningful time interval joins the background; another which starts moving comes out of it. In our case we assume as meaningful time constant the maximum time allowed to pedestrians for road crossing (20-30 s). Then a car stopping at the traffic light does not join the background, while a parking car does. The reference needs to be updated to take care of slow variations of scene illumination due to varying weather conditions and to static reflections and shadows: we evaluate and update the reference image continuously during normal car flow. To eliminate fast-moving components we follow the method described by Inoue and Seo in [5]. We acquire frames one per second and process them in groups of three: each frame in the sequence is compared with the previous and following ones. Pixels not varying their intensity more than once are added and normalised. The threshold for change is set to 1.5 times the square root of the pixel intensity. The local averaging process statistically reduces the effects of slowly moving components. At the end of the procedure we obtain both a background reference image and a (local) measure of the variance of the pixel intensities.
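A sketch of one accumulation step of this background reconstruction is given below; interpreting "not varying more than once" as "stable with respect to at least one of the neighbouring frames" is our assumption, as is the handling of the accumulators.

```python
import numpy as np

def accumulate_background(f_prev, f_cur, f_next, bg_sum, bg_cnt):
    """One accumulation step over a triple of frames taken one second
    apart. A pixel of the middle frame contributes only if its change
    stays below 1.5*sqrt(intensity) for at least one neighbouring frame."""
    f_cur = f_cur.astype(float)
    thr = 1.5 * np.sqrt(np.maximum(f_cur, 1.0))
    changed_prev = np.abs(f_cur - f_prev) > thr
    changed_next = np.abs(f_next - f_cur) > thr
    stable = ~(changed_prev & changed_next)   # varied at most once
    bg_sum[stable] += f_cur[stable]
    bg_cnt[stable] += 1
    # the background estimate (and, similarly, the per-pixel variance)
    # is derived from the accumulators once enough triples are processed
    return bg_sum / np.maximum(bg_cnt, 1)
```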
3.2. Moving components segmentation
We assume the local intensity variance as the threshold value able to separate moving components from the background. The frequency of scene sampling is set to half a second, and each acquired frame is subtracted from the background image and thresholded. Each blob in the resulting binary image is supposed to represent one or more objects on the scene that are moving or suddenly changing their luminance. The image is filtered, connected components are identified, and blobs with area greater than a selected threshold are labelled. The area, the coordinates of the minimum enclosing rectangle, the mean gray level and the elongation of each blob are evaluated and put into a list.
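The segmentation step can be sketched as follows, using SciPy's connected-component labelling in place of whatever labelling routine the authors used; the minimum-area value is illustrative.

```python
import numpy as np
from scipy import ndimage

def segment_moving_blobs(frame, background, variance, min_area=50):
    """Subtract the background, threshold by the local intensity variance,
    and keep connected components above a minimum area."""
    diff = np.abs(frame.astype(float) - background)
    mask = diff > np.sqrt(variance)           # local variance as threshold
    mask = ndimage.binary_opening(mask)       # simple noise filtering
    labels, n = ndimage.label(mask)
    blobs = []
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        if xs.size >= min_area:
            blobs.append({"area": int(xs.size),
                          "bbox": (xs.min(), ys.min(), xs.max(), ys.max()),
                          "mean_gray": float(frame[ys, xs].mean())})
    return blobs
```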
3.3. Components tracking
The aim of the tracking step is to find associations between components segmented in subsequent frames, to identify stable ones, and to collect evidence of well-behaved moving components. The procedure searches for the best pairwise associations among the components in the lists from the current and the previous frame. For each pair it computes their distance in the feature space, fuzzifies it, and evaluates an index of affinity. Each feature k is assigned a relative weight w_k, heuristically determined on the basis of the relative discriminating power of the feature in the application [7], and a fuzzy function F_σk which defines the distance metric to be used in the feature subspace (i.e., the meaningfulness of the difference between different values of the feature). The distance A(n,o) between two components n and o is the weighted sum of the distances in the various feature subspaces k. The effect of the fuzzy function F_σk is to enhance similarities and to bias the association index negatively in case the components strongly differ in some features.
If i and j are two components, respectively from the old list o and the new list n, the association index between the components is:

$$A(i, j) = \max\left\{0, \sum_{k=1}^{N} F_{\sigma_k}(C_i^k, C_j^k)\, w_k\right\}$$

where N is the number of features, w_k is the weight of feature k (with Σ_{k=1}^{N} w_k = 1), σ_k is the variance of feature k, and the fuzzifying function is:

$$F_\sigma(a, b) = 2 \exp\left(-\frac{(a-b)^2}{\sigma^2}\right) - 1$$

Two relative indexes, namely the relative association index NRAI(n,o) over the old components and ORAI(n,o) over the new ones, are derived from the association matrix A(n,o) by normalising it by rows and by columns. They express the relative preference of each component for the components in the complementary list, and are used in the subsequent step to make associations:

$$NRAI(n, o) = \frac{A(n, o)}{\sum_{k=1}^{NO} A(n, k)}, \qquad ORAI(n, o) = \frac{A(n, o)}{\sum_{k=1}^{NN} A(k, o)}$$

where NO and NN are respectively the number of old and new components. The two input lists are then scanned iteratively looking for the best pairwise associations. A minimum threshold value is selected for the relative association index. If two components reveal maximum reciprocal association above the confidence level, the new component is labelled as updated and moved to the output list, the old one is discarded, and the list scan is iterated. New components remaining unmatched are labelled, if in the ROI, as unidentified traversing objects (UTOs) and moved to the output list; old ones are discarded if all new components have been updated, else they are moved to the output list and used in the next step to recover UTOs. UTOs are generally related to components moving on colliding paths, which occlude or touch each other, or to sudden changes of luminosity due to reflections, clouds or car light illumination. If a UTO has been revealed inside the ROI, the sequence tracking is reinitialised.
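The association indices can be computed with a few lines of code. The sketch below follows the equations above; the exact argument of the fuzzifying exponential is partly an assumption, since the original formula is only partially legible.

```python
import numpy as np

def association_indices(old_feats, new_feats, weights, sigmas):
    """Association matrix A(n,o) between new and old blobs, and its
    row/column normalisations NRAI and ORAI. Feature vectors are e.g.
    (x, y, area, mean_gray, elongation); weights sum to 1."""
    def f_sigma(a, b, s):
        # fuzzifying function: 1 for identical values, towards -1 for
        # strongly differing ones (exact argument partly assumed)
        return 2.0 * np.exp(-((a - b) ** 2) / (s ** 2)) - 1.0

    A = np.zeros((len(new_feats), len(old_feats)))
    for n, cn in enumerate(new_feats):
        for o, co in enumerate(old_feats):
            s = sum(w * f_sigma(a, b, sg)
                    for a, b, w, sg in zip(cn, co, weights, sigmas))
            A[n, o] = max(0.0, s)
    NRAI = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-9)
    ORAI = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-9)
    return A, NRAI, ORAI
```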
3.4. Pedestrians labelling
If all components have been successfully associated, a search for well-behaving components is performed. Any blob whose speed component parallel to the zebra crossing is predominant over the orthogonal one, and whose speed magnitude is appropriate, is labelled as a pedestrian. If no pedestrians are present in the ROI, tracking is halted and the traffic light status is toggled.
Figure 2. Tracks of well-behaved pedestrians

4. CASE STUDY

To test the performance of the algorithm, 78 sequences were recorded from two different scenes and then played back to simulate real-time conditions. Both scenes show a one-way road near the traffic light and include a section of the lane, the zebra crossing and the sidewalks. The scene was observed from a fixed point of view at an elevated position (15 meters) to minimise perspective effects and occlusions. The ROI was appropriately set to contain the zebra crossing and a limited neighbourhood of it on the lane sides. Frames from the recording were taken two per second, so that cars moving at 40 km/h were present on the scene for at least three frames, while pedestrians remained in the ROI for a few tens of frames. Image frames were smoothed in software and subsampled to a resolution of 128x128 pixels per frame, 8 bits per pixel, and analysed in real time on a SUN Sparcstation 5. Table 1 summarises the results of the test.

Table 1. Experimental results in the case study

          No. of sequences   Optimised   Lost and recovered   Lost
Scene 1   47                 94%         6%                   0%
Scene 2   31                 74%         22%                  4%

In most cases the algorithm was able to correctly stop sequences as soon as no meaningful moving object was present in the ROI. In sequences crowded with pedestrians, or with cars stopping inside the zebra crossing area, the algorithm sometimes failed when tracking blobs with intersecting trajectories. In fact, changes in blob shape due to fusion or occlusion induced an erroneous evaluation of their speed, and some pedestrians were lost for a few frames. At present the labelling of objects as cars or pedestrians relies on the expectation that pedestrians cross the road on a path almost parallel to the zebra crossing and that the direction of motion can be inferred from the apparent movement of the center of mass of the corresponding blobs. In practice, noisy changes in the blob shape induce variations of the coordinates of the center of mass which, in the case of small blobs, affect the apparent direction of movement of the blob. Secondly, if two blobs merge because they move on colliding paths, the shape of the merged blob changes strongly in the subsequent frames and the
center of mass moves in an apparently incoherent fashion. Fig. 3 shows one sequence in which a pedestrian blob successively merges with three car blobs and its track is lost for some frames. A partial solution to the problem was to sacrifice efficiency and to delay the decision for a few seconds at the end of pedestrian tracking; most broken tracks were thus recovered. Work is presently progressing in two directions: exploiting measures of texture [6] to segment blobs while still merged, and adding a forecast strategy to the tracking procedure so that merging can be foreseen and blob reconnection tried after the split.
Figure 3. Sequence with merging among pedestrians and cars

REFERENCES
1. G. Nicchiotti, E. Ottaviani, "Automatic vehicle counting from Image Sequences", in V. Cappellini (ed.), "Time-Varying Image Processing and Moving Object Recognition 3", Elsevier, 1994, pp. 410-417.
2. S.M. Smith and J.M. Brady, "ASSET-2: Real-Time Motion Segmentation and Shape Tracking", IEEE Transactions on PAMI, Vol. 17, No. 8, August 1995.
3. M.K. Teal and T.J. Ellis, "Spatial-Temporal Reasoning Based on Object Motion", in R. Fisher, E. Trucco (eds.), Proceedings of the British Machine Vision Conference BMVC96, Edinburgh.
4. Katsunori Inoue, Wonchan Seo, "Background Image Generation by Cooperative Parallel Processing Under Severe Outdoor Conditions", in Proceedings of IAPR Int. Workshop on Machine Vision Applications MVA92, Tokyo, 1992.
5. J.K. Aggarwal, N. Nandhakumar, "On the computation of Motion from sequences of Images", Proc. IEEE, Vol. 76, No. 8, Aug. 1988.
6. M. Hotter, R. Mester, F. Muller, "Detection and description of moving objects by stochastic modelling and analysis of complex scenes", Image Communications 8 (1996), pp. 281-293.
7. A. M. Darwish, A. K. Jain, "A Rule Based Approach for Visual Pattern Inspection", IEEE Transactions on PAMI, Vol. 10, No. 1, Jan 1988, pp. 56-68.
I. APPLICATION TO CULTURAL HERITAGE
Cultural Heritage: The example of the Consortium Alinari 2000-SOM

by Andrea de Polo, Emanuela Sesti, Roberto Ferrari

Summary: This speech is about the experience of the Consortium Alinari 2000-SOM regarding the indexing and cataloguing of its historical archive. In July 1995, the Consortium Alinari 2000 was created between Alinari and Finsiel. The goal of the Consortium is to index and digitize, for remote retrieval, at least 150.000 images in the next three years.

The Consortium experience in Cultural Heritage

The most important steps of the project are:
- Identify 150.000 vintage images to be used in the project.
- Restore and preserve them in the best possible way. For this task, Alinari has created a "link" with the Opificio delle Pietre Dure, in order to restore and teach photo-conservation.
- Duplicate the images on 35mm roll film.
- Digitize the rolls, using the Photo-CD technology.
- Monitor the color quality of the scanning process, by using Barco Reference Calibrator monitors and by creating a standard control between the scanner, the photographic film, the Photo-CD file and the final output. The files will thereafter be converted into RGB format in order to be usable for ultimate CMYK printing. A spectrocolorimeter, Apple ColorSync 2.0 technology and other hardware/software will be constantly used to assure the most constant quality.
- Color certification by the University of Firenze, using the Vasari scanner granted to the University from the Uffizi Museum, to compare and characterize the color and image quality of the digital images.
- Photo indexing: this work is performed by two groups of people, one for the historical description issues, the other for the visual-semiologic description of the images. The certification of this task is granted to the University of Firenze and the University of Venice (Department of Photography).

Alinari, established as a firm in 1852, began in the past few years to understand the importance of adapting its photographic collection to the new technological discoveries. Alinari decided to follow up those changes, but it was also decided to keep part of the Alinari handwork and historical profile as it is, in order to create a virtual link between the 19th-century Alinari historical Archive and "mood" and the 20th-century world of changes. The Alinari collection preserves over 1.500.000 images; the current cataloguing work is based upon the past indexing experience, with about 150.000 images from the Wulz and Trombetta fonds.

In order to catalogue the rest of the Alinari collection, and by using the past experiences, Alinari and FINSIEL created the "Consortium Alinari 2000-Save Our Memory" that aims:
- to preserve and protect the 1.500.000 Alinari images;
- to promote and provide the maximum exposure to the Alinari photographic collection images;
- to be profitable.

The first steps of the Consortium are to study and to restore, traditionally and with the computer, the Alinari images; then to duplicate them into a database in order to have the data available for remote consultation and prints for the end user. Regarding the restoration, Alinari has established a cooperation with the "Opificio delle Pietre Dure" in order to create the first Italian laboratory devoted exclusively to the restoration of photographs using traditional and computer methods. Regarding the Alinari 2000 project, the main steps are: duplicating the Alinari vintage and historical glass plates onto 35mm film, then scanning and cataloguing them in order to create a data bank of Photo-CDs preserving high resolution images. The distribution will follow two directions: Photo-CD Portfolios and on-line distribution. The Consortium has already produced two CDs, and 15 more are waiting to be mastered. The on-line aspect will use the broadest access platforms (ISDN, ITAPAC and Internet) and the technologies that Finsiel will have available for this specific project. The indexing and cataloguing will be approved by a scientific board that includes professors from the University of Firenze and the University of Venezia.

PROFILE OF THE 2 PARTNERS OF "CONSORTIUM ALINARI 2000 - SAVE OUR MEMORY"

FINSIEL

From the 1st July 1994, the company has been integrated with the incorporation of Finsiel, Agrisiel, Italsiel and Tecsiel. The goal of this operation is to create a new element for the definition of the group itself. Following this concept, a new logo has been presented to the media. Finsiel includes the "Unita' d'Affari Istruzione e Cultura", which is the financial branch of this operation. This branch is dedicated to promotions in the cultural and artistic areas, with new projects. Finsiel is a world leader in communications, and its know-how will bring the transmission of data at ISDN speed between the cultural institutions. The system of the National Italian Library has already been completed into a "Virtual Library".

FRATELLI ALINARI

This is the oldest photographic archive that is still in activity. Established in 1852, Alinari includes: the archive, the photo-library, the printing house, the museum, the library, an exhibition space, and the publishing house. Alinari is the guardian of over 1.500.000 images. The vintage photographic archive ranges from the early 19th-century Daguerreotypes and Albumen prints to contemporary silver halide and thermo prints. Today's challenge is to put the collection into Pro-Photo CD files, in order to make it one of the greatest historical archives, scientifically productive and economically profitable.

For more information please write to: "Consortium Alinari 2000-Save Our Memory", Largo Alinari 15, 50123 Firenze, Italy. Tel: +39 55/2395-229, fax: +39 55/2382-857, e-mail: [email protected], URL: http://www.alinari2000.tecsiel.it

Copyright 1996 Consorzio Alinari 2000-SOM
Color Certification
A. Abrardo, V. Cappellini, A. Mecocci and A. Prosperi
Dipartimento di Ingegneria Elettronica, Università di Firenze, via di S. Marta 3, 50139 Firenze, Italy

1. ABSTRACT

In this paper a method is described to assess the exact color reproduction accuracy of a multispectral scanner capable of acquiring digital images directly from paintings. Different color correction procedures are investigated and compared. In particular, numerical results are reported for linear and quadratic transformations of the recorded seven-stimulus values. Two different standard color charts are considered: the Macbeth color checker chart and the AgfaReference IT8.7/2 chart. The two charts are taken alternatively as Reference and Test chart for the color correction procedure. The concepts of residual and generalized errors during color correction are introduced, showing that a seven-filter acquisition device, such as the VASARI scanner, gives its best results when the Macbeth chart is taken as reference chart and a linear transformation is used.

2. INTRODUCTION

In the reproduction of a color image, the original image is initially recorded using a device that measures the reflected or transmitted energy of the image in a number of different wavelength bands. In practice, the color values of the original image under a particular viewing illuminant are estimated from the data obtained from the recording device. This estimated data is then used by a reproduction device to match the original image under that viewing illuminant. One of the most important goals of high-quality digital image reproduction is the achievement of a high chromatic fidelity to the original painting. In applications such as art archival, it is clearly imperative to obtain accurate color measurements of the original image. One approach to improve the color accuracy of a color measuring device is to use an increased number of color filters [1][2][3][4]. In particular, a method for computing the optimal transmittances of such color filters is demonstrated in [3][4]. As the number of color filters is increased, additional information about the reflectance spectra in the original image is obtained. This additional information is used to improve the color match with the original image under a particular viewing illuminant. In the framework of the EU project ESPRIT-MUSA, in the area of multimedia systems and telematics networks for dissemination of Art-Works, a new digital image acquisition system, the VASARI scanner, was developed and manufactured at the National Gallery in London. As a main result of this project, a new version of the VASARI scanner was installed at the Uffizi Gallery Labs in Florence in February 1995. This device, capable of acquiring digital images directly from paintings, uses seven color filters during the acquisition step, thus achieving a good color reproduction of the original Art-Work.
3. MATHEMATICAL BACKGROUND OF COLOR CORRECTION

In the following, a vector space approach to represent the visible spectrum (400-700 nm) will be considered. The visible spectrum is sampled at N wavelengths: thus, the spectral reflectance or spectral transmittance of an object can be represented by an N-element vector f. If the spectral power distribution of the illuminant is written as an N x N diagonal matrix L, then the radiant power reflected or transmitted by the object is represented by the vector Lf. The recording of a color stimulus is performed by measuring the intensity of filtered light. If n_i represents the transmittance of filter i, the recording process can be modeled as

$$c = N^T L_r f + u \tag{1}$$
where N = [n_1, n_2, ..., n_P], L_r is the recording illuminant, c is the P-stimulus value recorded for spectral reflectance f under illuminant L_r, and u is the additive noise. Three color filters are often used, in which case c is referred to as the tristimulus value. The human eye contains four types of sensors: three types of cones, which are used for color vision, and the rods, which are used for low-luminance vision. We will assume that the luminance level is sufficiently high that only the cone responses need to be considered. The human visual subspace (HVSS) can be defined by any set of three vectors that are a nonsingular linear transformation of the spectral responses of the three types of cones. The response of the human visual system to a radiant spectrum is uniquely determined by the orthogonal projection of that radiant spectrum onto the subspace generated by the three vectors chosen as a basis for the HVSS. In order to introduce a standardization, the Commission Internationale de l'Eclairage (CIE) produced a set of nonnegative vectors which are a nonsingular linear transformation of the spectral responses of the three types of cones; these three vectors are referred to as the CIE XYZ color matching functions [7, Chapter 3]. If the sampled CIE XYZ color matching functions are contained in the columns of the matrix A = [a_1, a_2, a_3], then two radiant spectra f_1 and f_2 visually match if and only if their tristimulus values are equal:

$$A^T f_1 = A^T f_2 \tag{2}$$
In the color correction procedure the recorded data must be transformed to obtain the tristimulus values of the original image. Mathematically, color correction can be described by

$$c' = F(c) = A^T L_v f \tag{3}$$
where c is the recorded data (see (1)), L_v is the viewing illuminant, and F is the ideal color correction transformation. Due to the drastic reduction of information in the recording process, equality can be achieved only under conditions that are rarely met in physical situations. In fact, some of the colors to be reproduced are often outside the color gamut of the acquisition device. In this case it is impossible to find a linear transformation of the recorded data which exactly matches the CIE XYZ color matching functions. For this reason a color correction of the recorded data may be obtained only by a linear minimum mean square error (LMMSE) approach. In this case the corrected color c' is obtained by minimizing the error between the true tristimulus value and its estimate:

$$\epsilon = \left\| A^T L_v f - F(c) \right\|^2 \tag{4}$$
In a low-end acquisition device the number of filters is generally equal to three, i.e. c^T = [c_1, c_2, c_3]. We will consider in the following two different types of color correction transformations: a linear transformation F_l(c) and a quadratic transformation F_q(c). The two transformations are given by the following expressions:

$$F_l(c) = W_l c_l, \qquad F_q(c) = W_q c_q \tag{5}$$

where c_l^T = (1, c_1, c_2, c_3), c_q^T = (1, c_1, c_2, c_3, c_1^2, c_2^2, c_3^2, c_1 c_2, c_1 c_3, c_2 c_3), and W_l and W_q are a 3x4 and a 3x10 matrix, respectively. Denoting by c_0 the actual tristimulus value of the acquired color, that is

$$c_0 = A^T L_v f \tag{6}$$

the matrices W_l and W_q are computed by minimizing the mean square error of (4) over a set of colors for which the true tristimulus values are known. The minimization of the mean square error yields:

$$\nabla E\left[(c_0 - W_l c_l)(c_0 - W_l c_l)^T\right] = 0, \qquad \nabla E\left[(c_0 - W_q c_q)(c_0 - W_q c_q)^T\right] = 0 \tag{7}$$
The matrices W_l and W_q may be computed by solving three equation sets of dimension 4 and 10, respectively. The residual error of the color correction procedure, that is, the error (4) evaluated over the set of known tristimulus values, determines the quality of the color correction procedure. In order to obtain a measure of the residual perceptual error, the CIE L*a*b* color coordinates are considered. In particular, the average distance ΔE_L*a*b* [5,6] between the actual and the corrected colors, referred to as ε̄, is taken as the measure of the color correction procedure. A value of ε̄ greater than unity indicates that the eye can discriminate between the actual and the corrected colors (that is, the color correction procedure suffers from an approximation). When the number of filters of the acquisition device is high enough, the color correction procedure is successful [3]. With the VASARI scanner, having seven filters, after the color correction procedure one can hardly discriminate the corrected colors from the originals.
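For the three-filter case, fitting the matrices of eq. (5) reduces to an ordinary least-squares problem over the reference-chart patches. The sketch below is a generic illustration of eqs. (4)-(7), assuming the chart data are given as NumPy arrays; it is not the VASARI calibration code.

```python
import numpy as np

def fit_color_correction(C, C0, quadratic=False):
    """Least-squares fit of W_l (3x4) or W_q (3x10) of eq. (5).
    C: recorded 3-stimulus values of the reference patches (n x 3);
    C0: their true tristimulus values (n x 3)."""
    c1, c2, c3 = C[:, 0], C[:, 1], C[:, 2]
    cols = [np.ones(len(C)), c1, c2, c3]
    if quadratic:
        cols += [c1*c1, c2*c2, c3*c3, c1*c2, c1*c3, c2*c3]
    X = np.stack(cols, axis=1)                  # c_l or c_q, one row per patch
    W, *_ = np.linalg.lstsq(X, C0, rcond=None)  # minimises eq. (4) over the chart
    return W.T

def correct(W, c):
    """Apply the fitted transformation to one recorded value c = (c1, c2, c3)."""
    c1, c2, c3 = c
    x = [1, c1, c2, c3]
    if W.shape[1] == 10:
        x += [c1*c1, c2*c2, c3*c3, c1*c2, c1*c3, c2*c3]
    return W @ np.array(x)
```

The residual error then corresponds to the mean perceptual distance between the corrected and true values over the reference chart, and the generalized error to the same quantity over the test chart.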
4. THE VASARI SCANNER

Among the goals of a multimedia system that allows high-quality image reproductions for either display or hard-copy format is the achievement of a high chromatic fidelity to the original Art-Works (e.g., paintings). The colors of the digital reproduction should be related to those of the original painting, in order both to give a measure of color fidelity and to provide tools to enhance the chromatic quality of poor reproductions. Some parameters that attempt to measure chromatic differences, as perceived by the human visual system, are reported in [5],[6]. The VASARI scanner is an extremely sophisticated imaging system which allows high-resolution digital color images to be acquired directly from paintings. One of its main features is the high-quality color reproduction when compared to the original painting. Differently from commercial scanning systems working in 3 spectral bands, broadly corresponding to R-G-B, the VASARI scanner employs 7 spectral filters to span the visible wavelengths. In Fig. 1 the transmittances of the 7 filters used by the VASARI scanner are reported.
Fig. 1. Seven filters of the VASARI scanner (transmittance versus wavelength, 400-750 nm, for the Blue, Cyan, Green, Magenta, Orange, Red and Yellow filters).

Color accuracy is ensured by a preliminary calibration procedure of the instrument in the XYZ domain, based upon a least-squares (LS) fit to color patches from special color-checker charts. In order to test the color correction method described in section 2, we have considered two different color-checker charts: the Macbeth chart [3], composed of 24 colors uniformly spread over the visible spectrum, and the AgfaReference IT8.7/2 chart, designed by the IT8 ANSI subcommission and constituted of 264 color patches and 24 levels of gray. Numerical experiments have been carried out considering alternatively one of the two charts as reference chart and the other as test chart. The reference chart is used to compute the matrix W of eqn. (5) and the ΔE_L*a*b* residual error ε̄_r, while the test chart is used to derive the generalized ΔE, referred to as ε̄_g. The generalized error is calculated by applying the matrix W to the P-stimulus values of the test chart. The colors of the Macbeth and of the Agfa charts are characterized by N samples of the spectral reflectance, named f_M and f_A respectively. The corresponding P-stimulus values recorded for spectral reflectances f_M and f_A are:

$$c_M = N_v^T f_M, \qquad c_A = N_v^T f_A \tag{8}$$
where N_v contains the N samples of the seven filters of Fig. 1. In [8] it is demonstrated that a proper sampling may be obtained by considering 10 nm sampling, i.e. N = 30. Thus a value of N = 30 has been considered in eqn. (8). The mathematical model described above may be considered as an ideal acquisition procedure
for a uniform illuminant L. The real P-stimulus values recorded by the VASARI scanner will considerably differ from those of eqn. (8) as a consequence of the physical devices involved in the color acquisition procedure (see [3]). The ideal approach is considered in the following in order to test the color correction procedures described in section 2. In Table I the numerical results are reported for the linear and quadratic transformations. When the Test chart is the Agfa IT-8 chart and the Reference chart is the Macbeth chart we deal with case TA-RM; the opposite case is referred to as TM-RA. The notation quadratic or linear indicates the type of color correction transformation. Note that the computation of the matrix W_q for a seven-filter acquisition device requires minimizing the mean square error of (4) over a set of at least 36 colors, 36 being the dimension of c_q. For this reason it is impossible to compute the matrix W_q when the 24-color Macbeth chart is used as reference chart.

Table I. Ideal acquisition: residual and generalized errors for P = 7

Case                      ε̄_r      ε̄_g
TA-RM - linear            0.3      0.35
TA-RM - quadratic         -        -
TM-RA - linear            0.16     0.73
TM-RA - quadratic         0.029    1.02
In Table II the results obtained in the case of real acquisition are shown, i.e. the actual P-stimulus vectors acquired by the VASARI scanner are considered. Such vectors are obtained by using the preliminary correction for the non-uniform distribution of light and the non-linearities in the response of the filters described in [3]. The cases P = 5, 6, 7 are considered in order to test how the performance depends on the number of filters used during the acquisition. When P = 6 the output of the Magenta filter is not considered during color correction; moreover, both the outputs of the Magenta and of the Red filters are not considered when P = 5. Note that when P = 5 the dimension of c_q becomes 21. In this case it is possible to compute the matrix W_q even when the 24-color Macbeth chart is used as reference chart. A slight degradation in the performance of the VASARI scanner is observed when the number of filters decreases down to 5. The performance degradation observed in Table II, when compared to the ideal case of Table I, is principally due to the non-linearities in the response of the filters, which are not completely removed by the procedures described in [3].

Table II. Real acquisition: residual and generalized errors

Case                      ε̄_r (P=7 / P=6 / P=5)    ε̄_g (P=7 / P=6 / P=5)
TA-RM - linear            2.19 / 2.21 / 2.23        4.75 / 4.82 / 5.45
TA-RM - quadratic         -    / -    / 0.36        -    / -    / 28.9
TM-RA - linear            3.47 / -    / -           6.87 / -    / -
TM-RA - quadratic         2.56 / -    / -           19.19 / -   / -
Kang [9] and Kang and Anderson [10] have experimented with polynomial approximations ranging from linear to cubic equations for relating scanner RGB values to the CIE XYZ tristimulus values of the data set. The average errors on the training set became smaller as the number of polynomial terms increased, while the generalization errors became higher with increasing polynomial order. The same situation is observed in Tables I and II for the ideal and real acquisition devices characterized by P acquisition filters. A color correction procedure applied to a physical device, such as the VASARI scanner, must
be able to approximate reflectance colors which may be quite different from the colors of the Reference chart (as, for example, the colors of a painting). For this reason the generalized error gives a better measure of the reliability of the color correction procedure. Table I suggests that the best results are obtained when the Macbeth chart is taken as reference chart and a linear transformation is used to derive the matrix W.

REFERENCES
1. R. V. Kollarits and D. C. Gibbon, "Improving the color fidelity of cameras for advanced television systems", SPIE Proc., Vol. 1656, 1992.
2. K. Martinez, J. Cupitt and D. Saunders, "High resolution colorimetric imaging of paintings", SPIE Proc., Vol. 1901, 1993.
3. M. J. Vrhel and H. J. Trussell, "Filter Considerations in Color Correction", IEEE Transactions on Image Processing, Vol. 3, No. 2, March 1994.
4. M. J. Vrhel and H. J. Trussell, "Optimal Color Filters in the Presence of Noise", IEEE Transactions on Image Processing, Vol. 4, No. 6, June 1995.
5. Commission Internationale de l'Eclairage, "Recommendation on uniform color spaces, color difference equations, psychometric color terms", Supplement No. 2 to CIE Publication No. 15 (E-2.3.1), 1971/(TC-1.3), 1978.
6. F. J. J. Clarke, R. McDonald, B. Rigg, "Modification to the JPC79 Colour-Difference Formula", Journal of the Society of Dyers and Colourists 100 (1984), 128-132.
7. G. Wyszecki, W. S. Stiles, "Color Science", 2nd Edition, New York, John Wiley & Sons, 1982.
8. H. J. Trussell and Manish S. Kulkarni, "Sampling and Processing of Color Signals", IEEE Transactions on Image Processing, Vol. 5, No. 4, April 1996.
9. Kang, H., "Color scanner calibration", J. Imaging Science and Technology 14, 47-52, 1992.
10. Kang, H.R. and P.G. Anderson, "Neural Network application of the color scanner and printer calibrations", J. Electronic Imaging, 1, 25-135, 1992.
11. W. Niblack, "Digital Image Processing", Prentice/Hall, Denmark, 1986.
Image Retrieval by Contents with Deformable User-drawn Templates
A. Del Bimbo, P. Pala
Dipartimento di Sistemi e Informatica, Università degli Studi di Firenze, Via S. Marta 3, 50139 Firenze, Italy
Abstract
Image retrieval by contents from databases is a major research subject in advanced multimedia systems. Effective image retrieval by contents requires that visual image properties are used, instead of textual labels, to properly index and recover pictorial data. In this paper, we present a technique which combines an elastic and a feature-based approach to evaluate the similarity between user-sketched templates and shapes in the images.
1. Introduction
The visual information contents associated with pictorial data advise against the use of indexing and retrieval based on textual keywords, as traditionally used for text documents. Visual queries exploit natural human capabilities in picture analysis and interpretation and largely reduce the cognitive effort of the user in accessing the image database. In this approach, the user reproduces on the screen an approximate visual representation of the pictorial contents of the images to be retrieved, and retrieval is reduced to matching the user's visual representation against the image representations stored in the database. A number of techniques which deal with content representation and visual retrieval of single images have appeared in the literature; the differences between these approaches are related to the different facets of pictorial data that are taken into account. Picture color distribution and object texture organization have been used as a representation of image contents in [2] and [3]. In this case, queries request images that contain object colors and textures similar to those selected from a menu, and matching is performed by comparing color histograms or using the Euclidean distance in the texture space [2].
Retrieval by contents based on similarity between imaged object shapes and user-drawn sketches has been proposed by a few authors [4], [5], [2], [1]. Unlike retrieval by colors, textures or spatial relationships, here the problem is complicated by the fact that a shape does not have a mathematical definition that exactly matches what the user perceives as a shape. In recent works [4], [1], elastic approaches have been used to provide an effective measure of perceptual similarity between shapes, avoiding the need to evaluate shape features. According to these approaches, the amount of deformation which leads two shapes to overlap is described by an appropriate energy functional, and the similarity between two shapes is computed on the basis of the energy spent in the deformation. Elastic approaches have proven to be a robust technique for measuring the similarity between shapes. However, they cannot be directly coupled with traditional index structures: with these techniques, searching for a desired shape requires sequentially matching it against every shape stored in the database. In the following, we present a system for image retrieval by shape similarity. The system combines the elastic and the feature-based approaches into a unified framework. Accordingly, the search process is composed of two steps. In the first step, a set of candidate shapes is built; this set is composed of all those database shapes which share some signature feature with the query shape. In the second step, candidate shapes are processed through an elastic deformation technique. The user sketch is deformed to adjust itself to the shapes of the objects in the images. The match between the deformed sketch and the imaged objects, as well as the elastic deformation energy of the sketch, are used to measure the similarity between the query and the image. This paper is organized as follows: in Sect. 2, we expound which shape features are extracted and used as signature features for the shapes. In Sect. 3, the elastic approach to shape matching is introduced, expounding the model of shape similarity, the numerical solution and how similarity ranks of the matched images are obtained. In Sect. 4, considerations about the effectiveness of the approach are expounded with some examples of image retrieval.
2 Feature Based Shape Description
Since the early work of Attneave [6], many contributions have pointed out the central role played in the partitioning process by the points where a shape bends most sharply. This idea has been the starting point for most of the successive efforts in shape partitioning, and it is the starting point of this work too. In our method, it is assumed that the partitioning of a shape occurs at the points of minima and maxima of its curvature function. Mathematically, a planar, continuous, closed curve can be parameterized with respect to its arc-length t and expressed as:
c(t) = \{x(t),\, y(t)\}\,, \qquad t \in [0, 1]

The curvature \Gamma(t) of c(t) at the point \{x(t), y(t)\} can be expressed as:

\Gamma(t) = \frac{x_t(t)\, y_{tt}(t) - x_{tt}(t)\, y_t(t)}{\left( x_t^2(t) + y_t^2(t) \right)^{3/2}}

where x_t, y_t and x_{tt}, y_{tt} are the first and second derivatives of x and y with respect to t, respectively. We define P = \{p_n\} as the set of points p_n of the curve corresponding to the relative minima and maxima t_n of \Gamma(t), that is:

p_n = c(t_n)

Partitioning a shape into sub-parts is only part of the goal. We must also be able to detect those parts which represent similarity prints, apart from the details which appear in a particular instance. An effective description of a shape can be achieved by representing the visual features of each partition point. For each point p_n we consider two features: the curvature \gamma_n = \Gamma(t_n) and the direction \theta_n of the vector from p_n to b_n, where b_n is the median point of the segment p_{n-1}p_{n+1}. The description of a curve c(t) with n partition points is thus achieved by considering the set:

P = \{(\gamma_0, \theta_0),\, (\gamma_1, \theta_1),\, \ldots,\, (\gamma_{n-1}, \theta_{n-1})\}
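A minimal numerical sketch of this feature extraction follows, assuming the closed curve is given as uniformly spaced samples in arc-length; the derivative estimates and the extremum test are deliberately simple and are not the paper's implementation.

import numpy as np

def partition_features(x, y):
    """Signature features of a closed curve sampled uniformly in arc-length:
    for each relative extremum of the curvature, the pair (gamma_n, theta_n),
    theta_n being the direction from p_n to the midpoint of p_{n-1}p_{n+1}."""
    # derivatives w.r.t. the arc-length parameter (curve treated as periodic)
    xt, yt = np.gradient(x), np.gradient(y)
    xtt, ytt = np.gradient(xt), np.gradient(yt)
    curv = (xt * ytt - xtt * yt) / (xt**2 + yt**2) ** 1.5
    # indices of the relative minima and maxima of the curvature function
    m = len(curv)
    ext = [n for n in range(m)
           if (curv[n] - curv[n - 1]) * (curv[(n + 1) % m] - curv[n]) < 0]
    feats = []
    for k, n in enumerate(ext):
        pa = np.array([x[ext[k - 1]], y[ext[k - 1]]])  # previous partition point
        nk = ext[(k + 1) % len(ext)]                   # next partition point
        pc = np.array([x[nk], y[nk]])
        b = 0.5 * (pa + pc)                            # median point b_n
        theta = np.arctan2(b[1] - y[n], b[0] - x[n])   # direction p_n -> b_n
        feats.append((curv[n], theta))
    return feats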
Feature Matching

A measure of similarity between two curves c_1(t) and c_2(t) is obtained by comparing the two feature sets P_1 and P_2. Suppose that P_1 and P_2 are composed of n_1 and n_2 points, respectively. A measure of the distance D between P_1 and P_2 is computed as:

D = \sum_{i=0}^{n_1 - 1} d_i\,, \qquad d_i = \min_{j = 0, \ldots, n_2 - 1} \mathrm{dist}\big( (\gamma_i, \theta_i),\, (\gamma_j, \theta_j) \big)
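The distance D follows directly from the two feature sets. In the sketch below, dist is taken to be the plain Euclidean distance in the (\gamma, \theta) plane; this is one possible choice, since the text leaves dist unspecified.

import math

def feature_distance(P1, P2):
    """D = sum_i min_j dist((gamma_i, theta_i), (gamma_j, theta_j)).
    dist is chosen here as the Euclidean distance in (curvature,
    direction) space; weighting the two axes is an open design choice."""
    return sum(min(math.hypot(g1 - g2, t1 - t2) for (g2, t2) in P2)
               for (g1, t1) in P1)

Note that D is not symmetric: the sum runs over the features of the first (query) curve only, as in the definition above.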
3 The Elastic Approach to Shape Matching
Suppose we have a one-dimensional template, modeled by a second-order spline \tau = (\tau_x, \tau_y): \mathbb{R} \to \mathbb{R}^2 (that is, a piecewise first-degree polynomial function). We will always assume that the template is parameterized with respect to arc-length, and normalized so as to be of length 1. We have an image I: \mathbb{R}^2 \to [0, 1] (the luminance at every point is supposed normalized in [0, 1]) in which we search for a contour with a shape similar to that of \tau. To obtain a robust match even in the presence of deformations, we must allow the template to warp. If \theta = (\theta_x, \theta_y): \mathbb{R} \to \mathbb{R}^2 is the deformation, then the deformed template (also parameterized with respect to arc-length) is given by:

\tau(s) + \theta(s)
The template must warp taking into account two opposite requirements. First, it must follow as closely as possible the edges of the image. The match between the deformed template and the edge image I_E can be measured as:

\mathcal{M} = \int_0^1 I_E\big( \tau(s) + \theta(s) \big)\, ds

If we normalize I_E so that I_E \in [0, 1], then \mathcal{M} \in [0, 1]. A value \mathcal{M} = 1 means that the template lies entirely on image areas where the gradient is maximum (i.e., on image edges), while \mathcal{M} = 0 means that the template lies entirely in areas where the gradient is null. The second requirement to be taken into account is the deformation of the template. We measure an approximation of the elastic deformation energy of the template, given by:
\mathcal{E} = \mathcal{S} + \mathcal{B} = \alpha \int_0^1 \left[ \left( \frac{d\theta_x}{ds} \right)^{\!2} + \left( \frac{d\theta_y}{ds} \right)^{\!2} \right] ds \; + \; \beta \int_0^1 \left[ \left( \frac{d^2\theta_x}{ds^2} \right)^{\!2} + \left( \frac{d^2\theta_y}{ds^2} \right)^{\!2} \right] ds
The quantity \mathcal{S}, depending on the first derivative, is a rough measure of how the template \tau has been strained by the deformation \theta, while the quantity \mathcal{B}, depending on the second derivative, is an approximate measure of the energy spent to bend the template. Therefore, we assume \mathcal{S} and \mathcal{B} to be measures of the strain energy and the bend energy associated with the deformed template \tau + \theta with respect to the original template \tau, respectively. Allowing arbitrary deformations of the template results in a situation in which every template matches every image, and in a mathematically ill-posed problem.
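For a deformation sampled at N points, the strain and bend energies above can be approximated by finite differences. This is a sketch under the assumption of uniform arc-length spacing ds, not the paper's numerical scheme.

import numpy as np

def strain_bend(theta, ds, alpha=1.0, beta=1.0):
    """Finite-difference approximations of the strain energy S and the
    bend energy B of a sampled deformation theta of shape (N, 2)."""
    d1 = np.diff(theta, axis=0) / ds          # first derivative d(theta)/ds
    d2 = np.diff(theta, 2, axis=0) / ds**2    # second derivative
    S = alpha * np.sum(d1**2) * ds            # strain: integral of |theta'|^2
    B = beta * np.sum(d2**2) * ds             # bend: integral of |theta''|^2
    return S, B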
In order to discover the similarity between the original shape of the template and the shape of the edge areas in the image, we must set some constraints on the deformation [7]. Hence our goal is to maximize \mathcal{M} while minimizing \mathcal{E}. This can be achieved by minimizing the compound functional:
\mathcal{F} = \mathcal{E} - \mathcal{M} = \alpha \int_0^1 \left[ \left( \frac{d\theta_x}{ds} \right)^{\!2} + \left( \frac{d\theta_y}{ds} \right)^{\!2} \right] ds + \beta \int_0^1 \left[ \left( \frac{d^2\theta_x}{ds^2} \right)^{\!2} + \left( \frac{d^2\theta_y}{ds^2} \right)^{\!2} \right] ds - \int_0^1 I_E\big( \tau(s) + \theta(s) \big)\, ds
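The minimization of \mathcal{F} can be sketched as a gradient descent on the sampled deformation: the edge-image gradient pulls the template toward edges (ascending \mathcal{M}), while discrete Laplacian and biharmonic terms penalize strain and bending. This is an illustrative discretization, not the authors' numerical solution; the step size and weights are arbitrary.

import numpy as np

def elastic_match(tau, edge_img, alpha=0.05, beta=0.05, step=0.5, iters=300):
    """Gradient-descent sketch of minimizing F = S + B - M.
    tau: template samples of shape (N, 2) in image (x, y) coordinates.
    edge_img: edge/gradient-magnitude image normalized to [0, 1]."""
    theta = np.zeros_like(tau)                 # deformation samples
    gy, gx = np.gradient(edge_img)             # spatial gradient of I_E
    for _ in range(iters):
        pos = tau + theta                      # deformed template
        j = np.clip(pos[:, 0].round().astype(int), 0, edge_img.shape[1] - 1)
        i = np.clip(pos[:, 1].round().astype(int), 0, edge_img.shape[0] - 1)
        pull = np.stack([gx[i, j], gy[i, j]], axis=1)   # ascent on M
        # discrete Laplacian and biharmonic of theta (closed template assumed)
        lap = np.roll(theta, -1, 0) - 2 * theta + np.roll(theta, 1, 0)
        bih = np.roll(lap, -1, 0) - 2 * lap + np.roll(lap, 1, 0)
        theta += step * (pull + alpha * lap - beta * bih)
    return theta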
Template Matching

After a template has reached convergence over an image shape, we need to measure how similar the two are. Similarity is a fuzzy concept, and to measure it we need to take into account a number of quantities: the degree of overlapping \mathcal{M} between the deformed template and the gradient of the image, the strain energy \mathcal{S}, the bend energy \mathcal{B}, the number \mathcal{N} of zeroes of the curvature function associated with the original template, and the correlation \mathcal{C} between the curvature function associated with the original template and that associated with the deformed one. These five parameters (\mathcal{S}, \mathcal{B}, \mathcal{M}, \mathcal{N}, \mathcal{C}) are classified by a back-propagation neural network subjected to appropriate training.
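The paper does not detail the network's architecture. Purely as an illustrative stand-in, a one-hidden-layer network with already-trained weights would map the five parameters to a similarity score as follows:

import numpy as np

def similarity_score(S, B, M, N, C, W1, b1, W2, b2):
    """Forward pass of a small MLP scoring the five match parameters.
    W1 (H x 5), b1 (H), W2 (H) and b2 (scalar) are assumed to come from
    back-propagation training on labeled matches (training not shown)."""
    p = np.array([S, B, M, N, C], dtype=float)
    h = np.tanh(W1 @ p + b1)                          # hidden layer
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))       # score in (0, 1)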
4 Experimental Results
Based on the techniques expounded above, a prototype system has been developed for image retrieval by shape similarity. The system can be asked to retrieve images by drawing sketches of imaged objects on a graphic screen. In the images of the database, each interesting shape is manually extracted and a 128 × 128 edge image is built for each shape. The feature description is then extracted for each database shape, and the feature vectors are used to derive a feature index, which can take the form of any multidimensional point access method. Given a user-drawn query template, the corresponding feature vector is derived and used to access the feature index. The database shapes which match the query feature vector are selected and processed through the elastic matching approach, which outputs a similarity rating S_i. Retrieved images are sorted according to the values of S_i and visualized on the computer screen. As an example, Fig. 1(a) shows a sketch roughly representing a bust; Fig. 1(b) shows the six best-matched images.
Fig. 1. (a) User-drawn sketch. (b) Retrieved images according to the sketched template.

References
[1] S. Sclaroff, A. Pentland, "Object Recognition and Categorization Using Modal Matching", Proc. 2nd CAD-Based Vision Workshop, Champion, PA, Feb. 1994.
[2] W. Niblack et al., "The QBIC project: Querying images by content using color, texture and shape", Res. Report 9203, IBM Res. Div., Almaden Res. Center, February 1993.
[3] M. Swain and D. Ballard, "Color indexing", Int. Journal of Computer Vision, 7(1), 1991.
[4] A. Del Bimbo and P. Pala, "Visual Image Retrieval by Elastic Matching of User Sketches", to appear in IEEE Transactions on PAMI.
[5] K. Hirata and T. Kato, "Query by visual example: Content-based image retrieval", in Adv. in Database Technology EDBT '92, 3rd Intl. Conf. on Extending Database Technology, volume 580, Vienna, Austria, March 1992, Springer Verlag.
[6] F. Attneave, "Some Informational Aspects of Visual Perception", Psych. Review, vol. 61, pp. 183-193, 1954.
[7] A. Tihonov, "Regularization of incorrectly posed problems", Soviet Mathematical Doklady, 4:1624-1627, 1963.
Time-Varying Image Processing and Moving Object Recognition, 4 - V. Cappellini (Ed.) © 1997 Elsevier Science B.V. All rights reserved.
Synthesis of Virtual Views of Non-Lambertian Surfaces through Shading-Driven Interpolation and Stereo-Matched Contours

Federico Pedersini, Augusto Sarti and Stefano Tubaro

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza L. da Vinci 32, 20133 Milano, Italy
{pedersin, sarti, tubaro}@elet.polimi.it

We propose and test a technique for the synthesis of "virtual" views of a 3D scene. 3D reconstruction is performed from both stereo correspondences and shading, and uses a calibrated multicamera system. The data over which surface interpolation is performed are a set of 3D edges, computed through stereopsis, and a set of additional curvature-tuning points, placed where the reflectivity model can be reliably estimated. The 3D coordinates of the tuning points are those that minimize the MSE between real views and the corresponding synthetic views. Texture mapping of the reflection-corrected luminance function is then performed. By doing so, we simulate the migration of reflections due to the change of viewpoint, using an estimate of the non-Lambertian surface reflectivity. Synthesis is finally carried out through reprojection onto the "virtual" image plane. The technique has been tested on real images, producing realistic synthesized views.

1. Overview

The synthesis of arbitrary views from a set of images taken from different angles is a problem of particular interest in a number of applications, such as virtual object manipulation (3D catalogs, tele-surgery, etc.) and 3D television. Several strategies can be used for synthesizing a virtual view. In general, they can be divided into two categories: those that operate an "interpolation" between the available views, and those that first reconstruct the 3D object surface and then reproject it onto the image plane of the "virtual" camera. The work we present here is based on the second approach. In fact, we extract geometric (stereo correspondence) and radiometric (shading) information from the available views for 3D reconstruction, and then use the information on reflectivity and illumination for realistic rendering on the virtual camera. The complementary nature of stereo and shading has been discussed in several previous articles [3,6]. It is well known, in fact, that intensity correlation performs better on highly textured regions of the input images, whereas the accuracy of shading is higher on regions corresponding to more regular surfaces. The information we use for reconstruction comes from a calibrated multicamera (trinocular) system. Stereometric matching is performed between edges, and triangulation allows us to locate such edges in 3D space with rather good precision. This procedure provides us with an irregular "mesh" of 3D details, but does not provide any clue about the
local curvature of smooth surfaces, which is exactly the shape information that can be extracted from shading. Once the set of 3D edges and points is available, we can construct a rough approximation of the 3D object surface. Such a surface can then be used for a preliminary estimate of the parameters of surface reflectivity and illumination, together with a measure of their reliability. We then use the reliability mask to rule out regions of the surface over which the reflectivity model is not dependable and, keeping only the reliable portions of the object surface, proceed with a refinement of the radiometric parameters. Shading is taken into account by introducing a set of curvature-tuning points, scattered inside the reliability mask of the radiometric model. The 3D coordinates of all tuning points are computed by minimizing the MSE between the available real views and the corresponding synthetic views. A better approximation of the object surface is then possible by using both matched edges and tuning points. The last step of our technique for the synthesis of a virtual view is the rendering process, which takes into account the fact that reflections may depend on the position of the viewer. In fact, we reproject onto the new image plane the estimated object surface over which texture mapping of the reflection-corrected luminance function has been performed. Texture correction, which uses an estimate of a non-Lambertian reflectivity model, is done in such a way as to simulate the migration of reflections due to the change of viewpoint. In other words, we modify the image texture through a compensation of the non-Lambertian component, followed by a simulation of the reflections that would be visible from the virtual viewpoint.

2. Recovery of 3D information from stereo

Stereo matching techniques compute, through geometric triangulation, the 3D coordinates of details that originate corresponding edges on two views. Since binocular vision does not guarantee a unique determination of correspondences in a complex scene, we use a set of three cameras, as shown in Figure 1. Trinocular vision, in fact, allows us to select the best pair of cameras for a specific correspondence between elements of two images and to validate this correspondence through a check on the third view [1].
Figure 1. Trinocular camera system and virtual camera.
Preliminary calibration is performed in order to estimate the intrinsic (focal length, optical distortion) and extrinsic (relative position of the cameras) parameters of the trinocular system [4,7]. The 3D edge positioning requires an accurate edge extraction from the three images. This task is performed by a modification of the Canny edge detector [5]. In order to match corresponding edges, an improved version of a technique proposed by Ayache [1,2] has been developed [7]. Once the correspondences have been found, the construction of the depth map can easily be performed through geometric triangulation. The final result of this system is an irregular set of 3D edges, whose accuracy critically depends on the quality of the calibration procedure and whose density depends on the degree of smoothness of the surfaces in the scene. Highly textured objects, in fact, will produce denser 3D meshes of edges.
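As a generic sketch of the triangulation step (not the authors' exact implementation), given two calibrated 3×4 projection matrices and a matched edge point in each view, the 3D position follows from a linear least-squares (DLT) solution:

import numpy as np

def triangulate(P1, P2, u1, u2):
    """Linear (DLT) triangulation of a point seen at pixel u1 = (x1, y1)
    in the camera with projection matrix P1 and at u2 in camera P2.
    A validated third view could contribute two more rows of A."""
    A = np.vstack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)        # homogeneous least squares: min |AX|
    X = Vt[-1]
    return X[:3] / X[3]                # Euclidean 3D coordinates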
3. Shading-driven interpolation

The set of 3D edges determined through stereo matching is often too sparse for a reliable surface reconstruction. This happens especially when dealing with surfaces that are not very textured. For this reason, wherever a dependable reflectivity model is available and the 3D edges are too sparse, we exploit the radiometric properties of the surface and extract additional information from shading. The surface reflectivity of real images, besides depending on the direction of illumination, changes according to the viewing direction. Lambertian surfaces, in fact, are very rare, and specular reflections are practically always present. For this reason we use a non-Lambertian radiometric surface description based on a model developed by Torrance and Sparrow [8]. The reflectivity function we use has the following form:

R = A \cos\vartheta_i + K\, \frac{e^{-\alpha^2 / 2\sigma^2}}{\cos\vartheta_r} + D \qquad (1)
where \vartheta_i is the angle between the surface normal and the incident light, \vartheta_r is the angle between the surface normal and the viewing direction, and \alpha is the angle between the surface normal and the plane corresponding to the incident light and viewing directions. The first term represents the Lambertian component, the second term is the specular reflection, and the third is an extra constant that we included to take diffused light into account. The above radiometric model allows us to describe realistic surfaces and illumination conditions; in fact, it is suitable for modeling the diffuse and specular reflections produced by one dominant source of light combined with diffused light. Such a model is fully described by A, K, D, \sigma and the direction of the dominant light. Surface interpolation is performed over the previously determined 3D edges and an additional set of curvature-tuning points. The tuning points are automatically allocated in two steps: first a reliability mask is built for the reflectance model, and then the points are scattered inside the reliable areas in such a way as not to lie close to matched edges. The 3D coordinates of the added points are computed through the minimization of a cost function. Since the ultimate goal is to synthesize virtual views, the cost function is chosen to be the MSE between the available real views and the synthetic images in corresponding positions. The synthesis is carried out by using an estimate of a viewer-dependent reflectivity model.
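Eq. (1), and the specular correction it enables in the rendering step of Sect. 4, can be sketched as follows. The angles are per-pixel maps, the parameter names follow the text, and the function names themselves are illustrative, not the authors' code.

import numpy as np

def reflectivity(theta_i, theta_r, alpha, A, K, D, sigma):
    """Eq. (1): Lambertian term + specular lobe + diffused-light constant."""
    return (A * np.cos(theta_i)
            + K * np.exp(-alpha**2 / (2 * sigma**2)) / np.cos(theta_r)
            + D)

def specular(theta_r, alpha, K, sigma):
    """Specular component only, for a given viewing geometry."""
    return K * np.exp(-alpha**2 / (2 * sigma**2)) / np.cos(theta_r)

def virtual_luminance(L, geo_real, geo_virtual, K, sigma):
    """Simulate reflection migration: remove the specular component seen
    by the real camera, add the one predicted for the virtual viewpoint.
    geo_* are (theta_r, alpha) per-pixel maps for each viewing geometry."""
    return L - specular(*geo_real, K, sigma) + specular(*geo_virtual, K, sigma)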
Notice that, since the reflectivity model is non-Lambertian, the minimization can be performed on all available views, as each of them provides some extra information.

4. Rendering

The synthesis of the virtual view is performed by using both shape information and the radiometric properties of the surface. Since the surface reflectivity is, in general, non-Lambertian, specular reflections can occur; this means that the surface luminance on the virtual view depends on the viewing direction. In fact, observing a non-opaque curved surface from a moving viewpoint, we see that all reflections move accordingly. In order to take this effect into account, we first eliminate the specular component of eq. (1) from the surface luminance map, then we add the specular component corresponding to the virtual viewing direction. Such an operation is possible because all parameters of the radiometric model have already been estimated before the shading-driven interpolation. Texture mapping of the reflection-corrected luminance function is finally performed over the reconstructed 3D surface, and its reprojection onto the image plane of the virtual camera produces the desired view.

5. Testing and results

Tests have been performed on several real images. In Fig. 2a, one of the three views of the face of a dummy is shown. The corresponding reliability mask of the adopted non-Lambertian reflectance model is shown in Fig. 2b. The virtual viewpoint of Fig. 2c has been intentionally chosen quite far outside the triangle of the three cameras (see Fig. 1).
Figure 2. Reconstruction of the face of a dummy. Original image (a), reliability mask of the reflectivity model (b), virtual view (c).
In order to measure the accuracy of the 3D reconstruction algorithm, we also used a triplet of images of a ring-shaped portion of a circularly symmetric, smooth Styrofoam surface of known shape. The shading-driven interpolation of all matched edges produces a surface whose horizontal and vertical sections are shown in Figs. 3a and 3b, respectively. As can be seen, the curvature of the reconstructed surface (solid line) approximates the actual curvature (dotted line) acceptably well. In general, however, the accuracy depends on how well the reflectivity model describes both the surface and the illumination conditions. Since the minimization is performed on the MSE between real and synthetic views, it is reasonable to expect that the synthesis of the virtual view will give good results even when the accuracy of the reconstruction is not very high. Fig. 4 shows the actual view of the test shape and two virtual views.

6. Conclusions

We have presented a technique for the synthesis of virtual views from stereo correspondences and shading, using the output of a calibrated trinocular vision system. The key point of the method is the way texture mapping is performed before reprojection: a non-Lambertian reflectivity model of the surface is used for computing the luminance that would be seen from the virtual viewpoint, and the resulting luminance map is the texture we map onto the surface before reprojection. The method has been successfully tested on real images. Further improvements are currently being made to the above technique; in particular, preliminary piecewise-smooth segmentation of surfaces and a "smart" placement of the curvature-tuning points are under study.

REFERENCES
1. N. Ayache, "Artificial Vision for Mobile Robots", MIT Press, 1990.
2. N. Ayache, F. Lustman, "Trinocular Stereo Vision for Robotics", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 1, pp. 73-85, Jan. 1991.
3. A. Blake, A. Zisserman, and G. Knowles, "Surface descriptions from stereo and shading", Image Vision Computation, Vol. 3, No. 4, pp. 183-191, 1985.
4. S. Brofferio, F. Pedersini, S. Tubaro, "Calibration of Trinocular System for 3D Measurements", Proc. 4th European Workshop on Three-Dimensional Television (COST 230), Rome, Italy, October 20-21, 1993, pp. 89-96.
5. J. Canny, "A computational approach to edge detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 6, pp. 679-698, Nov. 1986.
6. Y.G. Leclerc, A.F. Bobick, "The direct computation of height from shading", Conf. on Computer Vision and Pattern Recognition, Lahaina, Maui, Hawaii, June 1991.
7. F. Pedersini, S. Tubaro, F. Rocca, "Camera Calibration and Error Analysis: An Application to Binocular and Trinocular Stereoscopic Systems", Proc. Int. Workshop on Time-Varying Image Processing and Moving Object Recognition, Florence, Italy, Jun. 10-11, 1993.
8. K.E. Torrance, E.M. Sparrow, "Theory for Off-Specular Reflection from Roughened Surfaces for Ray Reflection", Journal of the Optical Society of America, Vol. 65, pp. 531-536, 1975.
Figure 3. Section profile of a Styrofoam shape (dotted line) and its reconstruction (solid line). Horizontal section (a), vertical section (b).
Figure 4. Original view (a) and two virtual views (b, c) of a Styrofoam shape.
AUTHOR INDEX

ABRARDO A. 313
ALLEN R. 213
ALPARONE L. 256
ANDROUTSOS D. 19
APPENZELLER G. 69
ATTOLICO G. 36, 42
ATZORI L. 115
BALLERINI L. 205
BARONTI S. 256
BARZANTI A. 256
BASSINO P. 51
BENVENUTI M. 165
BIEMOND J. 9
BIFULCO P. 213
BIGI F. 105
BOGAERT J. 159
BOJKOVIC Z. 139
BONIFACIO S. 99
BRACALE M. 213, 219
BRANCA A. 36
BRUTON L.T. 3
BUSCEMI M. 93
CAPPELLINI V. 313
CASINI A. 256
CASTELLANO G. 42
CECCARELLI M. 133
CESARELLI M. 213
CHENG H.L.M. 3
CHIMIENTI A. 262
CHIUDERI A. 147
COLMENAREZ A. 79
CONESE C. 165
COPPINI G. 205
CORTELAZZO G.M. 244
DEL BIMBO A. 256, 319
DELLEPIANE S. 197
DENASI S. 289
DE POLO A. 309
DI CHIARA C. 165
DI GREGORIO M. 115
DILLMANN R. 69
DI SALLE F. 219
DISTANTE A. 36, 42
DI VAIA M. 153
DI VECCHIA A. 165
D'ORAZIO T. 42
EYVAZKHANI M. 121
FAVALLI L. 250
FENU R. 93
FERRARI R. 309
FERRI M. 184
FORMISANO E. 219
GALATI G. 184
GAMBA P. 171, 250
GARIBOTTO G. 51
GIACOMELLI G. 205
GIUNTA G. 57
GIUSTO D.D. 93, 115
GUIDUCCI A. 295
HANJALIC A. 133
HERODOTOU N. 227
HOTTER M. 268
HUANG T.S. 79
ILIC M. 51
ILLGNER K. 238
IMPAGNATIELLO F. 176
IMPENS I. 159
KAMIKURA K. 127
KOMATSU T. 63
KOSMALA A. 233
KOTERA H. 127
LAGENDIJK R.L. 9, 133
LEONCINO F. 153
LIGGI G. 93
LOPEZ R. 79
LORIA M. 301
LOTTI F. 256
LUCCHESE L. 244
MACHI A. 301
MARAZZI A. 171, 250
MARSI S. 99
MARTI F. 190
MASCIA U. 57
MASCIANGELO S. 51
MECOCCI A. 171, 250, 313
MESTER R. 268
MIIKE H. 274
MITRA S.K. 27
MONTEVERDE S. 197
MUGGLETON J. 213
MULLER F. 238, 268
NADENAU M.J. 27
NAKAZAWA Y. 63
NALDI M. 184, 190
NOMURA A. 274
PALA P. 319
PEDERSINI F. 325
PELLEGRINI P.F. 153
PEPINO A. 219
PIAZZA E. 153, 190
PICCO R. 262
PLATANIOTIS K.N. 19
PROSPERI A. 313
QUAGLIA G. 289, 295
REGAZZONI C.S. 283
RELJIN B. 139
RIGOLL G. 233
RIZZATO M. 244
RUGGERONE R. 197
SAITO T. 63
SAMCOVIC A. 139
SARTI A. 325
SATO M. 87
SAULINO C. 219
SCHUSTER M. 233
SESTI E. 309
SHIMAMURA K. 127
SICURANZA G.L. 99
STELLA E. 36, 42
TACCONI G. 197
TANIMOTO M. 87
TESCHIONI A. 283
TESEI A. 283
TORRE A. 176
TRAVERSO D. 197
TUBARO S. 325
VALLI G. 205
VAN ROOSMALEN P.M.B. 9
VENETSANOPOULOS A.N. 19, 227
VERNAZZA G. 283
VINAYAGAMOORTHY S. 19
VIVALDA M. 262
VON ESSEN A. 69
WATANABE H. 127
WECKESSER P. 69
ZHANG M. 159