Stereo Scene Flow for 3D Motion Analysis
Dr. Andreas Wedel Group Research Daimler AG HPC 050–G023 Sindelfingen 71059 Germany
[email protected]
Prof. Daniel Cremers Department of Computer Science Technical University of Munich Boltzmannstraße 3 Garching 85748 Germany
[email protected]
ISBN 978-0-85729-964-2    e-ISBN 978-0-85729-965-9
DOI 10.1007/978-0-85729-965-9
Springer London Dordrecht Heidelberg New York
Library of Congress Control Number: 2011935628
© Springer-Verlag London Limited 2011
Preface
The estimation of geometry and motion of the world around us from images is at the heart of Computer Vision. The body of work described in this book arose in the context of video-based analysis of the scene in front of a vehicle from two front-facing cameras located near the rear view mirror. The question examined, namely where things are in the world and how they move over time, is an essential prerequisite for a higher-level analysis of the observed environment and for subsequent driver assistance.

At the origin of this work is the combination of a strong interest in solving the real-world challenges of camera-based driver assistance and a scientific background in energy minimization methods. Yet, the methods we describe for estimating highly accurate optical flow and scene flow are a central prerequisite in other domains of computer vision where accurate and dense point correspondence between images or between geometric structures observed in stereo videos is of importance.

Step by step we introduce variational methods which allow us to enhance the image data acquired from two cameras by spatially dense information on the geometric structure and 3D motion of the observed structures. In particular, we introduce variational approaches to optic flow estimation and present a variety of techniques which gave rise to the world's most accurate optic flow method. We introduce a variational approach to estimate scene flow, i.e. the motion of structure in 3D. We discuss metrics for evaluating the accuracy of scene flow estimates. We will also show extensions of scene flow, including flow-based segmentation and the tracking of 3D motion over multiple frames. The latter employs Kalman filters for every pixel of an image assuming linear object motion, which results in a stable and dense 3D motion vector field. The book is written for both novices and experts, covering both basic concepts such as variational methods and optic flow estimation, and more advanced concepts such as adaptive regularization and scene flow analysis.

Much of the work described in this book was developed during the Ph.D. thesis of the first author, both at the University of Bonn and at Daimler Research, Böblingen. Many of these results would not have been possible without the enthusiastic support of a number of researchers. We are particularly indebted to Uwe Franke, Clemens Rabe, and Stefan Gehrig for their work on 6D vision and disparity estimation, to Thomas Pock in the context of efficient algorithms for optic flow estimation, to Thomas Brox in the parts on variational scene flow estimation, and to Tobi Vaudrey and Reinhard Klette for their research support on residual images and segmentation. We are grateful to our collaborators for their support.

With lane departure warning systems and traffic sign recognition, camera-based driver assistance is gradually becoming a reality. Latest research deals with intelligent systems such as autonomous evasive maneuvers and emergency situation takeover assistance. We hope that this book will help to lay the foundations for higher-level traffic scene understanding, object motion detection, and the development of advanced driver assistance.

Böblingen and Munich, Germany
Andreas Wedel Daniel Cremers
Contents

1 Machine Vision Systems
2 Optical Flow Estimation
   2.1 Optical Flow and Optical Aperture
   2.2 Feature-Based Optical Flow Approaches
      2.2.1 Census Based Optical Flow
      2.2.2 The Optical Flow Constraint
      2.2.3 Lucas–Kanade Method
   2.3 Variational Methods
      2.3.1 Total Variation Optical Flow
      2.3.2 Quadratic Relaxation
      2.3.3 Large Displacement Flow: Novel Algorithmic Approaches
      2.3.4 Other Optical Flow Approaches
   2.4 The Flow Refinement Framework
      2.4.1 Data Term Optimization
      2.4.2 Smoothness Term Evaluation
      2.4.3 Implementation Details
3 Residual Images and Optical Flow Results
   3.1 Increasing Robustness to Illumination Changes
   3.2 Quantitative Evaluation of the Refinement Optical Flow
      3.2.1 Performance
      3.2.2 Smoothness Filters
      3.2.3 Accuracy
   3.3 Results for Traffic Scenes
   3.4 Conclusion
4 Scene Flow
   4.1 Visual Kinesthesia
      4.1.1 Related Work
      4.1.2 A Decoupled Approach for Scene Flow
   4.2 Formulation and Solving of the Constraint Equations
      4.2.1 Stereo Computation
      4.2.2 Scene Flow Motion Constraints
      4.2.3 Solving the Scene Flow Equations
      4.2.4 Evaluation with Different Stereo Inputs
   4.3 From Image Scene Flow to 3D World Scene Flow
5 Motion Metrics for Scene Flow
   5.1 Ground Truth vs. Reality
   5.2 Derivation of a Pixel-Wise Accuracy Measure
      5.2.1 A Quality Measure for the Disparity
      5.2.2 A Quality Measure for the Scene Flow
      5.2.3 Estimating Scene Flow Standard Deviations
   5.3 Residual Motion Likelihood
   5.4 Speed Likelihood
6 Extensions of Scene Flow
   6.1 Flow Cut—Moving Object Segmentation
      6.1.1 Segmentation Algorithm
      6.1.2 Deriving the Motion Likelihoods
      6.1.3 Experimental Results and Discussion
   6.2 Kalman Filters for Scene Flow Vectors
      6.2.1 Filtered Flow and Stereo: 6D-Vision
      6.2.2 Filtered Dense Optical Flow and Stereo: Dense-6D
      6.2.3 Filtered Variational Scene Flow: Variational-6D
      6.2.4 Evaluation with Ground Truth Information
      6.2.5 Real-World Results
7 Conclusion and Outlook
8 Appendix: Data Terms and Quadratic Optimization
   8.1 Optical Flow Constraint Data Term
   8.2 Adaptive Fundamental Matrix Constraint
   8.3 Quadratic Optimization via Thresholding
      8.3.1 Karush–Kuhn–Tucker (KKT) Conditions
      8.3.2 Single Data Term
      8.3.3 Two Data Terms
9 Appendix: Scene Flow Implementation Using Euler–Lagrange Equations
   9.1 Minimization of the Scene Flow Energy
   9.2 Implementation of Scene Flow
Glossary
References
Index
List of Notations

R            Real numbers
Ω            Image domain
Ψε           Differentiable approximation of the absolute function, |x| ≈ Ψε(x) = √(x² + ε²)
F            Fundamental matrix
∇I           Spatial gradient of I: ∇I(x, y, t) = (Ix, Iy)⊤
Ix, Iy, It   Partial derivatives with respect to x, y and t
N            Local neighborhood of a pixel
N4           4-connected neighborhood
L            Pixel labelling
Chapter 1
Machine Vision Systems
Everything used to measure time really measures space. J. Deshusses
Accurate, precise and real-time capable estimation of three-dimensional motion vector fields remains one of the key tasks in computer vision. Different variants of this problem arise, inter alia, in the estimation of ego motion [4], object motion [8], human motion [77], and motion segmentation [106]. Knowledge of the surrounding motion field is a key enabler for a wide range of applications such as driver assistance systems and modern surveillance systems. Especially in security-relevant applications, robustness, accuracy, and real-time capability are of utmost importance.

Estimating this three-dimensional motion vector field from stereo image sequences has drawn the attention of many researchers. Due to the importance of this problem, numerous approaches to image-based motion field estimation have been proposed in the last three decades. Most of them can be classified into the following main strategies:

• model-based approaches,
• sparse feature tracking methods using multiple image frames,
• dense scene flow computation from two consecutive frames.
The estimation of motion vectors involves both the reconstruction of the three-dimensional scene via stereo matching and the solving of a point correspondence problem between two or more consecutive images. Both problems are classical ill-posed problems¹ in the sense that merely imposing matching of similar intensities will typically not give rise to a unique solution. The three aforementioned strategies choose different ways to overcome this ill-posedness.

In model-based approaches such as [77] and [8], parameterized models of objects or humans are used to constrain the solution space and overcome the ill-posedness of the problem. However, the absence of appropriate models for generic applications disqualifies model-based approaches in a multitude of situations. Many researchers therefore circumvent specific object models and employ regularization techniques for feature tracking and scene flow approaches in order to formulate the motion estimation problem in a well-posed way. This regularization is either formulated in the time domain for the tracking of features, as done in [54] or [74], or in the spatial domain, imposing smoothness of the motion field between two consecutive frames as in [103] and [108]. The latter is known as variational scene flow estimation from stereo sequences.

Algorithmically, variational scene flow computation methods build on the seminal optical flow algorithm of Horn and Schunck [42]. In what is often considered the first variational method in computer vision, Horn and Schunck suggested computing the flow field between two consecutive images of a video as the minimizer of an energy functional which integrates a brightness constancy assumption with a smoothness assumption on the flow field. This framework has been improved in [60] to cope with flow discontinuities and outliers and in [13] to cope with large flow vectors. In recent years, several real-time optical flow methods have been proposed; see for example [16] and [119]. In Chaps. 2 and 3 we review classical optic flow estimation and discuss a series of improvements [91, 102, 105–107, 110], including median filtering of flow vectors, decomposition of the input images, and treating optical flow estimation as an iterative refinement of a flow vector field, together with implementation details.

We continue in Chap. 4 by introducing scene flow estimation as an extension of the optical flow estimation techniques. Joint motion and disparity estimation for the scene flow computation was introduced in [69]. In [108] the motion and disparity estimation steps were decoupled in order to achieve real-time capability without losing accuracy. Subsequent publications have focused on improving the accuracy, the formulation of uncertainties, and establishing motion metrics for scene flow [105, 109, 112]; these topics are treated in Chap. 5. Implementation details on the scene flow algorithm are found in the Appendices of this book.

We additionally include Chap. 6 on recent developments in the research field of scene flow estimation described in [75, 111]. These include scene flow segmentation and the use of Kalman filters for scene flow estimation. The latter approach combines the aforementioned strategies of feature tracking over multiple frames and scene flow computation from two consecutive frames. This allows a dense and robust reconstruction of the three-dimensional motion field of the depicted scene.

¹ Following Hadamard [37], a mathematical model is called well-posed if there exists a solution, if the solution is unique, and if it continuously depends upon the data. Otherwise it is called ill-posed.
Chapter 2
Optical Flow Estimation
Abstract In this chapter we review the estimation of the two-dimensional apparent motion field between two consecutive images of an image sequence. This apparent motion field is referred to as the optical flow field, a two-dimensional vector field on the image plane. Because it is nearly impossible to cover the vast number of approaches in the literature, we focus in this chapter on energy minimization approaches which estimate a dense flow field. The term dense refers to the fact that a flow vector is assigned to every (non-occluded) image pixel. Most dense approaches are based on the variational formulation of the optical flow problem, first suggested by Horn and Schunck. Depending on the application, density might be one important property besides accuracy and robustness. In many cases computational speed and real-time capability are crucial issues. In this chapter we therefore discuss the latest progress in accuracy, robustness, and real-time capability of dense optical flow algorithms.
Space is a still of time, while time is space in motion. Christopher R. Hallpike
2.1 Optical Flow and Optical Aperture

This chapter is about optical flow and its estimation from image sequences. Before we go into details on optical flow estimation, we review the definition of optical flow and discuss some of the basic challenges of optical flow estimation.
Fig. 2.1 Color coding of flow vectors: Direction is coded by hue, length by saturation. The example on the right shows the expanding flow field of a forward motion. Flow vectors above 20 px are saturated and appear in darker colors
Whenever a camera records a scene over time, the resulting image sequence can be considered as a function I(x, y, t) of the gray value at image pixel position x = (x, y) and time t. In this book we will limit ourselves to gray value sequences. Color images therefore have to be transformed accordingly, depending on the color scheme used. Furthermore, some approaches generalize easily to color images.

If the camera, or an object, moves within the scene, this motion results in a time-dependent displacement of the gray values in the image sequence. The resulting two-dimensional apparent motion field in the image domain is called the optical flow field. Figure 2.1 shows the color scheme used to display such a flow vector field in this book. An example for a scene from the Middlebury optical flow benchmark with overlaid displacement vectors can be seen in Fig. 2.2.

The most common assumption used in optical flow estimation is the brightness constancy assumption. It states that the gray value of corresponding pixels in the two consecutive frames should be the same. Unfortunately, not every object motion yields a change in gray values and not every change in gray values is generated by body motion. Counterexamples to this brightness constancy assumption may arise through changes in illumination or upon observation of transparent or non-Lambertian material. Another source of gray value changes is the inherent noise of the camera sensor. Especially in bad illumination conditions (e.g. at night time), the number of photons that reach an image pixel may vary over time. In such cases brightness constancy is violated and thus the respective motion estimates will degrade.

Another source of ambiguity arises from the so-called aperture problem. The aperture problem arises as a consequence of motion ambiguity when an object is viewed through an aperture, as demonstrated in Fig. 2.3. If an untextured object moves within the image, the motion within the object area cannot be recovered
Fig. 2.2 Optical flow for the mini cooper scene of the Middlebury optical flow benchmark. The displacement of the image pixels is shown as displacement vectors on top of the flow color scheme defined in Fig. 2.1
Fig. 2.3 Illustration of the aperture problem as a consequence of motion ambiguity. If an untextured object (square) moves within the image and is viewed through an aperture (circle), motion can only be recovered in the direction perpendicular to an edge. Two-dimensional motion estimation (right) is only possible with structural information, such as edges, in linear independent directions
without additional information. Even at object edges, where the gray value of the background differs from the gray value of the object (or object edge) itself, motion can only be recovered in one dimension without additional information. The two-dimensional motion can only be recovered where more information is available, for example at object corners or where texture is present. As a result of the aperture problem, the optical flow problem, like many other inverse problems, is not well-posed in the sense of Hadamard. Thus, the optical flow problem needs to be re-formulated for numerical treatment. Typically this involves additional assumptions, such as smoothness of the optical flow field, known as regularization.

Depending on the type of regularization, optical flow algorithms are often categorized into two general classes: feature-based approaches and variational approaches. Feature-based approaches compute the optical flow displacement for a pixel and its neighborhood independently of the optical flow solutions obtained at the other pixels in the image. Section 2.2 covers feature-based approaches. Variational approaches take into account the optical flow solutions of neighboring pixels and impose smoothness assumptions on the optical flow field. Variational approaches are covered in Sect. 2.3. Variants of variational optical flow currently achieve the best results on standard evaluation benchmarks. Section 2.4 then presents the novel Flow Refinement Framework, based on the iterative improvement of a prior optical flow vector field, as a third class of optical flow algorithms.
2.2 Feature-Based Optical Flow Approaches

Feature-based optical flow algorithms can again be categorized into pixel-accurate and sub-pixel accurate approaches. Pixel-accurate optical flow approaches assign pixels of the input image to pixels of the output image. This is usually done by evaluating a pixel matching score based on the gray values of a pixel's neighborhood. Algorithms of this class can be understood as solving a combinatorial task, because the number of possible flow vectors or displacement vectors for a certain pixel is bounded by the image size. Due to the combinatorial structure, pixel-accurate optical flow algorithms can be parallelized, yielding real-time efficiency on dedicated hardware. They have the nice property of robustness due to the restricted solution space (only discrete integer pixel positions are considered and outliers are less likely), but at the same time suffer from accuracy limitations, i.e. the displacements are only pixel-discrete. In Sect. 2.2.1 a census-based optical flow algorithm is reviewed as a representative of this class.

Section 2.2.2 introduces the optical flow constraint, which is employed by most sub-pixel accurate optical flow algorithms. It assumes that the gray value image sequence has a continuous domain. The change of a pixel's gray value between two time instances can then be computed by evaluating the image gradient. The early work of Lucas and Kanade [54], which employs the linearized version of the optical flow constraint, is reviewed in Sect. 2.2.3.
2.2.1 Census Based Optical Flow

The Census transformation is a form of non-parametric local mapping from intensity values of pixels to a signature bit string. It relies on the relative ordering of intensities in a local window surrounding a pixel and not on the intensity values themselves. Thus, it captures the image structure [118]. Applied to a local 3 × 3 neighborhood of an image pixel, the Census transform compares the center pixel x = (x, y) to all other pixels x′ inside the patch, resulting in

    θ(I, x, x′) = 0   if I(x) − I(x′) > c,
                  1   if |I(x) − I(x′)| ≤ c,
                  2   if I(x) − I(x′) < −c,

with the gray value intensity I(x) at pixel position x. In [48, 85], the threshold c is chosen between 12 and 16 for images with gray values between 0 and 255. The Census digit θ measures the similarity between the gray values at pixel positions x and x′. All Census digits of the image patch are unrolled clockwise, building the signature bit string or signature vector:

    gray values          Census digits         signature vector

    124  74  32          2  1  0
    124  64  18    →     2  X  0       →       21002222
    157 116  84          2  2  2
The larger the threshold value c is, the more insensitive the result becomes to non-additive illumination changes. On the other hand, a large threshold value yields less unique signature bit strings. The signature vector is then used to search for corresponding pixel pairs. To this end, all signature vectors of the first image are stored in a hash table together with their pixel position. Then, all signature vectors of the second image are compared to the hash table entries. This gives a list of putative correspondences (hypotheses) for each signature. The list is empty if a signature in the second image does not exist in the first image. In the event of multiple entries, the list can be reduced by applying photometric and geometric constraints; if there are still multiple entries, the shortest displacement vector wins (see [85] for further details).

Thanks to the indexing scheme, arbitrarily large displacements are allowed. Even if an image patch moves from the top left image corner to the bottom right corner it can potentially be correctly matched. The method has been successfully implemented on parallel hardware, allowing for real-time computation. A disadvantage of the Census based optical flow method is that it is only pixel-accurate; the results suffer from discretization artifacts. Furthermore, low contrast information (gray value differences below the threshold c) is ignored.
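For illustration, the following Python sketch reproduces the 3 × 3 example above. The clockwise unrolling order (starting at the top-left neighbor) and the threshold value c = 13 are assumptions chosen to match the example; this is only a sketch, not the implementation of [48, 85].

```python
import numpy as np

def census_signature(patch, c=13):
    """Census signature of a 3x3 gray-value patch (center pixel vs. neighbors).

    Returns the digits 0/1/2 of the eight neighbors, unrolled clockwise
    starting at the top-left pixel, as a string.
    """
    center = int(patch[1, 1])
    # clockwise unrolling order of the eight neighbors (row, col)
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    digits = []
    for r, col in order:
        diff = center - int(patch[r, col])
        if diff > c:
            digits.append('0')
        elif diff < -c:
            digits.append('2')
        else:
            digits.append('1')
    return ''.join(digits)

patch = np.array([[124,  74,  32],
                  [124,  64,  18],
                  [157, 116,  84]], dtype=np.uint8)
print(census_signature(patch))  # -> "21002222", matching the example above
```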
2.2.2 The Optical Flow Constraint

Most sub-pixel accurate solutions to optical flow estimation are based on the (linearized) optical flow constraint (see the optical flow evaluation in [6]). The optical flow constraint states that the gray value of a moving pixel stays constant over time, i.e. I(x, y, t) = I(x + u, y + v, t + 1), where u = (u, v) is the optical flow vector of a pixel x = (x, y) from time t to time t + 1. The linearized version of the image function (using the first-order Taylor approximation) reads

    I(x, y, t) ≈ I(x, y, t + 1) + ∇I(x, y, t + 1) (u, v)⊤,
    0 = I(x, y, t + 1) − I(x, y, t) + ∇I(x, y, t + 1) (u, v)⊤,      (2.1)

where the temporal difference I(x, y, t + 1) − I(x, y, t) is abbreviated It(x, y, t + 1).
From now on the partial derivatives of the image function are denoted as It, Ix, and Iy, and the inherent dependency on the image position (x, y) is dropped, yielding the optical flow constraint equation OFC(u, v):

    0 = It + Ix u + Iy v.      (2.2)
The optical flow constraint has one inherent problem: it yields only one constraint to solve for two variables. It is well known that such an under-determined equation system yields an infinite number of solutions. For every fixed u a valid v can be found fulfilling the constraint. Again, the aperture problem previously discussed in Sect. 2.1 becomes visible.
Fig. 2.4 The original image corresponds to level 0 of the image pyramid. The upper pyramid levels are down-sampled versions of the image with lower resolution; hence, high frequency image details are filtered out (left). For pyramid approaches, the solution on an upper pyramid level is propagated onto the lower levels and refined using the high frequency details (right). Figure courtesy of Thomas Brox
2.2.3 Lucas–Kanade Method

A common workaround to resolve this ambiguity is to assume a constant (or affine) optical flow field in a small neighborhood of a pixel x. Such a neighborhood N typically consists of n × n pixels with n smaller than 15. The optical flow constraint is then evaluated with respect to all pixels within this neighborhood window N. This results in an over-determined equation system because there are more equations than unknowns; in other words, in a general setting there exists no exact solution. Minimizing the sum of quadratic deviations yields the least-squares approach, first proposed by Lucas and Kanade in 1981 [54]:

    min_{u,v} Σ_{x′ ∈ N(x)} ( It(x′) + Ix(x′) u + Iy(x′) v )².      (2.3)
Extensions of this approach have been proposed. These include replacing the sum of squared errors with an absolute error measure [5] or using Kalman filters to further gain robustness over time [73]. The resulting flow vector is sub-pixel accurate. But due to the Taylor approximation in the optical flow constraint, the method is only valid for small displacement vectors where the effect of higher-order terms in (2.1) is negligible [5]. Essentially, the Lucas–Kanade method implements a gradient descent, minimizing the deviation between the gray value of an image pixel and the gray value of its corresponding pixel (the pixel that the flow vector points to). Due to the first-order Taylor approximation, repeated evaluations of (2.3) and an embedding within a multilevel approach might be necessary to obtain the desired result. A common approach is the use of image pyramids, solving for low-frequency structures in low-resolution images first and refining the search on images of higher resolution (see Fig. 2.4 and [11, 60]).
One way to measure the quality of the resulting flow vector is to use the sum of the gray value differences within the neighborhood N. If this value is larger than a predefined threshold, it is likely that the gradient descent was unsuccessful and either more iterations are needed or the descent process got stuck in a local minimum. While the maximum successfully matched track length depends on the image content, generally speaking, flow vectors with large displacements are less likely to be found than those within a few pixels displacement. More insight into this effect can be found in [5]. The Lucas–Kanade method is currently not real-time capable if a flow vector is estimated for every pixel of the input image using a common CPU implementation. Hence, Tomasi proposed to evaluate only those regions which yield a well-conditioned equation system when solving (2.3) and proposed an algorithm to search the image for such regions [92]. Recent GPU-based implementations are able to track up to 10,000 image points at frame rates of 25 Hz [120]. Such KLT trackers are frequently used in applications ranging from video analysis over structure and motion estimation to driver assistance systems.
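As a minimal illustration of the least-squares step (2.3), the following sketch solves the 2 × 2 normal equations for a single pixel. The window size, the conditioning test, and the absence of iteration, warping, and pyramids are simplifying assumptions; the pixel is assumed to lie far enough from the image border for the full window to fit.

```python
import numpy as np

def lucas_kanade_at(Ix, Iy, It, x, y, n=7):
    """Least-squares solution of (2.3) for one pixel (x, y).

    Ix, Iy, It are the precomputed image derivatives; n is the (odd) window size.
    Returns the flow vector (u, v) or None if the system is ill-conditioned.
    """
    r = n // 2
    wx = Ix[y - r:y + r + 1, x - r:x + r + 1].ravel()
    wy = Iy[y - r:y + r + 1, x - r:x + r + 1].ravel()
    wt = It[y - r:y + r + 1, x - r:x + r + 1].ravel()

    # normal equations of the over-determined system  [wx wy] (u, v)^T = -wt
    A = np.stack([wx, wy], axis=1)
    G = A.T @ A                      # 2x2 structure tensor of the window
    b = -A.T @ wt
    if np.linalg.cond(G) > 1e6:      # aperture problem: nearly rank deficient
        return None
    return np.linalg.solve(G, b)     # sub-pixel accurate (u, v)
```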
2.3 Variational Methods

Simultaneously with the work of Lucas and Kanade, in 1981 Horn and Schunck [42] proposed another approach to cope with the under-determined optical flow constraint. The authors reverted to regularization, or smoothness, of the resulting flow field. Such smoothness was introduced by penalizing the derivative of the optical flow field, yielding an energy which was minimized by means of variational approaches:

    min_{u(x),v(x)} { ∫_Ω ( |∇u(x)|² + |∇v(x)|² ) dΩ + λ ∫_Ω ( It + Ix u(x) + Iy v(x) )² dΩ }.      (2.4)
Here, u(x) ∈ R and v(x) ∈ R are the components of the two-dimensional displacement vector for an image pixel x ∈ R²; Ω is the image domain. The first integral (regularization term) penalizes high variations in the optical flow field to obtain smooth displacement fields. The second integral (data term) imposes the optical flow constraint (2.2) with a quadratic cost. The free parameter λ weighs the optical flow constraint against the regularization force. Variational optical flow approaches compute the optical flow field for all pixels within the image; hence they result in a dense optical flow field. Being computationally more expensive, variational approaches only recently became more popular as processor speed increased, allowing for the computation of dense flow fields in real time. In [18, 19], a highly efficient multi-grid approach on image pyramids is employed to obtain real-time performance. In [119] a duality-based method was proposed, which uses the parallel computing power of a GPU to gain real-time performance.
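A minimal single-scale sketch of minimizing (2.4) with the classical alternating update of Horn and Schunck is given below. The derivative filters, the neighborhood-averaging kernel, the iteration count, and the parameter alpha (which roughly plays the role of 1/√λ in (2.4), up to the discretization of the Laplacian) are illustrative choices, not the settings of [42] or of the real-time schemes cited above.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, alpha=10.0, n_iter=200):
    """Single-scale Horn-Schunck solver (sketch of minimizing (2.4))."""
    I1 = I1.astype(np.float32); I2 = I2.astype(np.float32)
    Ix = convolve(I1, np.array([[-0.5, 0.0, 0.5]]))   # spatial derivatives
    Iy = convolve(I1, np.array([[-0.5], [0.0], [0.5]]))
    It = I2 - I1                                      # temporal derivative

    u = np.zeros_like(I1); v = np.zeros_like(I1)
    avg = np.array([[0, 0.25, 0], [0.25, 0, 0.25], [0, 0.25, 0]])  # neighbor mean
    for _ in range(n_iter):
        u_bar = convolve(u, avg); v_bar = convolve(v, avg)
        num = Ix * u_bar + Iy * v_bar + It            # OFC residual at the average
        den = alpha**2 + Ix**2 + Iy**2
        u = u_bar - Ix * num / den                    # classical alternating update
        v = v_bar - Iy * num / den
    return u, v
```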
Fig. 2.5 For all three functions shown on the right, the total variation is identical as it merely measures the size of the total jump independent of the number of steps (1 step, 10 steps, and 100 steps, respectively). Figure courtesy of Thomas Pock [70]
The original work of Horn and Schunck suffers from the fact that its smoothness term does not allow for discontinuities in the optical flow field and its data term does not handle outliers robustly. Since discontinuities in the optical flow often appear in conjunction with high image gradients, several authors replace the homogeneous regularization in the Horn–Schunck model with an anisotropic diffusion approach [66, 113]. Others substitute the squared penalty functions in the Horn–Schunck model with more robust variants. To date the work of Horn and Schunck has attracted over 3,900 citations, many of these dealing with applications of motion estimation in different scenarios, many suggesting alternative cost functionals, and many investigating alternative minimization strategies.

Cohen [24] as well as Black and Anandan [9] apply estimators from robust statistics and obtain a robust and discontinuity-preserving formulation for the optical flow energy. Aubert et al. [2] analyze energy functionals for optical flow incorporating an L1 norm for the data fidelity term and a general class of discontinuity-preserving regularization forces. Brox et al. [13] employ a differentiable approximation of the L1 norm for both the data and smoothness term and formulate a nested iteration scheme to compute the displacement field.

The integral of the L1 norm of the gradient, i.e. ∫ |∇f(x)| dx, is referred to as the total variation (TV norm) of f. Essentially, the integral counts the amount of variation, or fluctuation, in the data. In contrast to the original quadratic L2-regularity suggested by Horn and Schunck, the L1-regularity is known to better preserve discontinuities [9, 13, 30, 66, 81]. Figure 2.5 illustrates the advantage of using the L1 norm for denoising. For any monotone function growing from 0 to 1, its total variation is 1 regardless of whether the function undergoes a gradual transition or a sharp jump. As a consequence, in contrast to the quadratic regularization proposed by Horn and Schunck, the total variation no longer favors smooth transitions over sharp transitions. In the PDE community it is therefore called "discontinuity-preserving". The total variation plays a prominent role in image analysis since it is the most discontinuity-preserving among all convex regularizers. For a review of total variation methods in image analysis we refer the reader to [22].

Zach et al. [119] proposed to solve the robust TV optical flow formulation by rewriting it into a convex dual form, thereby allowing a quadratic decoupling of the data term and the smoothness term. Let us replicate this approach in Sect. 2.3.1 and draw conclusions about the algorithmic design behind the mathematical notation. It will turn out that basically two steps are performed alternatingly: (1) data term evaluation and (2) flow field denoising.
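The discontinuity-preserving behavior illustrated in Fig. 2.5 can also be checked numerically: the discrete total variation assigns the same cost to one sharp jump and to a staircase of many small steps, whereas a quadratic penalty strongly prefers the gradual transition. A small sketch with illustrative signals:

```python
import numpy as np

def tv(f):   return np.abs(np.diff(f)).sum()        # discrete total variation
def quad(f): return (np.diff(f) ** 2).sum()         # quadratic (L2) penalty

jump      = np.concatenate([np.zeros(50), np.ones(50)])   # one sharp step
staircase = np.repeat(np.linspace(0.0, 1.0, 11), 10)      # ten small steps

print(tv(jump), tv(staircase))      # both approx. 1.0: TV does not penalize the jump
print(quad(jump), quad(staircase))  # 1.0 vs. approx. 0.1: L2 favors the gradual ramp
```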
2.3.1 Total Variation Optical Flow

Recall that the general Horn and Schunck energy (2.4) for a two-dimensional flow field (u, v) (dropping the inherent dependency on x) is given by

    ∫_Ω ( |∇u|² + |∇v|² + λ |It + Ix u + Iy v|² ) dΩ.

The quadratic regularizer is surprisingly simple, favoring flow fields which are spatially smooth. In contrast to the original L2-regularity suggested by Horn and Schunck, the L1-regularity is known to better preserve discontinuities [9, 30, 64, 67, 81]. In recent years, researchers have suggested far more sophisticated regularization techniques based on statistical learning [90]. So far these have not been able to outperform the more naive approaches. Of course it is hard to say why this is the case; one reason may be that the challenge of learning "typical" flow patterns may not be feasible, given that different image structures, unknown object deformations and camera motions may give rise to a multitude of motion patterns with little resemblance between motion fields from different videos. Nevertheless, appropriate regularization is of utmost importance for optic flow estimation, since it stabilizes the otherwise ill-posed optical flow problem and induces a filling-in in areas of low contrast or texture. Replacing the quadratic penalizers with robust versions (i.e. the absolute function) yields

    ∫_Ω ( |∇u| + |∇v| + λ |It + Ix u + Iy v| ) dΩ.      (2.5)

Although (2.5) seems to be simple, it offers computational difficulties. The main reason is that the regularization term and the data term are not continuously differentiable. One approach is to replace the absolute function |x| with the differentiable approximation Ψε(x) = √(x² + ε²) and to apply a numerical optimization technique to this slightly modified functional (e.g. [13, 23]). In [119] the authors propose an exact numerical scheme to solve (2.5) by adapting the Rudin–Osher–Fatemi (ROF) energy [78] for total variation based image denoising to optical flow. We will review their approach in this section. The minimization of the ROF energy for image denoising will be covered in more detail in Sect. 2.4.
Fig. 2.6 Basic algorithmic design of the total variation optical flow algorithm. Subsequent data term evaluation and denoising steps, embedded into a warping strategy, are applied to an approximate flow field (e.g. starting with the zero flow field) to compute the resulting flow field
2.3.2 Quadratic Relaxation

Inspired by algorithms for TV-L1 minimization like the one in [3], Zach and co-workers [119] introduce auxiliary (so-called dual) variables u′ and v′ and replace one instance of the original variables u and v in (2.5) with the dual variables, yielding

    min_{u,v,u′,v′} ∫_Ω ( |∇u| + |∇v| + λ |It + Ix u′ + Iy v′| ) dΩ   with u = u′ and v = v′.

This step does not change the energy function. The key novelty is that the side conditions u = u′ and v = v′ are now relaxed and embedded into the energy functional to be minimized. This quadratic relaxation yields the following strictly convex approximation of the original total variation regularized optical flow problem:

    min_{u,v,u′,v′} ∫_Ω ( |∇u| + |∇v| + (1/2θ)(u − u′)² + (1/2θ)(v − v′)² + λ |It + Ix u′ + Iy v′| ) dΩ,      (2.6)

where θ is a small constant, such that u is a close approximation of u′ (and equivalently v is a close approximation of v′). Due to the quadratic terms, the resulting energy is strictly convex and hence has a unique minimum. This minimum is found by alternating steps updating either the dual variables, u′ and v′, or the primal variables, u and v, in every iteration (see also Fig. 2.6). More specifically, the two steps are

1. For u and v being fixed, solve

    min_{u′,v′} ∫_Ω ( (1/2θ)(u − u′)² + (1/2θ)(v − v′)² + λ |It + Ix u′ + Iy v′| ) dΩ.      (2.7)
Note that this minimization problem can be evaluated independently for every pixel because no spatial derivatives of the flow field contribute to the energy. More specifically, the optical flow constraint data term, λ|It + Ix u′ + Iy v′|, is minimized w.r.t. the dual flow variables u′ and v′ such that deviations from the primal variables are penalized quadratically. This demands a solution close to the given (approximate) primal flow field.

2. For u′ and v′ being fixed, solve

    min_{u,v} ∫_Ω ( |∇u| + |∇v| + (1/2θ)(u − u′)² + (1/2θ)(v − v′)² ) dΩ.      (2.8)

This is the total variation based image denoising model of Rudin, Osher, and Fatemi [78]. The denoising will be presented in more detail within the general framework in Sect. 2.4. For now, note that a denoising of the dual variables, u′ and v′, is computed.

Embedding (2.7) and (2.8) into a pyramid warping scheme implies that the denoised solution vector (u, v) serves as the new approximate flow field for the next iteration or the next lower pyramid level. Iteratively solving the data term and subsequent denoising yields the final optical flow result. This basic idea of data term solving and denoising is picked up in Sect. 2.4, where a generalized approach called Refinement Optical Flow is presented. In that section the solutions for the minimization problems (2.7) and (2.8) are also outlined. But beforehand, let us conclude the literature overview on optical flow approaches.
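To make the alternation of Fig. 2.6 concrete, the following sketch combines a pixel-wise thresholding step for the linearized optical flow constraint (the closed-form solution derived later in Sect. 2.4.1; cf. Table 2.3) with a simple denoising step. A median filter is used here merely as a stand-in for the exact ROF solver of (2.8), and warping and pyramids are omitted; this is an illustration of the data-term/denoising loop, not the scheme of [119].

```python
import numpy as np
from scipy.ndimage import median_filter

def ofc_threshold_step(Ix, Iy, It, u, v, lam):
    """Pixel-wise minimizer of 0.5*||w - w'||^2 + lam*|It + Ix*u + Iy*v|,
    given the prior flow (u, v) (thresholding, cf. Table 2.3)."""
    g2 = Ix**2 + Iy**2 + 1e-12            # |grad I|^2, guarded against zero
    rho = It + Ix * u + Iy * v            # linearized residual at the prior
    small = np.abs(rho) <= lam * g2       # residual can be driven exactly to zero
    step = np.where(small, rho / g2, lam * np.sign(rho))
    return u - step * Ix, v - step * Iy

def refine_flow(Ix, Iy, It, u, v, lam=50.0, n_iter=20, radius=2):
    """Alternate data-term thresholding and flow denoising (sketch of Fig. 2.6)."""
    size = 2 * radius + 1
    for _ in range(n_iter):
        u, v = ofc_threshold_step(Ix, Iy, It, u, v, lam)
        u = median_filter(u, size=size)   # stand-in for the ROF denoising (2.8)
        v = median_filter(v, size=size)
    return u, v
```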
2.3.3 Large Displacement Flow: Novel Algorithmic Approaches

Traditionally, large displacements in the flow can be captured despite the data term linearization by iteratively solving the optic flow problem in a coarse-to-fine manner, where on each level of the pyramid one of the two images is warped according to the flow field estimated on the coarser scale. Clearly, this strategy is suboptimal because fine-scale structures of low contrast may easily disappear on the coarser scales and hence will not be recovered in the final flow field. There have been efforts to improve flow estimation for the case of large displacements: One such method, proposed by Brox et al. [12], is based on imposing SIFT feature correspondences as a constraint in the optic flow estimation, thereby driving the estimated flow field to match corresponding features. While this may fail due to incorrect correspondences, in cases where the SIFT correspondence is reliable it will help to capture large-displacement flow. On the other hand, the approach is still based on the traditional coarse-to-fine hierarchical estimation scheme.

From an optimization point of view, the limitations regarding the estimation of large-displacement flow invariably arise because the original (non-linearized) optic flow functional is not convex. Interestingly, the above quadratic relaxation approach can be exploited even further, giving rise to a method proposed by Steinbruecker et al. [86] which can capture
Fig. 2.7 One of two images from the Middlebury data set, the flow field computed using traditional warping, and the flow field computed with the decoupling approach in [86]. The latter approach better captures large displacements of small-scale structures like the balls
large-displacement flow without reverting to a coarse-to-fine warping strategy. The key idea is to consider the quadratic relaxation of (2.6), but for the non-linearized optic flow problem:

    ∫_Ω ( |∇u| + |∇v| + (1/2θ)(u − u′)² + (1/2θ)(v − v′)² + λ |I1(x) − I2(x + (u′, v′)⊤)| ) dΩ.      (2.9)

A closer look reveals that the two optimization problems in (u, v) and in (u′, v′) can each be solved globally: The optimization in (u, v) for fixed (u′, v′) is the convex ROF problem and can therefore be solved globally. And the optimization in (u′, v′) for fixed (u, v) is a pointwise optimization problem which can be solved (in a discretized setting) by a complete search for each pixel. The latter can be drastically accelerated using parallel processors such as the GPU. One can therefore directly solve the above problem by alternating the two solutions for (u, v) and (u′, v′), reducing the coupling parameter θ → 0 during the iterations. In practice, this implies that the original flow estimation problem is solved by alternatingly determining the best correspondences in the spatial vicinity and smoothing the correspondence field.

Figure 2.7 shows a comparison of standard warping techniques with the above strategy: It shows that indeed small structures undergoing large displacements might be better captured using this technique. While the above solution allows a decoupling of the original flow problem into two problems, each of which can be minimized globally, the overall solution may still be far from globally optimal. Nevertheless, recent efforts aim at finding convex relaxations of the non-linearized flow estimation problem [34]. The key idea is to formulate flow estimation as a problem of multilabel optimization and to devise appropriate convex representations which account for the fact that labels are vector-valued. The resulting algorithms give rise to optic flow solutions which are independent of initialization and minimization strategy. Moreover, they come with a quality guarantee in terms of a computable energetic bound (of typically 5%) from the (unknown) global optimum.
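A sketch of the point-wise step of this decoupling is given below: for fixed (u, v), the auxiliary flow (u′, v′) is found at every pixel by exhaustive search over integer displacements, minimizing the quadratic coupling cost plus the non-linearized brightness difference in (2.9). The search radius, the nearest-neighbor image lookup, and the omission of the θ-annealing and of the alternation with the ROF step are simplifying assumptions.

```python
import numpy as np

def complete_search_step(I1, I2, u, v, theta=0.25, lam=50.0, radius=4):
    """Point-wise optimization of (2.9) in (u', v') for fixed (u, v) - sketch."""
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w]
    best_cost = np.full((h, w), np.inf)
    u_aux = np.zeros_like(u); v_aux = np.zeros_like(v)
    for dv in range(-radius, radius + 1):
        for du in range(-radius, radius + 1):
            # nearest-neighbor lookup of I2 at the displaced position
            x2 = np.clip(xs + du, 0, w - 1)
            y2 = np.clip(ys + dv, 0, h - 1)
            cost = ((du - u)**2 + (dv - v)**2) / (2.0 * theta) \
                   + lam * np.abs(I1 - I2[y2, x2])
            better = cost < best_cost
            best_cost[better] = cost[better]
            u_aux[better] = du
            v_aux[better] = dv
    return u_aux, v_aux
```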
2.3.4 Other Optical Flow Approaches

Over the last years, the number of optical flow algorithms with increasingly good performance has grown dramatically, and it has become difficult to summarize all contributions. Robust variational approaches have been shown to yield the most accurate results for the optical flow problem known in the literature. But they cover only a subset of dense optical flow algorithms. In [91] the authors pick up recent developments in optical flow estimation and reveal some of the secrets of accurate flow estimation techniques. They observe that median filtering of optical flow values is one key to accurate flow fields. Consequently, they propose a variational approach based on a continuous version of the median filter. We will take a closer look at the median filter later in this chapter. In [84], a two-step method is presented where optical flow is first computed in a local neighborhood and a dense flow field is then derived in a relaxation step. An unbiased second-order regularizer for flow estimation was proposed in [95]. We shall revisit this approach in Sect. 2.4.2.5. Other optical flow algorithms with outstanding performance are based on graph cuts (discrete optimization) [26] or on the fusion of previously mentioned approaches into an optimized result [53]. A similar concept of fusion in a continuous setting was proposed in [94]. For more information on recent developments and the latest state-of-the-art optical flow algorithms, we refer to the Middlebury optical flow evaluation homepage [6].
2.4 The Flow Refinement Framework

Although the optical flow constraint (2.2) was originally formulated for every single pixel individually, it is impossible to calculate the two translational components of optical flow using only a single reference pixel. This is part of the aperture problem: the aperture of a single pixel is too small. We have seen solutions based on extending the aperture (Census flow and the Lucas–Kanade method) and variational approaches where smoothness is used to regularize the optical flow field solution. In this chapter we discuss a third method, based on evaluating the flow field in the proximity of a given prior flow field. Figure 2.8 illustrates how this compares to the approaches revisited above.

The idea of (1) minimizing the weighted sum of a prior flow field and the optical flow constraint followed by (2) smoothing the optical flow result is very similar to the ROF-denoising based total variation optical flow approach. If used iteratively, the algorithm follows the general make-up in Fig. 2.9. This general make-up consists of a pixel-wise (local) data term optimization step ② and a global (non-local) smoothness term evaluation step ③. Both steps are presented separately in Sects. 2.4.1 and 2.4.2. Section 2.4.3 summarizes the algorithm and provides implementation details.
Fig. 2.8 If the aperture is too small, a motion vector cannot be computed using a single pixel (left). Therefore authors have used either larger apertures or a smoothing constraint, enforcing similar optical flow values for neighboring pixels (middle). In case a prior flow field is given, minimizing the weighted sum of the optical flow constraint and the distance to the prior solution (right) also yields a unique solution (resulting flow vector not shown in the illustration)
Fig. 2.9 The general framework for Refinement Optical Flow. An approximate flow field is refined by iteratively evaluating the data term and subsequent smoothing, embedded in a pyramid warping scheme. The result is an optimized and refined optical flow field (compare with Fig. 2.6)
2.4.1 Data Term Optimization

The data term represents the driving force for optical flow computation. Instead of limiting ourselves to the two-dimensional (optical flow) case, we present the findings in this section using an n-dimensional flow vector. Given an approximate n-dimensional flow field u′ ∈ Rⁿ as prior, an optimal solution is computed for every pixel which best fulfills a single or multiple data terms. More specifically, a solution u∗ = (u1, . . . , un) ∈ Rⁿ is obtained which minimizes the weighted sum of the distance to the prior flow field and the individual data term violations p(·),
    u∗ = arg min_u { (1/2) ‖u − u′‖² + Σ_{p∈P} ω( λp p(u) ) }.      (2.10)
Here, P is the set of all data terms and each data term is given an individual weight λp. ω is a function penalizing deviations from 0. In this subsection, the robust L1 norm will be used, therefore ω(x) = |x|. A typical data term is the brightness constancy constraint, p(u) = I1(x) − I2(x + u). In fact, for a single data term this becomes the methodology employed in [119]. See Chap. 8 for more details on different data terms.

The investigated data terms in this subsection employ linearized constraint violations (e.g. violations of the optical flow constraint (2.2)). The linearized version of a constraint p(u) consists of a constant scalar value, denoted by a subscript zero p0, and a vector denoted by p, i.e. λp p(u) ≈ p0 + p⊤u = p0 + p1 u1 + · · · + pn un. Note that the weight λp is included in the data term entries pi for convenience, where pi is the ith component of the vector p. In summary, in this section we will be dealing with the following data term:
    u∗ = arg min_u { (1/2) ‖u − u′‖² + Σ_{p∈P} |p0 + p⊤u| }.      (2.11)
The remainder of this subsection deals with the optimization of (2.11) and presents a simple example employing the optical flow constraint. Although (2.11) may look quite harmless, it offers some computational difficulties. Due to the absolute function in the robust data deviation term, (2.11) is not differentiable at p0 + p⊤u = 0. This is unfortunate, because the violations of the data terms should be as small as possible, which drives the data terms toward zero, exactly where numerical instabilities arise. Two possible ways to address this issue are an ε-approximation of the absolute norm [13] and quadratic optimization techniques. The ε-approximation of the absolute norm yields a lagged iterative solution, while quadratic optimization yields an exact solution within a fixed number of thresholding steps. In the following subsection, the lagged iterative solution is presented. Then, quadratic optimization techniques are derived for a single, two, and multiple data terms.

2.4.1.1 Approximating the Absolute Function

The absolute function |g(x)| is not differentiable at g(x) = 0. An often used approximation (e.g. see [13]) is to replace the absolute function by Ψ(g(x)) = √(g(x)² + ε²) with a small constant ε > 0. This function is convex and differentiable. Its derivative is given by

    Ψ′(g(x)) = g′(x) g(x) / √(g(x)² + ε²).

Due to the appearance of g(x) in the derivative, an iterative approach is commonly used in order to minimize the function w.r.t. g(x) by gradient descent. Such a solution approach is also called lagged feedback because, usually, a few iterations are needed to get close to the global minimum.
Replacing the absolute functions in (2.11) with the Ψ functions yields the following approximation for the optimization problem:
    min_u { (1/2) ‖u − u′‖² + Σ_{p∈P} Ψ( p0 + p⊤u ) }.      (2.12)
The objective function is now convex and the minimum is found by successive gradient descent steps. This is done by setting its derivative w.r.t. u^k, with k being the iteration index, equal to zero. The starting point u⁰ is arbitrary because the objective function is convex. This yields

    u^k = u′ − Σ_{p∈P} [ 1 / √( (p0 + p⊤u^{k−1})² + ε² ) ] p ( p0 + p⊤u^k ),

    u^k = ( Id + Σ_{p∈P} [ 1 / √( (p0 + p⊤u^{k−1})² + ε² ) ] p p⊤ )^{−1} × ( u′ − Σ_{p∈P} [ 1 / √( (p0 + p⊤u^{k−1})² + ε² ) ] p0 p ).

Due to the lagged feedback, in general a few iterations (re-linearizations) are needed to find a minimum which is close to the optimum.
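The fixed-point iteration above translates directly into a small per-pixel solver. In the sketch below, the data terms are represented as (p0, p) pairs with the weights λp already folded into p0 and p; the values of ε and the number of re-linearizations are illustrative choices.

```python
import numpy as np

def lagged_data_step(u_prior, terms, eps=1e-3, n_iter=5):
    """Lagged-feedback minimization of 0.5*||u - u'||^2 + sum_p Psi(p0 + p.u).

    terms is a list of (p0, p) with p a length-n numpy vector (weights included).
    Implements the fixed-point iteration derived above for a single pixel.
    """
    n = u_prior.shape[0]
    u = u_prior.copy()
    for _ in range(n_iter):
        A = np.eye(n)
        b = u_prior.copy()
        for p0, p in terms:
            # weight evaluated at the previous iterate u^(k-1) (lagged feedback)
            w = 1.0 / np.sqrt((p0 + p @ u) ** 2 + eps ** 2)
            A += w * np.outer(p, p)
            b -= w * p0 * p
        u = np.linalg.solve(A, b)
    return u
```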
2.4.1.2 Quadratic Optimization

Another approach to solve the objective equation (2.11) is quadratic programming. In that case, the optimization problem involving the absolute function is transformed into a quadratic optimization problem with inequality constraints. Some fundamentals of optimization techniques based on quadratic optimization will now be presented. Finally, a fast and optimal thresholding scheme for solving the resulting quadratic optimization problem is presented.

In a first step, the absolute functions (the original data terms p ∈ P) are replaced by dual variables p′ and inequality constraints (see Fig. 2.10 for an example). The remaining minimization problem is to find the minimum of

    (1/2) ‖u − u′‖² + Σ_{p∈P} p′      (2.13)

with inequality constraints

    (i)  p0 + p⊤u − p′ ≤ 0   ∀p ∈ P   and
    (ii) −p0 − p⊤u − p′ ≤ 0   ∀p ∈ P.      (2.14)
Equation (2.13) with side conditions (2.14) yields the same minimum as (2.11). The absolute function has been replaced yielding a differentiable objective function. This, however, is at the cost of one additional variable and two inequality constraints per data term. Note that (2.13) is a combination of convex functions and
Fig. 2.10 Example for finding minx {2 + |4 − x|} (minimum x ∗ = 4). The | · | function is replaced by a dual variable y and inequality constraints (equations to right), yielding minx,y {2 + y}. The minimum is (x ∗ , y ∗ ) = (4, 2). Note that both problems yield the same minimum for the primal variable x ∗
hence is convex itself. Let us now use the Karush–Kuhn–Tucker theorem [47] for convex quadratic minimization problems under linear inequality constraints to find an optimal solution to (2.13) and hence to the original problem (2.11).

Proposition 2.1 For a convex quadratic minimization problem min_x {f(x)} under N linear inequality constraints g_i(x) ≤ 0, where 0 ≤ i ≤ N, a solution x∗ is a global optimum if there exist constants μ_i such that the Karush–Kuhn–Tucker (KKT) conditions [47] are fulfilled:

    (i)   Stationarity:              ∇f(x∗) + Σ_i μ_i ∇g_i(x∗) = 0,
    (ii)  Primal feasibility:        g_i(x∗) ≤ 0   ∀i,
    (iii) Dual feasibility:          μ_i ≥ 0   ∀i,
    (iv)  Complementary slackness:   μ_i g_i(x∗) = 0   ∀i.

Solution schemes for a single data term, for two data terms based on thresholding, and for the general case of multiple data terms are now presented in the following.
2.4.1.3 Single Data Term

For a single data term, the task is to find the vector u∗ which solves the minimization problem

    u∗ = arg min_u { (1/2) ‖u − u′‖² + |p0 + p⊤u| },      (2.15)

or its dual formulation

    u∗ = arg min_u { (1/2) ‖u − u′‖² + p′ }      (2.16)

with linear side conditions

    p0 + p⊤u − p′ ≤ 0   and   −p0 − p⊤u − p′ ≤ 0.

The solution computes as in Table 2.1 (first presented in [119]; see Sect. 8.3.2 for the proof).

Table 2.1 Thresholding checks for a single data term

    Thresholding check       Solution
    p(u′) < −p⊤p             u∗ = u′ + p
    |p(u′)| ≤ p⊤p            u∗ = u′ − ( p(u′) / (p⊤p) ) p
    p(u′) > p⊤p              u∗ = u′ − p
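The three checks of Table 2.1 can be written as a short per-pixel function. This is a sketch; p0 and p are assumed to already contain the weight λp, and u′ is the prior flow vector.

```python
import numpy as np

def threshold_single_term(u_prior, p0, p):
    """Closed-form minimizer of 0.5*||u - u'||^2 + |p0 + p.u|  (Table 2.1)."""
    rho = p0 + p @ u_prior            # data term evaluated at the prior, p(u')
    pp = p @ p                        # p^T p
    if rho < -pp:
        return u_prior + p
    if rho > pp:
        return u_prior - p
    return u_prior - (rho / pp) * p   # |rho| <= p^T p: project onto p(u) = 0
```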
2.4.1.4 Two Data Terms

In this subsection, an efficient solution scheme to minimize the objective function (2.11), or equivalently its dual quadratic optimization problem (2.13) with inequality constraints (2.14), for two data terms is presented. To this end, we have two data terms, λ1 p1(u) = a0 + a⊤u and λ2 p2(u) = b0 + b⊤u, where a0 and b0 are constants and a and b represent the linear parts of the data terms. The minimization problem states

    min_u { (1/2) ‖u − u′‖² + |a0 + a⊤u| + |b0 + b⊤u| },      (2.17)

where the first absolute term is λ1 p1(u) and the second is λ2 p2(u). Its dual formulation with dual variables a′ and b′ is

    min_u { (1/2) ‖u − u′‖² + a′ + b′ }   such that

    (i)   a0 + a⊤u − a′ ≤ 0,
    (ii)  −a0 − a⊤u − a′ ≤ 0,
    (iii) b0 + b⊤u − b′ ≤ 0,
    (iv)  −b0 − b⊤u − b′ ≤ 0.      (2.18)

The thresholding steps for solving (2.17) are given in Table 2.2; the derivation can be found in Sect. 8.3.3.
Table 2.2 Thresholding checks for two data terms

    Thresholding checks                                                                       Solution
    a0 + a⊤(u′ − a − b) ≥ 0  and  b0 + b⊤(u′ − a − b) ≥ 0                                     u∗ = u′ − a − b
    a0 + a⊤(u′ − a + b) ≥ 0  and  b0 + b⊤(u′ − a + b) ≤ 0                                     u∗ = u′ − a + b
    a0 + a⊤(u′ + a − b) ≤ 0  and  b0 + b⊤(u′ + a − b) ≥ 0                                     u∗ = u′ + a − b
    a0 + a⊤(u′ + a + b) ≤ 0  and  b0 + b⊤(u′ + a + b) ≤ 0                                     u∗ = u′ + a + b
    a0 + a⊤(u′ − a − (b/(b⊤b))(b0 + b⊤u′ − b⊤a)) ≥ 0  and  (1/(b⊤b)) |b0 + b⊤u′ − b⊤a| ≤ 1    u∗ = u′ − a − (b/(b⊤b))(b0 + b⊤u′ − b⊤a)
    a0 + a⊤(u′ + a − (b/(b⊤b))(b0 + b⊤u′ + b⊤a)) ≤ 0  and  (1/(b⊤b)) |b0 + b⊤u′ + b⊤a| ≤ 1    u∗ = u′ + a − (b/(b⊤b))(b0 + b⊤u′ + b⊤a)
    b0 + b⊤(u′ − b − (a/(a⊤a))(a0 + a⊤u′ − a⊤b)) ≥ 0  and  (1/(a⊤a)) |a0 + a⊤u′ − a⊤b| ≤ 1    u∗ = u′ − b − (a/(a⊤a))(a0 + a⊤u′ − a⊤b)
    b0 + b⊤(u′ + b − (a/(a⊤a))(a0 + a⊤u′ + a⊤b)) ≤ 0  and  (1/(a⊤a)) |a0 + a⊤u′ + a⊤b| ≤ 1    u∗ = u′ + b − (a/(a⊤a))(a0 + a⊤u′ + a⊤b)
    If all checks fail                                                                        u∗ solves a⊤u = −a0 and b⊤u = −b0

2.4.1.5 Multiple Data Terms

If more than two data terms are given, a direct solution via thresholding becomes computationally expensive due to the high number of comparisons needed. Let the number of data terms be n. Then for every possible solution (each data term may be negative, positive, or equal to zero, yielding 3ⁿ possible solutions), one has to perform n thresholding checks. Hence, the total number of comparisons is n · 3ⁿ; it increases exponentially. Of course, the search for the minimum can be stopped once a solution is found, yielding n · 3ⁿ only in the worst case. With one data term, three comparisons have to be performed. Two data terms lead to a worst case of 18 comparisons (see Table 2.2), and for three data terms up to 81 comparisons are necessary.

This leads to the question of whether other quadratic optimization techniques can be used to minimize the objective function. A large number of approximate algorithms are known which approximate the linear constraint functions with barrier or penalty functions [55]. An exact solution is given by the so-called active set method [25]. Here, the solution is found by gradient descent within an active subset of inequality conditions. However, the algorithmic complexity also prohibits a very large number of data terms. Hence, approximating the absolute function (e.g. as in (2.12)) or approximating the inequality constraints is the most practicable approach to handle multiple data terms efficiently.
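The combinatorial growth can be made explicit: each data term either vanishes or takes a fixed sign in the solution, and every sign pattern requires checking all n constraints. A tiny sketch of the count (illustrative only):

```python
from itertools import product

def worst_case_checks(n):
    """Number of thresholding comparisons for n data terms: n checks per
    sign pattern, 3**n patterns (negative / zero / positive per term)."""
    patterns = list(product((-1, 0, +1), repeat=n))
    return len(patterns) * n

print([worst_case_checks(n) for n in (1, 2, 3)])  # [3, 18, 81]
```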
Table 2.3 Thresholding checks for the toy example with one data term:
• if $I_t + I_x u_1' + I_y u_2' < -\lambda|\nabla I|^2$:   $(u_1, u_2)^\top = (u_1', u_2')^\top + \lambda\nabla I$
• if $|I_t + I_x u_1' + I_y u_2'| \le \lambda|\nabla I|^2$:   $(u_1, u_2)^\top = (u_1', u_2')^\top - \dfrac{I_t + I_x u_1' + I_y u_2'}{|\nabla I|^2}\,\nabla I$
• if $I_t + I_x u_1' + I_y u_2' > \lambda|\nabla I|^2$:   $(u_1, u_2)^\top = (u_1', u_2')^\top - \lambda\nabla I$
Fig. 2.11 Optical flow for the rubber whale sequence of the Middlebury optical flow data set using a simple thresholding step. The computed flow field (second from left) is noisy and shows many outliers. The data term deviation (third from left) is small (black corresponds to 3.75% gray value error) while the optical flow end point error is fairly large (right image, black denotes 0.5 px error)
2.4.1.6 Toy Example: Optical Flow Data Term

This subsection presents a toy example, minimizing (2.11) with one data term, the linearized optical flow constraint. The initial flow field to start with is set to $(u_1, u_2) = (0, 0)$. Ergo, small flow vectors are favored, which is correct in this very example, where most of the flow vectors have a displacement below one pixel. For the simple case of a one-channel gray value image, the data term becomes $\lambda p(u) = \lambda(I_t + I_x u_1 + I_y u_2)$, which yields the thresholding steps shown in Table 2.3 to solve for $u_{1,2}$ (see also [119] and Sect. 8.3.2). As one would expect, if only the thresholding step is performed for a given initial (approximate) flow field, the resulting optical flow field shows a lot of noise. See Fig. 2.11 for such a result with λ = 50. The data term is nearly fulfilled for most parts of the image, while the optical flow field proves to be very noisy. Hence, it is straightforward to denoise and smooth the flow field in order to derive a more appealing solution. The next section presents methods for the second step of the Refinement Optical Flow, the smoothness term.
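The thresholding step of Table 2.3 can be sketched for whole images with a few lines of NumPy; the array-based interface and the small guard against a vanishing gradient are our own choices, not taken from the book.

```python
import numpy as np

def threshold_optical_flow(u1, u2, Ix, Iy, It, lam):
    """Pixel-wise thresholding for the linearized optical flow data term
    lambda * (It + Ix*u1 + Iy*u2), following Table 2.3.
    u1, u2     : current (approximate) flow components, 2D arrays
    Ix, Iy, It : image derivatives / temporal difference, same shape
    lam        : data term weight lambda"""
    rho = It + Ix * u1 + Iy * u2                     # data term residual
    grad_sq = Ix ** 2 + Iy ** 2
    safe = np.maximum(grad_sq, 1e-9)                 # avoid division by zero
    # three cases of Table 2.3, evaluated for every pixel at once
    step1 = np.where(rho < -lam * grad_sq, lam * Ix,
            np.where(rho > lam * grad_sq, -lam * Ix, -rho * Ix / safe))
    step2 = np.where(rho < -lam * grad_sq, lam * Iy,
            np.where(rho > lam * grad_sq, -lam * Iy, -rho * Iy / safe))
    return u1 + step1, u2 + step2
```

Applying this once to a zero flow field already reproduces the qualitative behavior of Fig. 2.11: the data term is largely satisfied, but the flow field is noisy until the smoothness step of the next section is added.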
2.4.2 Smoothness Term Evaluation

The first step in the general framework for Refinement Optical Flow consists of a data term evaluation step. The data term optimization (2.10) is independently performed for each pixel. An advantage of such an algorithm is that it can be sped
up using multiple processors and parallel computing power, e.g. modern graphics processing units (GPUs). The disadvantage is that the solution is local in the sense that only the pixel itself contributes to the solution. As seen above (compare Fig. 2.11), this leads to noisy flow fields, mainly due to three reasons: corrupted image data due to sensor noise, low entropy (information content) in the image data, and illumination artifacts. Assuming that the noise is uncorrelated and has zero mean, the common approach to lower the noise level is a subsequent smoothing operation. However, smoothing the flow field may lead to a lack of sharp discontinuities that exist in the true flow field, especially at motion boundaries. Hence, discontinuity-preserving filters yielding a piecewise smooth flow field have to be employed. The aperture problem (see Fig. 2.3) can only be solved if information from a region's boundary is propagated into the interior. While smoothing can be performed locally, this does not wholly solve the aperture problem (unless the filter mask is chosen large enough). Only global techniques, propagating information across the whole image, promise improvements. The three main objectives for the smoothing step can be summarized as
• Discard outliers in the flow field due to corrupted image data (denoising).
• Preserve edges, i.e. do not smooth over flow edges.
• Propagate information into areas of low texture (smoothing).

Illumination artifacts are different in nature as they cannot be removed by denoising or smoothing. Section 3.1 will tackle the illumination problem and show how to increase robustness to illumination changes by preparing the input image accordingly. In this subsection, different smoothing (or denoising) filters are presented: the median filter, total variation denoising filters, an anisotropic and edge-preserving diffusion filter, and a second-order prior denoising. Quantitative results obtained using the different filters are discussed in Sect. 3.2. In general, a filter F is applied to a given n-dimensional flow field u, where the gray value image I may serve as a prior; it returns the smoothed flow field $u'$ such that

$u' = F(u, I).$   (2.19)
2.4.2.1 Median Filtering

Discarding outliers and preserving edges is a well-known characteristic of rank filters [58]. The median filter is probably the best-known rank filter. According to [31], the median of N numerical values has the following properties:
• The median of a list of N values is found by sorting the input array in increasing order and taking the middle value.
• The median of a list of N values has the property that the list contains as many greater values as smaller values than this element.
• The median of a list of N values minimizes the sum of deviations over all list entries.
The last property makes the median filter especially suitable for denoising if noise characteristics are unknown. The median filter also has the nice property of converging to a periodic solution if recursively executed on an input image. This periodic (for most parts of the image stationary) image is called the root image of the median filter [116]. For recursively applying n median filter steps, denoted by a superscript, i.e.

$u' = \mathrm{MF}^{n}(u) = \mathrm{MF}\big(\mathrm{MF}^{n-1}(u)\big),$   (2.20)

the root image property yields

$\exists\, n > 0,\ i > 0:\quad \mathrm{MF}^{n+i}(u) = \mathrm{MF}^{n}(u).$

Recursively applying the median filter propagates information across the image and produces piecewise smooth flow fields. For the root image, the above definitions of the median filter are inherently fulfilled; outliers are replaced by local medians, such that deviations between neighboring pixels are minimized. In the experiments, a fixed number of median filter iterations is used. Using the Bubble-Sort algorithm, median filtering can be implemented quite efficiently employing a fixed number of comparisons [31]. The window size used in the experiments is 3 × 3. A fixed number of median filter iterations and a fixed window size are especially suitable when using parallel processing to speed up computational time.

2.4.2.2 TV-L1 Denoising

The continuous counterpart to median filtering is total variation denoising with an L1 data term, which provides an edge-preserving and smooth flow field. The classical TV denoising algorithm is described by Rudin, Osher, and Fatemi in [78]. Their algorithm seeks an equilibrium state (minimal energy) of an energy functional consisting of the TV norm of the data field, $|\nabla u'|$, and a fidelity term of the data field $u'$ to the noisy input data field $u$:

$u' = TV\text{-}L^{p}(u) = \min_{u'} \int_{\Omega} \Big\{ |\nabla u'| + \tfrac{1}{2}\lambda\,|u' - u|^{p} \Big\}\, d\Omega .$   (2.21)

Here, λ ∈ R is a scalar value controlling the fidelity of the solution to the input image (inversely proportional to the measure of denoising); p = 2 yields a quadratic penalizer and p = 1 linear penalizing. The resulting data field contains the cartoon part of the image [117]. The cartoon has a curve of discontinuities, but elsewhere it is assumed to have small or null gradient, $|\nabla u'| \approx 0$. In the original work [78], a quadratic data fidelity term (p = 2) was used. A closer approximation of the discrete median filter is an absolute data fidelity term [7]. The following solution scheme for total variation with an absolute data term is found in [70]:

Proposition 2.2 The solution of

$TV\text{-}L^{1}(u) = \min_{u'} \int_{\Omega} \Big\{ |\nabla u'| + \tfrac{1}{2}\lambda\,|u' - u| \Big\}\, d\Omega ,$   (2.22)
is given by

$u' = q + \tfrac{1}{2}\lambda\,\operatorname{div} p .$

The dual variables $p = [p^1, p^2]^\top$ and $q$ are defined iteratively (superscript denotes the iteration index, subscripts denote the pixel location) by computing

$\tilde{p}^{\,n+1} = p^{n} + \frac{2\tau}{\lambda}\,\nabla\!\Big(q^{n} + \tfrac{1}{2}\lambda\,\operatorname{div} p^{n}\Big), \quad\text{followed by}\quad p^{n+1} = \frac{\tilde{p}^{\,n+1}}{\max\{1, |\tilde{p}^{\,n+1}|\}}$

and

$q^{n+1}_{i,j} = \begin{cases} q^{n}_{i,j} - \tfrac{1}{2}\lambda\theta & \text{if } q^{n}_{i,j} - u_{i,j} > \tfrac{1}{2}\lambda\theta,\\[2pt] q^{n}_{i,j} + \tfrac{1}{2}\lambda\theta & \text{if } q^{n}_{i,j} - u_{i,j} < -\tfrac{1}{2}\lambda\theta,\\[2pt] u_{i,j} & \text{otherwise}, \end{cases}$

where $p^{0} = 0$, $q^{0} = u$, θ = 0.2 and the time step τ ≤ 1/4.

The implementation of Proposition 2.2 uses backward differences to approximate div p and forward differences for the numerical gradient computation in order to have mutually adjoint operators [20]. The discrete version of the forward difference gradient $(\nabla u)_{i,j} = ((\nabla u)^1_{i,j}, (\nabla u)^2_{i,j})$ at pixel position (i, j) for a data field of width N and height M is defined as

$(\nabla u)^1_{i,j} = \begin{cases} u_{i+1,j} - u_{i,j} & \text{if } i < N,\\ 0 & \text{if } i = N, \end{cases} \qquad (\nabla u)^2_{i,j} = \begin{cases} u_{i,j+1} - u_{i,j} & \text{if } j < M,\\ 0 & \text{if } j = M. \end{cases}$

The discrete version of the backward differences divergence operator is

$(\operatorname{div} p)_{i,j} = \begin{cases} p^1_{i,j} - p^1_{i-1,j} & \text{if } 1 < i < N,\\ p^1_{i,j} & \text{if } i = 1,\\ -p^1_{i-1,j} & \text{if } i = N, \end{cases} \;+\; \begin{cases} p^2_{i,j} - p^2_{i,j-1} & \text{if } 1 < j < M,\\ p^2_{i,j} & \text{if } j = 1,\\ -p^2_{i,j-1} & \text{if } j = M. \end{cases}$

2.4.2.3 TV-L2 Denoising

A solution scheme using the quadratic penalizers for (2.21), the ROF model,

$TV\text{-}L^{2}(u) = \min_{u'} \int_{\Omega} \Big\{ |\nabla u'| + \tfrac{1}{2}\lambda\,|u' - u|^{2} \Big\}\, d\Omega ,$

was proposed in [21]:
Proposition 2.3 The solution of

$TV\text{-}L^{2}(u) = \min_{u'} \int_{\Omega} \Big\{ |\nabla u'| + \tfrac{1}{2}\lambda\,(u' - u)^{2} \Big\}\, d\Omega$   (2.23)

is given by

$u' = u + \tfrac{1}{\lambda}\operatorname{div} p .$

The dual variable $p = [p^1, p^2]^\top$ is defined iteratively with $p^{0} = 0$, the time step τ ≤ 1/4, and

$\tilde{p}^{\,n+1} = p^{n} + \lambda\tau\,\nabla\!\Big(u + \tfrac{1}{\lambda}\operatorname{div} p^{n}\Big) \quad\text{and}\quad p^{n+1} = \frac{\tilde{p}^{\,n+1}}{\max\{1, |\tilde{p}^{\,n+1}|\}} .$   (2.24)
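To make the discrete operators and Proposition 2.3 concrete, here is a small NumPy sketch. It mirrors the forward/backward difference definitions and the update rule as reconstructed above, so the exact step sizes should be treated as an assumption rather than as the authors' reference implementation.

```python
import numpy as np

def grad(u):
    """Forward differences with zero at the last column/row (cf. Sect. 2.4.2.2).
    Returns the two components (grad u)^1 (along the width) and (grad u)^2."""
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    """Backward-difference divergence, the (negative) adjoint of grad."""
    dx = np.zeros_like(px)
    dx[:, 0] = px[:, 0]
    dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]
    dx[:, -1] = -px[:, -2]
    dy = np.zeros_like(py)
    dy[0, :] = py[0, :]
    dy[1:-1, :] = py[1:-1, :] - py[:-2, :]
    dy[-1, :] = -py[-2, :]
    return dx + dy

def tv_l2_denoise(u, lam, tau=0.25, n_iter=100):
    """TV-L2 (ROF) denoising of a 2D field following Proposition 2.3:
    u' = u + div(p)/lam, with dual ascent and re-projection onto |p| <= 1."""
    px = np.zeros_like(u)
    py = np.zeros_like(u)
    for _ in range(n_iter):
        u_smooth = u + div(px, py) / lam
        gx, gy = grad(u_smooth)
        px_t = px + lam * tau * gx            # tentative dual update (2.24)
        py_t = py + lam * tau * gy
        norm = np.maximum(1.0, np.sqrt(px_t ** 2 + py_t ** 2))
        px, py = px_t / norm, py_t / norm     # re-projection
    return u + div(px, py) / lam
```

The same grad/div pair is reused by the structure-adaptive and structure–texture sketches further below.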
2.4.2.4 Structure-Adaptive Smoothing

For many image sequences, discontinuities of the motion field tend to coincide with object boundaries and discontinuities of the brightness function. Although this is certainly not always true, the quantitative experiments on the optic flow benchmark [6] will demonstrate that the introduction of brightness-adaptive smoothness leads to improvements of optic flow estimates. An elegant theoretical treatise of image-adaptive regularization of flow fields was presented in [114]. There, the authors introduce regularizers of the form

$\Psi\big(\nabla u^{\top} D(\nabla I)\, \nabla u\big) + \Psi\big(\nabla v^{\top} D(\nabla I)\, \nabla v\big),$   (2.25)

corresponding to an inhomogeneous and potentially anisotropic regularization induced by a structure-dependent tensor D(∇I). The central idea is that the smoothness of v along the two eigenvectors of D is weighted by the corresponding eigenvalues. In fact, anisotropic structure-dependent regularization was already proposed by Nagel in 1983 [65]. This is achieved by setting

$D(\nabla I) = \frac{1}{|\nabla I|^{2} + 2\lambda^{2}}\Big( \nabla I^{\perp}\, \nabla I^{\perp\top} + \lambda^{2}\, \mathrm{Id} \Big),$

where Id denotes the unit matrix. This leads to an anisotropic smoothing of u along the level lines of the image intensity while preserving discontinuities across level lines. Lately, a fast numerical approximation scheme for anisotropic diffusion was proposed by Felsberg in [32]. This scheme will also be evaluated in the experimental section. In order to include structure-adaptive smoothing into the total variation denoising, an inhomogeneous isotropic regularization is considered. Following [1], discontinuities of the motion field arising at locations of strong image gradients are favored by setting $D(\nabla I) = g(|\nabla I|)\,\mathrm{Id}$ with a strictly decreasing positive function

$g\big(|\nabla I|\big) = \exp\big(-\alpha |\nabla I|^{\beta}\big).$   (2.26)
Instead of solving (2.23), this yields the following solution scheme, proposed in [96]:

$TV\text{-}L^{2}(u, I) = \min_{u'} \int_{\Omega} \Big\{ g\big(|\nabla I|\big)\,|\nabla u'| + \tfrac{1}{2}\lambda\,(u' - u)^{2} \Big\}\, d\Omega .$   (2.27)

This solution is obtained using Proposition 2.3 and replacing (2.24) with

$p^{n+1} = \frac{\tilde{p}^{\,n+1}}{\max\big\{1,\ |\tilde{p}^{\,n+1}| / g(|\nabla I|)\big\}} .$
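A short sketch of the structure-adaptive weight (2.26) and the modified re-projection follows; the values of α and β and the small lower bound on g are illustrative assumptions, not parameters from the book.

```python
import numpy as np

def adaptive_weight(I, alpha=10.0, beta=0.8):
    """Edge-stopping weight g(|grad I|) = exp(-alpha * |grad I|**beta), cf. (2.26)."""
    gy, gx = np.gradient(I)                 # derivatives along rows and columns
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return np.exp(-alpha * mag ** beta)

def reproject_adaptive(px_t, py_t, g):
    """Re-projection of the weighted model (2.27): the dual variable is
    constrained to |p| <= g(|grad I|) instead of |p| <= 1."""
    g = np.maximum(g, 1e-6)                 # guard against division by zero
    norm = np.maximum(1.0, np.sqrt(px_t ** 2 + py_t ** 2) / g)
    return px_t / norm, py_t / norm
```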
2.4.2.5 Advanced Priors for Motion Estimation

The total variation regularity, although quite powerful for practical applications, is a rather simple prior, and one may ask if more sophisticated priors could be designed for the specific scenario under consideration. We will briefly sketch a few of these developments. Total variation techniques favor spatially smooth and in particular constant and piecewise constant flow fields. This effect is not always desired. Especially if the flow field is slightly affine, total variation regularization is known to cause a staircasing effect. As a remedy, in [95] the authors propose to penalize the variation of the second derivative. While it is rather straightforward to penalize second-order derivatives, the interesting aspect in the above work is to ensure that there is no bias in the sense that all deviations from affine flows are penalized consistently. This approach is also considered in the experiments in Sect. 3.2. While the above prior energetically favors affine flow fields, typical sequences arising in videos exhibit predominantly rigid body motion: the sequences often show a rigid scene filmed by a moving camera (a single 3D rigid motion), or rigid objects moving in a scene (multiple 3D rigid motions). In Sect. 8.2, we will detail a prior for variational flow field estimation which favors rigid flow fields. It is derived from epipolar geometry and simply penalizes deviations from motion along the epipolar lines. There are other scenarios in which priors for flow field estimation have been designed to target a specific application. For the estimation of fluid flows, for example, researchers have advocated priors which emerge from physical models of fluid flow. Since such flows are governed by the Navier–Stokes equation, the priors are derived from this equation and ideally penalize deviations from the predicted dynamics—see for example [27, 79]. For now, we will continue with the toy example from Sect. 2.4.1 and show how median filtering improves the optical flow field. The example is followed by implementation details on the Refinement Optical Flow framework.
2.4.2.6 Toy Example Cont.: Denoising the Flow Field

Let us continue the toy example from Sect. 2.4.1, Fig. 2.11. The inherent problem is that the optical flow vectors after the data term evaluation are quite noisy.
Fig. 2.12 Results after convergence of median filtering to the root image. The leftmost image shows the flow result from simple thresholding around the initial zero flow field. The result after convergence of iterative median filtering yields a smoother flow field and smaller flow error for most of the image (same color scale as in Fig. 2.11). This is at the cost of data term deviations
Hence, the flow field needs to be denoised. In the toy example, the median filter is applied recursively to the flow field shown in Fig. 2.11. The result can be seen in Fig. 2.12. Clearly, the flow field is much smoother than the one obtained by simply minimizing the data term equations, while edges are preserved (compare with Fig. 2.11). The average end point error is half that of the simple thresholding procedure. At the same time, the data term deviation (gray value difference between the current position and the gray value at the end point of the flow vector) is larger for most of the image, as one would expect. Notice that the approach in the toy example uses simple, pixel-wise thresholding and pixel-wise median filtering and can be seen as a local approach. Nevertheless, due to the iterated median filtering, message passing does take place, suggesting a touch of globality. Due to the median filter, the presented method is inherently robust against a wide range of outliers. Furthermore, it benefits from its easy implementation, compared to the classical approaches of optical flow estimation (KLT [54] and Horn and Schunck [42]). While small flow vectors are well approximated using this simple approach, large flow vectors, especially present in the lower part of the image, are still not precise. The next section reviews two common techniques to address this problem: warping and image pyramids. Furthermore, it provides implementation details for the involved numerical scheme. An evaluation comparing different data and smoothness terms as well as an evaluation of the Refinement Optical Flow framework on an optical flow benchmark can be found in Sect. 3.2.
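The recursive median filtering used in this toy example can be sketched in a few lines; the use of scipy.ndimage and the fixed iteration count are our own choices for illustration.

```python
from scipy.ndimage import median_filter

def median_refine(u1, u2, iterations=10, size=3):
    """Recursive 3x3 median filtering of both flow components (Sect. 2.4.2.1).
    A fixed number of iterations is used, as in the experiments; iterating
    until nothing changes would yield the root image of the median filter."""
    for _ in range(iterations):
        u1 = median_filter(u1, size=size)
        u2 = median_filter(u2, size=size)
    return u1, u2
```

Alternating this with the thresholding sketch above already reproduces the qualitative behavior of Figs. 2.11 and 2.12: outliers are removed while flow edges survive.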
2.4.3 Implementation Details

This section gives details on the implementation of the proposed Refinement Optical Flow framework. Insights into the implementation of the numerical scheme, the restriction and prolongation operators, and the warping scheme used are given.

2.4.3.1 Pyramid Restriction and Prolongation

Recall that large flow vectors in Fig. 2.12, especially present in the lower part of the image, are still not precise. This is mainly due to the linearized flow constraint
which does not allow for large flow vectors. Two common approaches are known to solve this problem: image pyramids and warping [13]. Image pyramids use downsampled versions of the image to find larger flow vectors, whereas warping linearizes the optical flow constraint using the derived solution as a new starting point. Therefore, pyramid approaches inherently use warping to propagate the results between pyramid levels. In our experiments we use image pyramids with a scale factor of 2. Propagating results from a lower pyramid level, say 640 × 480 px, to a higher pyramid level, say 320 × 240 px, is called a restriction operation. Restriction has to be performed on the gray value input images in order to assemble the image pyramid. The restriction operator is a combination of a low-pass 5 × 5 Gaussian filter and subsequent downsampling [87]. That is, a Gaussian filter with σ = 1,

$\frac{1}{16}\begin{bmatrix}1\\4\\6\\4\\1\end{bmatrix} \times \frac{1}{16}\,[\,1\;\;4\;\;6\;\;4\;\;1\,] \;=\; \frac{1}{256}\begin{bmatrix}1&4&6&4&1\\4&16&24&16&4\\6&24&36&24&6\\4&16&24&16&4\\1&4&6&4&1\end{bmatrix},$

is applied to the image. Then odd rows and columns are removed from the image, yielding the lower resolved pyramid image (note that such a procedure does require the size of the input image to be a power-of-two multiple of the size of the lowest resolved image). For the opposite direction, flow vectors and dual variables have to be transformed from the higher pyramid level onto the lower pyramid level. This operation is called prolongation. The prolongation operator up-samples the image, that is, inserts zero rows and columns at the odd positions, and then applies the 5 × 5 Gaussian filter multiplied by 4 to it. Here, one must differentiate between up-sampling of the flow vectors u, which have to be multiplied by an additional factor of 2, and up-sampling of the dual variable p. The dual variable p is not multiplied by a factor. Instead, Dirichlet boundary conditions are enforced by first setting the border of the dual variable to 0 and then up-sampling the dual variable. Optical flow results from upper pyramid levels (e.g. 320 × 240 px) are a good approximation for the optical flow on lower pyramid levels (e.g. 640 × 480 px). Hence, they can be passed to the Refinement Optical Flow data term evaluation step as the approximate solution $u'$. However, in order to allow for the estimation of large flow vectors and not to get stuck in local minima of the gray value intensity function, the optical flow constraint has to be evaluated in the vicinity of the approximate solution. This process is called warping.
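The restriction and prolongation operators described above can be sketched as follows; the helper names and the use of scipy.ndimage.convolve are our own, and the special treatment of the dual variable's border is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import convolve

# separable 5x5 kernel 1/16 [1 4 6 4 1]^T  x  1/16 [1 4 6 4 1]
_k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
KERNEL = np.outer(_k, _k)          # sums to one

def restrict(img):
    """Low-pass filter with the 5x5 kernel, then remove odd rows and columns."""
    smoothed = convolve(img, KERNEL, mode='nearest')
    return smoothed[::2, ::2]

def prolongate(field, is_flow=False):
    """Insert zero rows/columns at odd positions and filter with 4x the kernel.
    Flow components are additionally multiplied by 2 (doubled resolution);
    dual variables would be up-sampled without this factor."""
    up = np.zeros((2 * field.shape[0], 2 * field.shape[1]), dtype=field.dtype)
    up[::2, ::2] = field
    up = convolve(up, 4.0 * KERNEL, mode='nearest')
    return 2.0 * up if is_flow else up
```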
2.4.3.2 Re-sampling the Coefficients of the Optical Flow Data Term via Warping

To compute the image warping, the image $I_1$ is linearized using the first-order Taylor approximation near $x + u_0$ (instead of solely $x$).
Fig. 2.13 The two left images show the results of the median filtered optical flow on an image pyramid. The two right images have been achieved using additional warps on each pyramid level. Both approaches further improve the derived optical flow field from 0.38 px average end point error (as in Fig. 2.12) to 0.155 px and 0.128 px average end point error, respectively
Here, $u_0$ is the given (approximate) optical flow map:

$I_1(x + u) \approx I_1(x + u_0) + \nabla I_1(x + u_0)\,(u - u_0).$

The data fidelity term ρ(u) now reads

$\rho(u) = \underbrace{I_1(x + u_0) - I_0(x) - \nabla I_1(x + u_0)\,u_0}_{c} + \nabla I_1(x + u_0)\,u,$

where the left part, denoted by c, is independent of u and hence fixed. Commonly, bi-linear or bi-cubic look-up is used to calculate the intensity value $I_1(x + u_0)$ and the derivatives of $I_1$. The derivatives on the input images are approximated using the five-point stencil $\frac{1}{12}[\,-1\;\;8\;\;0\;\;-8\;\;1\,]$. If the bi-cubic look-up falls onto or outside the original image boundary, a value of 0 is returned for the derivative and the gray value.
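A sketch of this re-sampling step, using bilinear look-up via scipy.ndimage.map_coordinates; the function name and the slicing-based five-point stencil are our own, and out-of-bounds samples return 0 as described above.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def resample_coefficients(I0, I1, u1, u2):
    """Re-sample the coefficients of the linearized data term
    rho(u) = c + Ix*u1 + Iy*u2 around the current flow estimate (u1, u2)."""
    h, w = I0.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    coords = np.array([yy + u2, xx + u1])            # sample positions x + u0

    # five-point stencil 1/12 [-1 8 0 -8 1] applied along each axis of I1
    Ix = np.zeros_like(I1)
    Ix[:, 2:-2] = (-I1[:, 4:] + 8 * I1[:, 3:-1] - 8 * I1[:, 1:-3] + I1[:, :-4]) / 12.0
    Iy = np.zeros_like(I1)
    Iy[2:-2, :] = (-I1[4:, :] + 8 * I1[3:-1, :] - 8 * I1[1:-3, :] + I1[:-4, :]) / 12.0

    # bilinear look-up (order=1); outside the image a value of 0 is returned
    I1_w = map_coordinates(I1, coords, order=1, mode='constant', cval=0.0)
    Ix_w = map_coordinates(Ix, coords, order=1, mode='constant', cval=0.0)
    Iy_w = map_coordinates(Iy, coords, order=1, mode='constant', cval=0.0)

    # constant part c = I1(x+u0) - I0(x) - grad(I1)(x+u0) . u0
    c = I1_w - I0 - Ix_w * u1 - Iy_w * u2
    return c, Ix_w, Iy_w
```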
2.4.3.3 Toy Example Cont.: Pyramids and Warping

Figure 2.13 shows the progress in terms of optical flow end point error when applying firstly only a pyramid approach (with one initial warp for each level) and secondly additional warps on each pyramid level. Five pyramid levels and 10 warps on each pyramid level were used for this toy example. The decrease in end point error becomes visible as more sophisticated methods are used for the optical flow estimation.
2.4.3.4 Symmetric Gradients for the Data Term Evaluation

Assuming that $u_0$ is a good approximation for u, the optical flow constraint states that $I_0(x) \approx I_1(x + u_0)$. Taking this further onto image derivatives, we obtain that $\nabla I_0(x)$ is a good approximation for $\nabla I_1(x + u_0)$. Note that replacing $\nabla I_1(x + u_0)$ with $\nabla I_0(x)$ implies that no bi-cubic look-up for the image gradients has to be employed and the computation time can be sped up. However, it turns out that using blended versions of the derivatives, larger flow vectors can be matched and hence even better results are achieved. Figure 2.14 reveals that the accuracy of the optical flow result increases (keeping all other parameters fixed) if blended versions of the derivatives,

$\nabla I = (1 - \beta)\,\nabla I_1(x + u_0) + \beta\,\nabla I_0(x),$

are used. Values for β around 0.5 show the best results in terms of optical flow accuracy. This can be explained by the fact that both images contribute to the gradient, increasing the amount of image information used.
Fig. 2.14 The plot shows the optical flow accuracy, measured as the average end point error for different β values for the blending of the gradients from image I0 and I1 of the rubber whale test sequence. The improvement using a blended version of the gradients for the optical flow computation is visible
    Input:  Two intensity images I0 and I1
    Output: Flow field u from I0 to I1

    Preprocess the input images;
    for L = 0 to max_level do
        Calculate restricted pyramid images ^L I0 and ^L I1;
    end
    Initialize ^L u = 0 and L = max_level;
    while L >= 0 do
        for W = 0 to max_warps do
            Re-sample coefficients of rho using ^L I0, ^L I1, and ^L u;   (Warping)
            for Out = 0 to max_outer_iterations do
                Solve for ^L u via thresholding;                          (Sect. 2.4.1)
                for In = 0 to max_inner_iterations do
                    Smoothness term iteration on ^L u;                    (Sect. 2.4.2)
                end
            end
        end
        if L > 0 then
            Prolongate ^L u to next pyramid level L - 1;
        end
    end

Algorithm 2.1: Numerical scheme of the Refinement Optical Flow algorithm. In the numerical scheme, a super-scripted L denotes the pyramid level.
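Pulling the earlier sketches together, the scheme of Algorithm 2.1 might look roughly as follows in NumPy; restrict, prolongate, resample_coefficients, threshold_optical_flow and tv_l2_denoise are the hypothetical helpers defined in the previous sketches, and the loop counts and the power-of-two size assumption are illustrative.

```python
import numpy as np

def refinement_optical_flow(I0, I1, lam=30.0, max_level=4,
                            max_warps=10, outer_iters=5, inner_iters=1):
    """Rough skeleton of Algorithm 2.1 (coarse-to-fine, warping, thresholding,
    smoothing). Assumes image sizes divisible by 2**max_level."""
    # build the image pyramids (index 0 = full resolution)
    pyr0, pyr1 = [I0.astype(float)], [I1.astype(float)]
    for _ in range(max_level):
        pyr0.append(restrict(pyr0[-1]))
        pyr1.append(restrict(pyr1[-1]))

    u1 = np.zeros_like(pyr0[-1])
    u2 = np.zeros_like(pyr0[-1])
    for L in range(max_level, -1, -1):
        J0, J1 = pyr0[L], pyr1[L]
        for _ in range(max_warps):
            # warping: re-linearize the data term around the current flow
            c, Ix, Iy = resample_coefficients(J0, J1, u1, u2)
            for _ in range(outer_iters):
                # data term step (Sect. 2.4.1): pixel-wise thresholding
                u1, u2 = threshold_optical_flow(u1, u2, Ix, Iy, c, lam)
                for _ in range(inner_iters):
                    # smoothness step (Sect. 2.4.2); other filters fit here too
                    u1 = tv_l2_denoise(u1, lam)
                    u2 = tv_l2_denoise(u2, lam)
        if L > 0:
            u1 = prolongate(u1, is_flow=True)
            u2 = prolongate(u2, is_flow=True)
    return u1, u2
```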
2.4.3.5 Numerical Scheme

The general numerical implementation for a specific setting of the Refinement Optical Flow framework is as follows. Beginning with the coarsest level, (2.11) and (2.19) are iteratively solved while the solution is propagated to the next finer level. The prior solution on each pyramid level is used to compute the coefficients of the linearized data terms on the corresponding pyramid levels. Hence, a warping step for the input images takes place every time the solution is propagated within the pyramid, and additional warps are used on each level to achieve more accurate results. For implementations on modern graphics processing units (GPUs), bi-linear warping is essentially available at no additional cost; therefore the concept of outer iterations is only used in CPU implementations. Avoiding poor local minima is not the only advantage of the coarse-to-fine approach. It turns out that the filling-in process induced by the regularization (smoothness) in texture-less regions is substantially accelerated by a hierarchical scheme as well. The resulting numerical scheme is summarized in Algorithm 2.1.

The next chapter tackles the problem of illumination artifacts, which were not addressed in the Refinement Optical Flow framework. Illumination problems such as shadows, color changes and reflections are a quite general problem. Using color images, a common way to tackle illumination artifacts is to change the color space representation (e.g. from RGB to HSV) and to work solely on the color itself instead of using illumination. The approach for gray value images needs to be somewhat more subtle.
Chapter 3
Residual Images and Optical Flow Results
Abstract In real-world motion estimation, a major source of degradation in the estimated correspondence field comes about through illumination effects. Two obvious solutions to this problem are to either explicitly model the physical effect of illumination or to pre-filter the images so as to only preserve the illumination-independent information. The first approach is quite sophisticated because it involves the generation of a physical model of the observed scene, including the reconstruction of geometry, illumination, material properties and explicitly modeling the light transport. For realistic outdoor scenes this is currently infeasible. In this chapter we therefore resort to the second approach and propose to use the concept of residuals, which is the difference between an image and a smoothed version of itself, also known as structure–texture decomposition. Experimental results confirm that using residual input images for optical flow improves the accuracy of flow field estimates for image sequences that exhibit illumination effects.
3.1 Increasing Robustness to Illumination Changes

The image data fidelity term states that the intensity values of $I_0(x)$ do not change during its motion to $I_1(x + u(x))$. For many sequences this constraint is violated due to sensor noise, illumination changes, reflections, and shadows. Hence, real scenes generally show artifacts that violate the optical flow constraint. Figure 3.1 shows an example, where the ground truth flow is used to register two images from the Middlebury optical flow benchmark data base [6]. Although the two images are registered as well as possible using the ground truth flow, the intensity difference image between the source image and the registered target image reveals the violations of the optical flow constraint. Some of these regions, showing artifacts of shadow and shading reflections, are marked by blue circles in the intensity difference image. A physical model of brightness changes was presented in [39], where brightness change and motion are estimated simultaneously; shading artifacts, however, have not been addressed. In [98] and [62] the authors used photometric invariants to cope with brightness changes, which requires color images. A common approach in the literature to tackle illumination changes is to use image gradients with, or instead of, the plain image intensity values in the data term [13]. The advantage of using the image gradient for motion estimation is that it is invariant to additive intensity
Fig. 3.1 The source and target images of the rubber-whale sequence in the Middlebury optical flow benchmark have been registered using the ground truth optical flow. Still, intensity value differences are visible due to sensor noise, reflections, and shadows. The intensity difference image is encoded from white (no intensity value difference) to black (10% intensity value difference). Pixels which are visible in a single image due to occlusion are shown in white
Fig. 3.2 The original image is decomposed into a structural part, corresponding to the main large objects in the image, and a textural part, containing fine-scale details. All images are scaled into the same intensity value range after decomposition
changes. The downside, however, is that multiple data fidelity terms have to be used. Moreover, since images are differentiated twice, the resulting motion estimates will be highly sensitive to noise. An alternative is to employ concepts of structure–texture decomposition pioneered by Meyer and others [3, 61, 95] in order to remove the intensity value artifacts due to shading reflections and shadows. The basic idea behind this splitting technique is that an image can be regarded as a composition of a structural part, corresponding to the main large objects in the image, and a textural part, containing fine-scale details. See Fig. 3.2 for an example of such a structure–texture decomposition, also known as cartoon–texture decomposition. Loosely speaking, structure–texture decomposition can be interpreted as a nonlinear variant of high-pass filtering. The expectation is that shadows show up only in the structural part which includes the main large objects. The structure–texture decomposition is accomplished using the total variation based image denoising model of Rudin, Osher and Fatemi [78]. For the intensity value image I(x), the structural part is given by the solution of

$\min_{I_S} \int_{\Omega} \Big\{ |\nabla I_S| + \frac{1}{2\theta}\,(I_S - I)^2 \Big\}\, dx .$   (3.1)
Fig. 3.3 Intensity difference images between the source image and the registered target image using ground truth optical flow for the original image pairs and their structure–texture decomposed versions (intensity coding as in Fig. 3.1). Note the presence of shading reflection and shadow artifacts in the original image and in the structure image
The textural part $I_T(x)$ is then computed as the difference between the original image and its denoised version, $I_T(x) = I(x) - I_S(x)$. Figure 3.3 shows the intensity difference images between the source image and the registered target image using the ground truth flow as look-up for both the original image and its decomposed parts. For the most part, the artifacts due to shadow and shading reflections show up in the original image and the structural part. The intensity value difference in the textural part, which contains fine-scale details, is noisier than the intensity value difference in the structural part. These intensity value differences are mainly due to sensor noise and sampling artifacts, while shadow and shading reflection artifacts have been almost completely removed. This is best visible in the area of the punched hole of the rotated D-shaped object. This observation leads to the assumption that the computation of optical flow using the textural part of the image is not perturbed by shadow and shading reflection artifacts, which cover large image regions. To test this assumption experimentally, a blended version of the textural part is used as input for the optical flow computation, $I_T(\alpha, x) = I(x) - \alpha I_S(x)$. Figure 3.4 shows the accuracy for optical flow computation using a fixed parameter set and varying the blending factor α. The plot reveals that for larger values of α the optical flow is about 50% more accurate than for small values of α. This confirms the assumption that removing large perturbations due to shadow and shading reflections yields better optical flow estimates. In the experiments, the image decomposition is computed as follows: The original source and target images are scaled into the range [−1, 1] before computing the structure part. Proposition 2.3 is then used to solve (3.1). A good choice of the parameters is α = 0.95, θ = 0.125 and 100 for the number of reprojection iterations.
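Using the TV-L2 denoiser sketched after Proposition 2.3, the decomposition can be written compactly; mapping the fidelity weight 1/(2θ) of (3.1) to λ = 1/θ in that sketch, and the rescaling step, are our own reading of the procedure above.

```python
import numpy as np

def structure_texture(I, alpha=0.95, theta=0.125, n_iter=100):
    """Blended structure-texture decomposition (Sect. 3.1).
    Returns the structural part I_S and the blended textural image
    I_T(alpha, x) = I(x) - alpha * I_S(x)."""
    # scale the input into [-1, 1] before computing the structural part
    I = I.astype(float)
    I = 2.0 * (I - I.min()) / max(I.max() - I.min(), 1e-9) - 1.0
    I_S = tv_l2_denoise(I, lam=1.0 / theta, n_iter=n_iter)   # structural part
    I_T = I - alpha * I_S                                     # blended texture
    return I_S, I_T
```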
Fig. 3.4 The plot shows the optical flow average end-point error (y-axis) using different α values for the blending of the textural part of the image. The optical flow accuracy gain for using blended images becomes visible
3.2 Quantitative Evaluation of the Refinement Optical Flow

In this section, specific settings of the Refinement Optical Flow framework, derived in Sect. 2.4, are quantitatively evaluated based on the Middlebury optical flow benchmark [6]. The benchmark provides a training data set where the ground truth optical flow is known and an evaluation set used for a comparison against other algorithms in the literature. A subset of the investigated data terms and smoothness terms in this section has also been uploaded for this comparison evaluation. For an online evaluation with other state-of-the-art algorithms we refer to the website; note that most top-ranked algorithms employ color information to compute the optical flow field, whereas the presented Refinement Optical Flow utilizes gray value images.

Table 3.1 demonstrates the accuracy of the optical flow field for different settings of the Refinement Optical Flow framework on the training data set. The table shows the average end-point error between the ground truth flow vectors and the estimated flow vectors for the data sets in the training set. Figures 3.5 and 3.6 show the obtained results for all eight evaluation sequences. For the most part, the remaining flow errors are due to occlusion artifacts. The table has three sections: a Performance section where the focus is set on real-time optical flow, a Smoothness Filters section where the different denoising algorithms presented in Sect. 2.4.2 are systematically evaluated, and an Accuracy section which demonstrates the accuracy gain using the additional adaptive fundamental prior term (see Chap. 8) and weighted total variation in the smoothing. The last row shows the average deviation from the epipolar lines, hence reflecting the amount of dynamics within the scene.

In Sect. 3.3, the TV-L1-improved version [107] of the Refinement Optical Flow algorithm is evaluated on real scenes, taken from a moving vehicle. The results demonstrate the performance of the optical flow algorithm under different illumination conditions and under large image motion. For visualization of the flow vectors the color-coding scheme shown in Fig. 2.1 is used.
3.2.1 Performance

The performance section compares real-time capable implementations for optical flow. Both the TV-L1 optical flow algorithm (equation (2.6)) and the image decomposition described in Sect. 3.1 employ the TV-L2 denoising algorithm. This denoising step can be efficiently implemented on modern graphics cards, at the cost of small accuracy losses.
Table 3.1 Evaluation results on the Middlebury training data. The evaluation is grouped into a real-time Performance section, a Smoothness Filter comparison section, and an Accuracy section, which demonstrates the systematic accuracy gain of the flow field using the adaptive fundamental matrix prior and gradient driven smoothing. The table shows the average end-point error of the estimated flow fields, apart from the last row, where $\text{rel-}\rho_F = \int_{\Omega} \rho_F(u, x)/|u|\, d^2x$ is the average relative epipolar line distance. Parameters have been carefully chosen for algorithm comparison (see text for parameters and run time)
Fig. 3.5 Optical flow results for the army, mequon, schefflera, wooden, and grove sequence of the Middlebury flow benchmark. The left images show the first input image and the ground truth flow. The middle image shows the optical flow using the proposed algorithm. The right image shows the end point error of the flow vector, where black corresponds to large errors
Fig. 3.6 Optical flow results (cont.) for the urban, yosemite, and teddy sequence of the Middlebury flow benchmark. The left images show the first input image and the ground truth flow. The middle image shows the optical flow using the proposed algorithm. The right image shows the end point error of the flow vector, where black corresponds to large errors
For parallel processing, the iterative TV-L2 denoising is executed on sub-blocks of the image in parallel, where boundary artifacts may occur. Hence, high accuracy is traded for run-time performance (see Table 3.2). In all three algorithm settings, P, P-MF, and P-IT-MF, the linearized optical flow constraint (2.2) is used as data term. In the plain version, algorithm P, five iterations of the TV-L2 denoising are used to smooth the flow field in every warping step. The number of refinement warps on every pyramid level was set to 25. The parameter settings are λ = 25 and θ = 0.2. Gray value look-up is bi-linear, as this can be done without additional costs on modern graphics cards. The image gradient is computed via central derivatives from the average of both input images. The P-MF algorithm extends the basic algorithm by an additional median filter step; hence five iterations of the TV-L2, followed by a median filter step, are performed for each warp. The median filter makes the whole scheme more robust against outliers.
Table 3.2 Run-time comparison for the Performance section in Table 3.1. Using the parallel power of a GPU yields performance gain at the cost of accuracy loss. The run time is measured on the Grove3 test image (640 × 480 px)

| Algorithm | Processor | Avg. accuracy | Run time |
| P (GPU) | NVidia® GeForce® GTX 285 | 0.486 | 0.039 [sec] |
| P | Intel® Core™2 Extreme 3.0 GHz | 0.468 | 0.720 [sec] |
| P-MF (GPU) | NVidia® GeForce® GTX 285 | 0.416 | 0.055 [sec] |
| P-MF | Intel® Core™2 Extreme 3.0 GHz | 0.375 | 0.915 [sec] |
| P-IT-MF (GPU) | NVidia® GeForce® GTX 285 | 0.408 | 0.061 [sec] |
| P-IT-MF | Intel® Core™2 Extreme 3.0 GHz | 0.361 | 1.288 [sec] |
For this reason the influence of the data term, weighted by λ, can be increased to λ = 50. All other parameters are kept fixed. In the third algorithm, P-IT-MF, the textural part of the image is used, as described in Sect. 3.1. Again the increase in accuracy at the cost of a longer execution time can be seen in the quantitative evaluation. It is interesting to note that only the flow fields for the real scenes within the test set benefit from the image decomposition. The optical flow for the rendered scenes, Grove and Urban, is actually worse. This is not surprising, as texture extraction removes some structure information in the images; such a procedure is only beneficial if the images contain illumination artifacts. Because this is the case for all natural scenes (which are for obvious reasons more interesting and challenging), the texture–structure decomposition is always performed in the remaining experiments.
3.2.2 Smoothness Filters

The smoothness filter section quantitatively evaluates different smoothness filters from Sect. 2.4.2. The parameter setting is carefully chosen to be fixed wherever possible, allowing one to compare the different smoothness techniques amongst each other. An essential difference to the performance section is the preparation of the input images. The two input images are decomposed into their structural and textural part as described in Sect. 3.1. Then, the texture input images are scaled into the range [−1, 1], yielding the same gray value range for every input test image. Note that consequently this scaling is performed before and after the decomposition of the images, because the gray value range of the texture image may be different from the range of the original image. Obviously such a procedure needs the minimum and maximum of an image and is not well suited for a real-time implementation on a graphics card. Throughout the experiments the parameter settings are λ = 30 (except for the median filter, where λ is given in the brackets) and θ = 0.25 (TV-L1 also θ = 1, see brackets). The number of refinement warps is kept fixed at 35 and bi-cubic look-up is
used for the interpolation method. The number of outer and inner iterations is the only varying parameter, as some algorithms need more inner iterations than others. Gradients are computed using the five-point gradient mask with a blending factor of β = 0.4 (see Sects. 2.4.3.4 and 2.4.3.2). The following listing examines the advantages and disadvantages of the single denoising filters (timings measured on the Grove3 test image). In summary, TV-L2 denoising yields the best results, while the median filter shows tremendous robustness against outliers. A combination of these two techniques is investigated in more detail in the third set of results, the Accuracy section.

TV-L2 (0.348 px avg. EPE, 2.17 sec run time) The TV-L2 is not only the fastest but also the most accurate smoothing technique out of all evaluated techniques. Although part of the run-time performance depends on the implementation, this fact is a major advantage of TV-L2 denoising. Furthermore, it is the simplest algorithm besides the (trivial) median filter. For the algorithm the number of outer iterations is five; in every outer iteration one inner iteration is performed. Disadvantages are the forming of regions with constant displacement (the so-called stair-casing effect) and the negative effect of outliers in the data term when increasing λ.

Felsberg Denoising (0.378 px avg. EPE, 5.94 sec run time) The numerical approximation scheme for anisotropic diffusion described by Felsberg in [32] is well suited to denoise the optical flow field. It has the advantage that the iterative diffusion process handles outliers robustly. However, parameter tuning is complicated and more investigation is necessary to formulate intelligent stopping criteria for the diffusion process. Here a fixed number of iterations was used: one outer and five inner iterations.

2nd Order Smoothing (0.364 px avg. EPE, 61.5 sec run time) For the second-order prior suggested in [95], 10 inner iterations are performed for each of the five outer iterations per warp. The large number of iterations yields poor run-time results. However, the resulting flow field is fairly accurate and smooth, with some ringing artifacts at object boundaries. The main disadvantage is the inherent numerical instability; coupling the second-order denoising with e.g. a median filter leads to diverging effects and hence wrong flow fields.

TV-L1 (0.413 px avg. EPE, 1.0 sec run time using θ = 0.25; θ = 1.0 yields an avg. EPE of 0.429 px) Being relatively stable (changing λ has little effect on the results), the TV-L1 has an inherent advantage over other denoising methods. Parameter settings are: five outer iterations with one inner iteration each. Although the method is robust w.r.t. outliers, it seems to over-smooth the results. This can be counteracted by increasing the influence of the data term, hence λ (with little effect), or by increasing θ and allowing a larger deviation of the data term from the smooth solution. The latter does yield better results for some test scenes, but the overall performance is still far from the TV-L2 denoising filter. Furthermore, computational time and memory consumption are larger due to the more complicated minimization process.
Table 3.3 Run times for the accuracy settings in Table 3.1

| Algorithm | Run time |
| TV-L1-imp. | 3.46 [sec] |
| F-TV-L1 | 7.13 [sec] |
| ∇I-TV-L1 | 5.23 [sec] |
| Adaptive | 8.87 [sec] |
| Adapt. comb. | 8.72 [sec] |
Median Filtering (0.494 px avg. EPE, 3.21 sec run time for 20 iterations; 2/20/200 iterations yield avg. EPEs of 0.553/0.494/0.533 px) The median filter is the simplest denoising method out of all in this evaluation. It is also the worst in terms of accuracy. The robustness, however, is tremendous. For different settings of λ the results do not change significantly. This is not surprising, as median filtering removes outliers independent of their distance to the current flow field. The run time is slow because the values within every 3 × 3 window need to be sorted prior to filtering. Median filtering does not produce interpolated values and therefore cannot exactly reconstruct inclines in the flow field (only values of neighboring pixels can be assigned), but it does add robustness over a large range of λ. In the experiments one median filter step and five outer iterations are used.
3.2.3 Accuracy

In this section the TV-L2 denoising with subsequent median filtering is carefully examined. More precisely, for all 35 warps in every outer iteration (five in total), one TV-L2 step followed by one median filter step is performed on the flow field. Parameter settings are as in the Smoothness Filter section, in particular λ = 30 and θ = 0.25. These settings describe the TV-L1-improved algorithm. Keeping all parameters fixed and adding an additional fundamental matrix prior data term (see Chap. 8) yields the F-TV-L1 algorithm. The ∇I-TV-L1 setting uses the structure-aware smoothness term. Last, Adaptive uses a combination of both the adaptive fundamental matrix prior and structure weighting. The Adaptive combined approach further uses the length of the sum of all four respective dual variables of the optical flow field for the re-projection step. Evidently the non-adaptive fundamental matrix prior increases accuracy in static scenes but worsens the results if the scene is dynamic. The structure-aware regularization does improve the optical flow accuracy on most test examples, but only the combined adaptive approach yields top performing results. Run times in Table 3.3 are given for optical flow estimation on the Grove3 test image (640 × 480 px).

The next section evaluates the algorithm qualitatively on real sequences from driver assistance. For the evaluation all optical flow parameters are kept fixed and only λ is changed to 60. This is because the gray value range of the images is 16-bit; hence a greater importance is given to the data term.
Fig. 3.7 Optical flow computation with illumination changes. Due to illumination changes, the optical flow constraint in the two input images (upper images) is violated and flow computation using plain gray value intensities fails
Fig. 3.8 Optical flow computation with illumination changes. Using the structure–texture decomposed images, a valid flow estimation is now possible (compare with Fig. 3.7)
3.3 Results for Traffic Scenes

The test images in this section are taken from a moving vehicle where the camera monitors the road ahead. The optical flow estimation is evaluated for different scenarios under different illumination conditions (night, day, shadow). The first experiment in Fig. 3.7 shows the optical flow computation on an image sequence with a person running from the right into the driving corridor. Due to illumination changes in the image (compare the sky region, for example) and severe vignetting artifacts (the image intensity decreases circularly from the image center), flow computation using plain gray value intensities fails. Using the proposed structure–texture decomposition, a valid flow estimation is still possible. See Fig. 3.8 for the same two frames where optical flow is computed on the structure–texture decomposed images. Note the reflection of the moving person on the engine hood, which is only visible in the structure–texture decomposed images. What is more important, artifacts due to vignetting and illumination change are not visible in the structure–texture decomposed images. This demonstrates the increase in robustness for the optical flow computation under illumination changes using the proposed decomposition of the input images. Figure 3.9 shows the same scene a few frames later using the additional fundamental matrix data term. Most of the image, except for the running person, is static and the expanding flow field should follow the epipolar rays. The results show that, except for the Mercedes star and the running person, this assumption holds. It is not in the scope of this chapter to segment moving objects.
Fig. 3.9 Optical flow for a scene with a running person and a moving camera, installed in a vehicle. The distance to the epipolar rays (from green to red as in small to large) encodes independently moving objects
Fig. 3.10 Computation of optical flow for a night scene. The left image is the original intensity images. The middle image is the structure–texture decomposed images used for optical flow computation. The optical flow result is shown on the right, where flow vectors with length above 10 px are saturated in the color coding
The results, however, are well suited to detect independently moving regions of the image. There are more constraints available than just the distance-to-epipolar-rays constraint to detect moving objects (see [99]); segmentation approaches will be discussed in Chap. 6. Another example to illustrate the robustness of the structure–texture decomposition is shown in Fig. 3.10. Here a scene at night with reflections on the ground plane is used. In the intensity images the scene is very dark and not much structure is visible. The structure–texture decomposed images reveal much more about the scene. Note that this information is also included in the intensity image, but most structure in the original images is visible in the cloud region. The figure shows the optical flow using the decomposed images. Note the correct flow estimation of the street light on the left side. The next example in Fig. 3.11 demonstrates the accurate optical flow computation for large displacements. It shows a scene with shadows on the road. Clearly, the structure–texture decomposed images reveal the structure on the road surface better than the original intensity images (the shadows are still noticeable because a blended version of structure and texture is used, as presented in Sect. 3.1). Different scales for the optical flow color scheme are used to demonstrate the accuracy of the optical flow algorithm. Although nothing about epipolar geometry is used in the flow algorithm, the effect of expansion (and hence depth) corresponding to flow length becomes visible. Note that the optical flow for the reflection posts is correctly estimated even for flow lengths above 8 px. Optical flow is correctly estimated for the road surface up to 30 px. The engine hood acts like a mirror; this has no negative
Fig. 3.11 Optical flow field for the scene depicted in the upper left with the original and structure–texture image. The flow is saturated for flow vector length above 1, 2, 3, 4, 5, 6, 8, 10, 15, 20, 25, 30 pixels from left to right
Fig. 3.12 The scene shows the computation of optical flow with large displacement vectors. The original input images are shown on the left. The middle images are the blended structure–texture images. Flow vectors above 20 px are color-saturated in the optical flow color image
impact on the estimation of the optical flow for the road surface. Note the accurate flow discontinuity boundary along the engine hood and in the tree regions. In Fig. 3.12 the image is taken while driving below a bridge on a country road. Note that the shadow edge of the bridge is visible in the original images but has almost disappeared in the decomposed images. The large flow vectors on the reflector post are correctly matched. The figure also illustrates the limits of the presented Refinement Optical Flow approach. In the vicinity of the car the optical flow is perturbed due to missing texture on the road surface. Due to the dependency on the linearized optical flow constraint, optical flow in texture-less regions and in regions with very large displacements is still not satisfactory, highlighting the need for further research.
3.4 Conclusion

In the last two chapters we presented an optical flow framework based on the idea of iterative data term evaluation and denoising to derive a highly accurate optical flow field. In the data term step, the brightness constancy constraint and a fundamental matrix constraint were evaluated and a flow field in the vicinity of a given flow field was derived. Future research may include additional constraints or extensions of the proposed method, e.g. the use of color images. The fundamental matrix constraint can be used to segment the image into regions of different motions, where a fundamental matrix may be estimated for each region independently (as in [93, 104]). A promising idea which has been pursued in the literature lately is the use of high-level information to steer the optical flow field. The brightness constancy constraint has the inherent weakness that it only allows one to recover relatively small flow vectors. In [15] variational approaches and descriptor matching are combined for optical flow computation. Such descriptor matching methods do not suffer from local minima in the brightness constancy assumption. In Sect. 2.2.1 a Census based
optical flow method was presented, which does yield optical flow via descriptor matching [85]. An interesting research idea is to include flow vectors from the Census based method in the data term evaluation step. A promising investigation topic is statistical learning for optical flow regularization [90]. So far such approaches have not been able to outperform the more naive approaches. Of course it is hard to say why this is the case; one reason may be that the challenge of learning “typical” flow patterns may not be feasible, given that different image structures, unknown object deformations, and camera motions may give rise to a multitude of motion patterns with little resemblance between motion fields from different videos. Including such an approach as a data term in the presented Refinement Optical Flow framework and a careful comparison of this method with other regularizers may provide answers. Occlusion artifacts and initialization with previously computed flow fields (on the same sequence) have not been addressed in this chapter. Research on both topics has the potential to substantially improve the quality of the optical flow estimation.
Chapter 4
Scene Flow
Abstract Building upon optical flow and recent developments in stereo matching, we discuss in this chapter how the motion of points in the three-dimensional world can be derived from stereo image sequences. The proposed algorithm takes into account image pairs from two consecutive time steps and computes both the depth and a 3D motion vector associated with each point in the image. We particularly investigate a decoupled approach of depth estimation and variational motion estimation, which has some practical advantages. The variational formulation is quite flexible and can handle both sparse and dense disparity maps. With the depth map being computed on an FPGA, and the scene flow computed on the GPU, the scene flow algorithm currently runs at frame rates of 20 frames per second on QVGA images (320 × 240 pixels).
The speed of light is too fast for light to be captured.
4.1 Visual Kinesthesia

One of the most important features to extract in image sequences from a dynamic environment is the motion of points within the scene. Humans perform this using a process called visual kinesthesia, which encompasses both the perception of movement of objects in the scene and the observer's own movement. Perceiving this using
computer vision-based methods proves to be difficult. Using images from a single camera, only the perceived two-dimensional motion can be estimated from sequential images using optical flow estimation techniques described in Chap. 2. Up to estimation errors and some well-known ambiguities (aperture problem), the optical flow corresponds to the three-dimensional scene motion projected onto the image plane. During this projection process, one dimension is lost and cannot be recovered without additional constraints or information. Hence, 3D motion computation using a single camera is a highly ill-posed problem that requires additional information. There are ways to recover the depth information from calibrated monocular video in a static scene with a moving observer up to a similarity transform. However, this requires a translating motion of the observer. The process becomes even more complex when there are independently moving objects in the scene. The estimation of the motion then has to be combined with a separation of independent motions—a chicken-and-egg problem which is susceptible to noise and/or suboptimal local solutions [28, 46, 76, 115]. Once a stereo camera system is available, the task becomes better constrained and feasible in practice. The distance estimate from the triangulation of stereo correspondences provides vital additional information to reconstruct the three-dimensional scene motion. Then, ambiguities only arise
1. if the camera motion is not known. In particular if the camera undergoes an unknown motion, only the motion relative to the camera can be estimated;
2. when points are occluded;
3. around areas with missing structure in a local neighborhood. This lack of textural information leads to the aperture problem, see also Fig. 2.3.

All three ambiguities are quite natural and affect the human perception as well. For instance, human perception cannot distinguish whether the earth is rotating or the stars circulate around us—a fact that has nurtured many religious discussions. Similarly if a human observes a scene of uniform color he or she will hardly detect motion. This third ambiguity is well-known in both disparity estimation and optical flow estimation (see Chap. 2). Common ways to deal with locally missing textural information and to resolve these ambiguities of motion estimation are to perform flow estimation for spatial neighborhoods of sufficient size, to use variational approaches that incorporate a smoothness prior, or to iteratively refine and smooth a prior flow field. In this sense, when we speak of dense estimates we mean that for each 3D point that is seen in both cameras, we have an estimate of its motion. See Fig. 4.1 for an example where the motion of a preceding vehicle becomes visible in the 3D scene flow. Figure 4.6 is yet another example where the movement of a running person becomes visible in the 3D scene flow. In summary, scene flow is the three-dimensional analog of the optical flow presented in Chap. 2. In Sect. 4.1.1 we review a number of advances in scene flow estimation. The decoupled approach to scene flow estimation from two consecutive stereo image pairs is then presented in Sect. 4.2. In Sect. 4.2.4 we present a framework to compare scene flow algorithms on synthetic sequences.
Fig. 4.1 Scene flow example. Despite similar distance from the viewer, the moving car (red) can be clearly distinguished from the parked vehicles (green)
4.1.1 Related Work

Scene flow estimation can be achieved by estimating all parameters in a combined approach: the optical flow in the left and right images (using consecutive frames) while enforcing the stereo disparity constraints in the two stereo image pairs. Besides optical flow estimation, this involves an additional disparity estimation problem (the disparity is also needed to calculate the absolute 3D position, as seen from the camera), as well as the task of estimating the change of disparity over time. The work in [68] introduced scene flow as a joint motion and disparity estimation method (coupled approach). The succeeding works in [43, 63, 121] presented energy minimization frameworks including regularization constraints to provide dense scene flow. Other dense scene flow algorithms have been presented for multiple-camera set-ups [72, 103]. However, these latter approaches allow for non-consistent flow fields in single image pairs.

Since scene flow estimation solves the challenging problem of estimating both geometry and 3D motion, it entails a heavy computational effort. At the same time, many practical applications, such as the driver assistance that motivated the work in this book, require that accurate scene flow is computed fast; otherwise it is of little use in this context. Unfortunately, none of the above approaches runs in real time (see Table 4.1); the best of them require computation times on the order of minutes. The work in [44] presents a probabilistic scene flow algorithm with computation times in the range of seconds, but it yields only integer pixel-accurate (not sub-pixel) results. A discrete disparity flow (i.e., scene flow) algorithm is presented in [36], which runs in the range of 1–2 seconds on QVGA (320 × 240 pixel) images. Real-time sub-pixel-accurate scene flow algorithms, such as the one presented in [74], provide only sparse results both for the disparity and the displacement estimates.
4.1.2 A Decoupled Approach for Scene Flow

Combining disparity estimation and motion estimation into one framework has been the common approach for scene flow computation (e.g., [43, 68, 121]).
Table 4.1 Scene flow algorithms with their running times, density, and range tested

Algorithm                                 | # cameras | Dense approach (yes/no) | Close/far  | Running time
"Joint motion ..." [68]                   | 2         | Yes                     | Close      | ?
"Decouple: image segmentation ..." [121]  | 2         | Yes                     | Close      | ?
Three-dimensional scene flow [103]        | 17        | Yes                     | Close      | ?
6D Vision [74]                            | 2         | No                      | Both       | 40 ms
"Dense motion ..." [44]                   | 2         | Yes                     | Very close | 5 s
"Multi view reconstruction ..." [72]      | 30        | Yes                     | Close      | 10 min
Huguet–Devernay [43]                      | 2         | Yes                     | Both       | 5 hours
"Disparity flow ..." [35]                 | 3         | Yes                     | Close      | 80 ms
Decoupled (this paper)                    | 2         | Yes                     | Both       | 50 ms
In this chapter, the motion estimation is decoupled from the position estimation, while still maintaining a soft disparity constraint. The decoupling of depth (disparity) and motion (optical flow and disparity change) estimation might look unfavorable at first glance, but it has two important advantages.

Firstly, splitting scene flow computation into the two estimation sub-problems, disparity and optical flow with disparity change, allows one to choose the optimal technique for each of these tasks. This is particularly useful as the two challenges of disparity estimation and motion estimation are computationally quite different. With disparity estimation, thanks to the epipolar constraint, only a scalar field needs to be estimated. This enables the use of global optimization methods, such as dynamic programming, graph cuts, or convex relaxation [71], to establish point correspondences. Optical flow estimation, on the other hand, requires the estimation of a vector field without ordered labels. In this setting, global optimization in polynomial time is not available, and recent efforts to convexify the problem [34] are computationally well outside the scope of real-time performance. Another important difference is that motion vectors tend to be smaller in magnitude than disparities. This is valid for most applications and can be ensured for all applications by minimizing the time delay between the images. Thus sub-pixel accuracy, as provided by variational methods, is more important for motion estimation, whereas occlusion handling is more critical in disparity estimation.

Secondly, the two sub-problems can be solved more efficiently than the joint problem. This allows for real-time computation of scene flow on the GPU, with a frame rate of 20 Hz on QVGA images (320 × 240 pixel), assuming the disparity map is given (or computed in hardware). On the CPU, a frame rate of 5 Hz is achieved. As a consequence, the splitting approach to scene flow is about 500 times faster compared to recent techniques for joint scene flow computation.

Of course, decoupling the estimation of disparity and of 3D motion is a delicate issue, since the two problems are obviously coupled. A proper decoupling scheme therefore needs to take this coupling into account and ensure that the computed optical flow is consistent with the computed disparities.
How does the decoupling affect the quality of the results? In the decoupling strategy we propose, the motion field estimation takes into account the estimated disparities, but the disparity computation does not benefit from the computed motion fields. In coupled approaches like [43], all variables are optimized at the same time. However, since variational optimization is a local approach it is likely to run into local minima, especially when there are many coupled variables. Thus even though a coupled energy is a more faithful formulation of the problem, globally optimal solutions will not be computable. In contrast, our disparity estimates are based on a (semi-) global optimization technique. Although the disparities are not refined later in the variational approach anymore, they are more likely to be correct than in a coupled variational setting. This explains why the estimation accuracy of our decoupled approach actually turns out to compare favorably to that of joint estimation methods.
4.2 Formulation and Solving of the Constraint Equations

In this section we derive the formulation of the scene flow algorithm. We elaborate how the decoupling of position and motion estimation is put to use effectively, while still maintaining a stereo disparity constraint. The overall approach is outlined in Fig. 4.2. As seen from this figure, stereo image pairs are needed for both the previous and the current time frame. The derived approach also requires the disparity map of the previous time frame. This information is passed to the scene flow algorithm for processing, resulting in the optical flow and in the change in disparity between the image pairs.
4.2.1 Stereo Computation

The presented scene flow algorithm requires a pre-computed disparity map. We assume that a disparity d := d(x, y, t) is calculated for every pixel position [x, y] at every time frame t. Current state-of-the-art algorithms (e.g., see Middlebury [80]) require normal stereo epipolar geometry, such that the pixel rows y of the left and right images coincide and the principal points [x_0, y_0] in the two images have the same coordinates. This is achieved by a so-called rectification process, given the fundamental matrix of the stereo camera [38]. A world point [X, Y, Z] (lateral, vertical, and depth, respectively) is projected into the camera images, yielding [x, y] in the left image and [x + d, y] in the right image, according to

\begin{pmatrix} x \\ y \\ d \end{pmatrix} = \frac{1}{Z} \begin{pmatrix} X f_x \\ -Y f_y \\ b f_x \end{pmatrix} + \begin{pmatrix} x_0 \\ y_0 \\ 0 \end{pmatrix}   (4.1)
with the focal lengths f_x and f_y (in pixels) for the x and y direction, the principal point [x_0, y_0] of the stereo camera system, and the baseline distance b between the two camera projection centres (in meters).
Fig. 4.2 Outline of the scene flow algorithm. The input images (blue) are processed by the stereo and scene flow algorithms; the disparity and flow data (green) represent the resulting scene flow information
The disparity value d encodes the difference in the x-coordinate of an image correspondence between the left and the right image. With known intrinsic camera parameters (from calibration), the position of a world point [X, Y, Z] can be recovered from an [x, y, d] measurement using (4.1). The goal of the stereo correspondence algorithm is to estimate the disparity d between the left and the right image for every non-occluded pixel. This is accomplished by local methods (using a small matching window from the left to the right image) or global methods (incorporating global energy minimization). The disparity information can then be used to reconstruct the 3D scene.

The scene flow algorithm presented in this chapter flexibly incorporates input from any stereo algorithm. Dense or sparse disparity maps are handled effectively due to the variational nature of the approach. To demonstrate this, different disparity estimation techniques are used in this chapter and evaluated in Sect. 4.2.4. Hierarchical correlation algorithms, yielding sparse sub-pixel-accurate disparity maps, are commonly employed due to their real-time capability. A typical implementation described in [33] is used, which allows for disparity computation at about 100 Hz. In addition, an implementation based on the algorithm described in [85] is used. It computes sparse, pixel-discrete disparity maps and is available in hardware without additional computational cost. Using globally consistent energy minimization techniques, it becomes possible to compute dense disparity maps, which yield a disparity value for every non-occluded pixel. The Semi-Global Matching algorithm (SGM) with a mutual information data term is used here [40]. The algorithm is implemented on dedicated hardware and runs at 25 Hz on images with a resolution of 640 × 480 pixels.

The remaining part of this section presents the data terms that make up the scene flow energy functional, together with solution strategies. Section 4.2.4 demonstrates results with the aforementioned sparse and dense stereo algorithms.
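As a concrete illustration of the back-projection implied by (4.1), the following sketch (not part of the book's implementation; function and variable names are chosen freely here) recovers the 3D position of a rectified pixel from its coordinates and disparity:

```python
import numpy as np

def backproject(x, y, d, fx, fy, x0, y0, b):
    """Invert the projection (4.1): recover [X, Y, Z] from a rectified pixel
    [x, y] with disparity d. Assumes d > 0 (finite depth, non-occluded)."""
    Z = fx * b / d              # depth from disparity
    X = (x - x0) * Z / fx       # lateral position
    Y = -(y - y0) * Z / fy      # vertical position (sign convention of (4.1))
    return np.array([X, Y, Z])
```

With f_y = f_x, the same inversion is used again in Sect. 4.3 to reconstruct the start and end points of the scene flow.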
Fig. 4.3 Motion and disparity constraints employed in the scene flow algorithm. The left images show two stereo image pairs with a corresponding point visible in every single image. The schematic images on the right illustrate the mathematical relations of corresponding image points
4.2.2 Scene Flow Motion Constraints

The data dependencies exploited in the scene flow algorithm are shown in Fig. 4.3. We use two consecutive pairs of stereo images at time t − 1 and t. The scene flow field [u, v, d'] is an extension of the optical flow field [u(x, y, t), v(x, y, t)] (flow in the x and y direction, respectively) by an additional component d'(x, y, t) that constitutes the disparity change. The three-dimensional velocity field can only be reconstructed when both the image data (x, y, d) and their temporal change (u, v, d') are known. The disparity d is estimated using an arbitrary stereo algorithm, see Sect. 4.2.1. The disparity change and the two-dimensional optical flow field have to be estimated from the stereo image pairs.

For all the equations derived for scene flow, we employ the usual optical flow intensity consistency assumption, i.e., the intensity of the same world point should be the same in all images in which it is visible. We extend this assumption to couple the four images involved in the scene flow calculation. The first equation is derived from the left half of Fig. 4.3. Let I(x, y, t)^L be the intensity value of the left image at pixel position [x, y] and time t. In the constraints, we omit the implicit dependency on (x, y, t) for u, v, d, and d'. This leads to the following constraint, which we call the left flow constraint:

I(x, y, t-1)^L = I(x + u, y + v, t)^L.   (4.2)
The flow in the right-hand image can be derived using the same principle. Let I(x, y, t)^R be the intensity of the right image at pixel position [x, y] and time t. Due to rectification, corresponding pixel positions in the left and right image have the same y component; they differ only in the x component, by the disparity d. The same is true for the flow vectors; they differ only in the x component. This leads to the right flow constraint:

I(x + d, y, t-1)^R = I(x + d + \underbrace{d' + u}_{u^R}, y + v, t)^R,   (4.3)

highlighting that the position in the x component is offset by the disparity d and that the flow differs only by the disparity change d'.
Fig. 4.4 The scene flow equations from (4.5) are summarized in the schema. The stereo disparity map at time t − 1 is given and the three constraint equations ELF , ERF , and EDF demonstrate how the images are coupled
If the optical flow fields of the left and the right image were estimated independently, the disparity change could be calculated directly as the difference u^R − u. However, to estimate the disparity change more accurately, consistency of the left and right image at time t is enforced. More precisely, the gray values of corresponding pixels in the stereo image pair at time t should be equal, as illustrated in the bottom half of the diagram in Fig. 4.3. This yields the third constraint, the disparity flow constraint:

I(x + u, y + v, t)^L = I(x + d + d' + u, y + v, t)^R.   (4.4)
Figure 4.4 summarizes the above equations together with the world flow calculated from the scene flow estimates. Rearranging the above equations results in

E_{LF} := I(x + u, y + v, t)^L − I(x, y, t−1)^L \overset{!}{=} 0,
E_{RF} := I(x + d + d' + u, y + v, t)^R − I(x + d, y, t−1)^R \overset{!}{=} 0,      (4.5)
E_{DF} := I(x + d + d' + u, y + v, t)^R − I(x + u, y + v, t)^L \overset{!}{=} 0.

Occlusion Handling  Occlusion handling is an important aspect in scene flow estimation. It comes along with increased computational costs, but clearly improves results. Regarding the influence of occlusion handling on disparity and velocity estimation, the magnitude of the disparity is generally much larger than the magnitude of the optical flow and the disparity change, which is only a few pixels. Accordingly, occlusion handling is much more decisive for disparity estimation than for motion estimation. From a practical point of view, the effort spent on occlusion handling should be weighed against the accuracy gained in the scene flow result. Hence, occlusion handling is not explicitly modeled for the scene flow. Occluded areas identified by the stereo estimation algorithm (using left–right consistency checks) and areas with no disparity information are simply discarded. More precisely, for pixels with no valid disparity value, (4.3) and (4.4) are not evaluated. This procedure implicitly enables the use of sparse disparity maps.
4.2.3 Solving the Scene Flow Equations

The constraints (or gray value constancy assumptions) in (4.5) are used as data terms to solve the scene flow problem. Similarly to the optical flow case, the linearized versions of the constraint equations yield three data terms,

p_{LF} := I_t^L(x, y) + I_x^L(x, y)\,u + I_y^L(x, y)\,v,
p_{RF} := occ(x, y) \bigl[ I_t^R(x_d, y) + I_x^R(x_d, y)(u + d') + I_y^R(x_d, y)\,v \bigr],        (4.6)
p_{DF} := occ(x, y) \bigl[ I^R(x_d, y) − I^L(x, y) + \bigl(I_x^R(x_d, y) − I_x^L(x, y)\bigr)(u + d') + \bigl(I_y^R(x_d, y) − I_y^L(x, y)\bigr)\,v \bigr],

where I_t, I_x, and I_y denote the partial derivatives of the image function (compare with (2.2)), and x_d = x + d is the horizontal image position in the right image corresponding to a pixel at position x in the left image. The occlusion flag occ(x, y) returns 0 if no disparity is known at (x, y) (due to occlusion or a sparse stereo method), and 1 otherwise. If a pixel is occluded, only the first data term, p_{LF}, is evaluated; otherwise all three data terms contribute to the solution.

The above data terms can be used to compute the scene flow using the Refinement Optical Flow framework presented in Sect. 2.4 with the three-dimensional flow vector (u, v, d')^T. Alternatively, Chap. 9 explains how the approach of Brox et al. [13] can be adapted to the scene flow case, embedding the above formulas into a variational framework that is solved by fixed point iterations.
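For illustration, the three linearized data terms can be evaluated on the whole pixel grid with a few lines of array code. The sketch below is not the book's implementation; the derivative filters, the nearest-pixel lookup at x_d, and all names are simplifying assumptions made here.

```python
import numpy as np

def scene_flow_data_terms(IL_prev, IL, IR_prev, IR, d, u, v, dp, occ):
    """Evaluate p_LF, p_RF, p_DF from (4.6) for every pixel.
    IL_prev/IL, IR_prev/IR : left/right images at t-1 and t (float arrays)
    d : disparity at t-1;  u, v : optical flow;  dp : disparity change d'
    occ : 1 where a valid disparity exists, 0 otherwise."""
    h, w = IL.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xd = np.clip(np.rint(xs + d).astype(int), 0, w - 1)  # x_d = x + d (nearest pixel)

    # spatial derivatives (central differences) and temporal differences
    IyL, IxL = np.gradient(IL)
    IyR, IxR = np.gradient(IR)
    ItL = IL - IL_prev
    ItR = IR - IR_prev

    p_LF = ItL + IxL * u + IyL * v
    p_RF = occ * (ItR[ys, xd] + IxR[ys, xd] * (u + dp) + IyR[ys, xd] * v)
    p_DF = occ * ((IR[ys, xd] - IL)
                  + (IxR[ys, xd] - IxL) * (u + dp)
                  + (IyR[ys, xd] - IyL) * v)
    return p_LF, p_RF, p_DF
```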
4.2.4 Evaluation with Different Stereo Inputs

This section presents the original study in [108] of the scene flow algorithm. Different stereo methods for the disparity input are compared, together with the full (coupled) scene flow estimation approach presented in [43]. To assess the quality of the scene flow algorithm, it was tested on the synthetic rotating sphere sequence from [43], for which the ground truth is known. The sphere is depicted in Fig. 4.5. In this sequence the spotty sphere rotates around its y-axis to the left, while the two hemispheres of the sphere rotate in opposing vertical directions (the authors thank Huguet and Devernay for providing their sphere scene). The resolution is 512 × 512 pixels.

The scene flow method was tested together with four different stereo algorithms: semi-global matching (SGM [41]), SGM with hole filling (which favors smaller disparities in occluded areas), correlation pyramid stereo [33], and an integer-accurate census-based stereo algorithm [85]. The ground truth disparity was also used for comparison, i.e., the ground truth served as the input disparity for our algorithm. For each stereo algorithm, the absolute angular error (AAE) was calculated as used in [43],

AAE_{u,v} = \frac{1}{n} \sum_{\Omega} \arctan\!\left( \frac{u v^{*} - u^{*} v}{u u^{*} + v v^{*}} \right),   (4.7)
Fig. 4.5 Ground truth test on the rotating sphere sequence with quantitative results in Table 4.2. Top: The left image shows the movement of the sphere. Color encodes the direction of the optical flow (key in bottom right), intensity its magnitude. Disparity change is encoded from black (increasing) to white (decreasing). Bright parts of the RMS figure indicate high RMSu,v,d error values of the computed scene flow using the SGM stereo method. Bottom: disparity images are color encoded green to orange (low to high). Black areas indicate missing disparity estimates or occluded areas
and root mean square (RMS) errors were calculated as evaluation measures for the disparity, the optical flow, and the scene flow, where a superscript * denotes the ground truth solution and n is the number of pixels:

RMS_{d} = \sqrt{ \frac{1}{n} \sum_{\Omega} \bigl( occ(d) - d^{*} \bigr)^{2} },            (4.8)
RMS_{u,v} = \sqrt{ \frac{1}{n} \sum_{\Omega} \bigl\| (u, v)^{T} - (u^{*}, v^{*})^{T} \bigr\|^{2} },     (4.9)
RMS_{u,v,d'} = \sqrt{ \frac{1}{n} \sum_{\Omega} \bigl\| (u, v, d')^{T} - (u^{*}, v^{*}, d'^{*})^{T} \bigr\|^{2} }.   (4.10)
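Given ground truth fields, the error measures (4.7), (4.9), and (4.10) are straightforward to evaluate; a small sketch (illustrative only, with hypothetical array names) is:

```python
import numpy as np

def sphere_errors(u, v, dp, u_gt, v_gt, dp_gt, mask):
    """AAE (4.7) and RMS errors (4.9), (4.10) over the evaluation domain given
    by the boolean array 'mask' (e.g. non-occluded sphere pixels).
    RMS_d from (4.8) is analogous and omitted here."""
    n = mask.sum()
    ang = np.arctan((u * v_gt - u_gt * v) /
                    (u * u_gt + v * v_gt))          # assumes the denominator != 0
    aae = np.degrees(np.abs(ang[mask])).sum() / n   # absolute value of the signed angle
    rms_uv = np.sqrt((((u - u_gt)**2 + (v - v_gt)**2)[mask]).sum() / n)
    rms_uvd = np.sqrt((((u - u_gt)**2 + (v - v_gt)**2
                        + (dp - dp_gt)**2)[mask]).sum() / n)
    return aae, rms_uv, rms_uvd
```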
The errors were calculated using two different image domains Ω: firstly, over all non-occluded areas, and secondly, over the whole input image. As in [43], pixels from the stationary background were not included in the statistics. The resulting summary can be seen in Table 4.2. The presented method achieves slightly lower errors than the combined variational method of Huguet and Devernay, even when using sparse correlation stereo. The lower error is partly due to the sparseness of the disparity, since problematic regions such as occlusions are not included in the computation and can therefore not corrupt the estimates. Thanks to the variational formulation of the scene flow, information is filled in from the more reliable surrounding estimates where the data terms are disabled, such that a dense scene flow is still obtained. In particular, the RMS error of the scene flow is much smaller, and our method is still considerably faster (see Table 4.1). In this sense SGM seems to do a good job at avoiding occluded regions.
Table 4.2 Root mean square (pixels) and average angular error (degrees) for the scene flow of the rotating sphere sequence. Various stereo algorithms are used as input for the scene flow estimation. Ranking is done according to the RMS_{u,v,d'} error in non-occluded areas

Stereo algorithm   | RMS_d (density) | Without occluded areas              | With occluded areas
                   |                 | RMS_{u,v}  RMS_{u,v,d'}  AAE_{u,v}  | RMS_{u,v}  RMS_{u,v,d'}  AAE_{u,v}
Ground truth       | 0.0 (100%)      | 0.31       0.56          0.91       | 0.65       2.40          1.40
SGM [40]           | 2.9 (87%)       | 0.34       0.63          1.04       | 0.66       2.45          1.50
Correlation [33]   | 2.6 (43%)       | 0.33       0.73          1.02       | 0.65       2.50          1.52
Fill-SGM           | 10.9 (100%)     | 0.45       0.76          1.99       | 0.77       2.55          2.76
Hug.-Dev. [43]     | 3.8 (100%)      | 0.37       0.83          1.24       | 0.69       2.51          1.75
Census-based [85]  | 7.8 (16%)       | 0.32       1.14          1.01       | 0.65       2.68          1.43
The joint approach in [43] is bound to the variational setting, which usually does not perform well for disparity estimation. Moreover, the table shows that SGM with hole filling yields inferior results to the other stereo methods. This is due to false disparity measurements in the occluded area. It is better to feed the sparse measurements of SGM to the variational framework, which yields dense estimates as well, but with higher accuracy. SGM was chosen as the best method and is used in the remainder of the results section; it is available on dedicated hardware without any extra computational cost.
4.3 From Image Scene Flow to 3D World Scene Flow

We have now derived the image scene flow as a combined estimate of optical flow and disparity change. Using this information, we can compute the two world points that define the start and the end point of the 3D scene flow. The following equations are derived from the inverse of (4.1) (f_x, f_y, and b are defined there as well). For simplicity we make use of the assumption f_y = f_x. We have

X_{t-1} = (x - x_0)\,\frac{b}{d},   Y_{t-1} = (y - y_0)\,\frac{b}{d},   Z_{t-1} = \frac{f_x b}{d}

and

X_t = (x + u - x_0)\,\frac{b}{d + d'},   Y_t = (y + v - y_0)\,\frac{b}{d + d'},   Z_t = \frac{f_x b}{d + d'}.

This yields for the translation vector

\begin{pmatrix} \dot X \\ \dot Y \\ \dot Z \end{pmatrix} = \begin{pmatrix} X_t - X_{t-1} \\ Y_t - Y_{t-1} \\ Z_t - Z_{t-1} \end{pmatrix} = b \begin{pmatrix} \frac{x - x_0}{d} - \frac{x + u - x_0}{d + d'} \\ \frac{y - y_0}{d} - \frac{y + v - y_0}{d + d'} \\ \frac{f_x}{d} - \frac{f_x}{d + d'} \end{pmatrix},   (4.11)

expressed in the camera coordinate system at time t − 1. Figure 4.7 shows an example of the simplest case, where the camera is stationary.
Fig. 4.6 The image shows the world flow result, with color encoding the velocity (the color fades from green to red as points go from stationary to moving), corresponding to the input images from Fig. 4.4. Note the correct motion estimation of the feet as the person is running forward
The scene flow reconstruction is very good, with almost no outliers. Figures 4.6, 4.8, 4.9, and 4.10 show scene flow results obtained from a moving camera platform. In order to detect moving objects and to calculate their absolute motion, the motion of the camera has to be compensated. The ego-motion of the camera is known from ego-motion estimation [4], using inertial sensors for the initial guess, and is compensated in the depicted results. We denote the camera rotation by R and the camera translation by T. This yields the three-dimensional residual translation (or motion) vector

M = R\,\frac{b}{d} \begin{pmatrix} x - x_0 \\ y - y_0 \\ f \end{pmatrix} - \frac{b}{d + d'} \begin{pmatrix} x + u - x_0 \\ y + v - y_0 \\ f \end{pmatrix} + T.   (4.12)

Figure 4.6 shows results from a scene where a person runs from behind a parked vehicle, corresponding to the input images from Fig. 4.4. The ego-vehicle is driving forward at 30 km/h and turning to the left. The measurements on the ground plane and in the background are not shown in order to focus visual attention on the person. The results show that points on the parked vehicle are estimated as stationary, whereas points on the person are registered as moving. The accuracy of the motion estimates can be observed particularly well at the person's legs, where the different velocities of the two legs are clearly recovered.

Figure 4.8 shows multiple virtual views of a vehicle that is followed by the ego-vehicle. This highlights that the vectors are clustered together and that the scene flow vectors are consistent.
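A compact sketch of the reconstruction in (4.11) and the ego-motion compensation in (4.12) could look as follows (illustrative only; R and T are assumed to come from the ego-motion estimation, and all names are placeholders):

```python
import numpy as np

def residual_motion_vector(x, y, d, u, v, dp, fx, x0, y0, b, R, T):
    """Residual 3D motion M per (4.12): rotate the 3D point reconstructed at
    time t-1, subtract the 3D point reconstructed at time t, and add the
    camera translation T. For a static point with exact measurements, M vanishes."""
    X_prev = (b / d) * np.array([x - x0, y - y0, fx])                  # point at t-1
    X_curr = (b / (d + dp)) * np.array([x + u - x0, y + v - y0, fx])   # point at t
    return R @ X_prev - X_curr + T
```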
Fig. 4.7 This figure shows the scene flow results for a stationary camera platform. From left to right: optical flow (top) and original image (bottom), scene flow 3D vectors, zoomed-in 3D flow vectors, and zoomed-in 3D flow vectors viewed from above. 3D vectors are colored green ↔ red as stationary ↔ moving
Fig. 4.8 This figure shows the scene flow results when following another vehicle. Top row: an original image with the color-encoded scene flow. The bottom row shows different virtual viewpoints on the same 3D scene. Left to right: panning to the left with a tilt to the right, panning down to be in line with the road surface, panning up and tilting down to be about 45° off horizontal, and looking straight down from above the vehicle (bird's-eye view)
Fig. 4.9 A real-world example of our scene flow. The left image shows the original; the two other images show the scene flow reconstruction when viewed from the front and from the side. Color encoding: green ↔ red corresponds to stationary ↔ moving

Fig. 4.10 Dense scene flow in a traffic scene. The color in the scene flow image shows vector lengths after ego-motion compensation (green to red = 0 to 0.4 m/s). Only the cyclist is moving. The original image is in the upper right corner
Figure 4.9 shows a van driving past the car. This figure demonstrates that the scene flow yields clustered vectors which all point in the same direction, to the right, with similar magnitude, even when viewed from different angles. Figure 4.10 shows an image from a sequence where the ego-vehicle drives past a bicyclist. The depicted scene flow shows that most parts of the scene, including the vehicle stopping at the traffic lights, are correctly estimated as stationary. Only the bicyclist is moving, and its motion is accurately estimated. Compared to Fig. 4.7 there are more outliers in these results. This highlights that the ego-motion
Fig. 4.11 Povray-rendered traffic scene (Frame 11). Top: Color encodes direction (border = direction key) and intensity the magnitude of the optical flow vectors. Brighter areas in the error images denote larger errors. Bottom right: 3D views of the scene flow vectors. Color encodes their direction and brightness their magnitude (black = stationary). The results from the scene are clipped at a distance of 100 m. Accurate results are obtained even at large distances
accuracy is vital when dealing with a moving platform, and slight errors are clearly noticeable in the results. The evaluation on a Povray-rendered traffic scene, where the ground truth disparity is known, can be seen in Fig. 4.11. The color scheme used in the figure encodes directions rather than velocities. Even at large distances, motion vectors are estimated pointing in the right direction, although the noise level increases with the distance to the camera. For follow-on calculations (e.g., speed estimation, detection of moving objects, segmentation of objects, temporal integration), not only the flow vectors themselves but also their accuracy is needed. In the next chapter, we define two metrics that estimate the likelihood that a pixel is moving, i.e., not stationary. These metrics can then be used for follow-on evaluations, such as segmentation and filtering.
Chapter 5
Motion Metrics for Scene Flow
Abstract The optical flow, the disparity, and the scene flow variables are estimated by minimizing variational formulations involving a data and a smoothness term. Both of these terms are based on assumptions of gray value consistency and smoothness which may not be exactly fulfilled. Moreover, the computed minimizers will generally not be globally optimal solutions. For follow-on calculations (e.g. speed, accuracy of world flow, detection of moving objects, segmentation of objects, integration, etc.), it is therefore of utmost importance to also estimate some kind of confidence measure associated with optical flow, disparity and scene flow. In this chapter, the error characteristics for respective variables are analyzed and variance measures are derived from the input images and the estimated variables themselves. Subsequently, scene flow metrics are derived for the likelihood of movement and for the velocity of a scene flow vector.
Everything should be as simple as it is, but not simpler. Albert Einstein
5.1 Ground Truth vs. Reality

Recall that the absolute motion vector M defined in (4.12) yields the absolute scene flow translation vector. Hence, the velocity of an object is the length of this
Fig. 5.1 Reconstructed three-dimensional translation vectors for every pixel of an image, with ground truth values for the image disparity d and the flow estimates d', u, and v. See Fig. 5.3 for the gray value image
translation vector divided by the time elapsed between the two stereo image pairs. The (absolute) translation vectors for every image point of a rendered scene with known camera motion (the exact values are also known for the disparity d, the disparity change d', and the optical flow u and v) are displayed in Fig. 5.1. The translation vectors point in the direction of movement for every reconstructed scene flow vector. Points located on the road surface and in the background are stationary; hence their vector length is zero. In contrast, points located on the vehicle are moving and are represented by translation vectors. The length of a vector appears to be a valid measure for the amount of movement of a world point: calculating the length of the reconstructed translation vector directly yields a motion metric (velocity) for the corresponding scene flow vector.

The scene flow computation introduced in Chap. 4 is an estimation procedure, and the resulting values d, d', u, and v are prone to errors. Taking the estimated scene flow values for the exact same image frame as in Fig. 5.1 and displaying the translation vectors yields a much less meaningful three-dimensional velocity reconstruction. This can be seen in Fig. 5.2. Due to the noise in the scene flow and disparity estimation, a reliable conclusion about the amount of motion cannot be drawn from the length of the translation vector alone. The errors in the input data (scene flow and disparity) are propagated into the resulting translation vector, which yields perturbed translation vectors. Thus, it would be favorable to have fidelity measures for the scene flow estimates and to take these measures into account when computing the motion metrics.
5.2 Derivation of a Pixel-Wise Accuracy Measure

This section analyzes the error distribution of the scene flow variables. A complex driving scene, involving hills, trees, and realistic physics, has been generated using Povray (see [100]). It consists of 400 sequential stereo image pairs. For each image pair at time t, the following measures are calculated:
Fig. 5.2 Reconstructed three-dimensional translation vectors for every pixel of an image, with estimated values for the flow variables d, d', u, and v. See Fig. 5.3 for the error in the individual flow variables. The vehicle is about 50 m away from the camera; compare with Fig. 5.1, where ground truth values are used
• The root mean squared error RMS_{u,v,d'}.
• The error at each pixel, i.e., the difference between estimate and ground truth.
• The mean and variance of the error for u, v, d', and d.

An example picture of the scene flow result is shown in Fig. 5.3. This figure reveals that the major errors occur at object boundaries and that the errors in d' are the lowest in magnitude. Another single-frame example is shown in Fig. 5.4, where the optical flow is perturbed due to occlusion. The evaluation results for the entire image sequence (all 400 frames) can be seen in the graphs in Figs. 5.5, 5.6, and 5.7. From these graphs, one can see which frames are causing problems for the algorithms. It becomes visible that the errors and standard deviations for the scene flow variables u, v, and d' (Fig. 5.6) are of similar shape, yet different magnitude. This is expected, as these variables are solved for jointly within one variational energy minimization framework and are smoothed together. The disparity error graph (Fig. 5.7), on the other hand, has a rather different shape, with the variance increasing in different frames.

The figures also show an error histogram for one specific frame of the sequence. It becomes visible that the error distribution has one salient peak around zero (no error) and long tails to both sides. All four error histograms show this characteristic, which leads to the assumption that the errors follow a common type of error distribution. The next subsection investigates this distribution in more detail and derives conclusions about the quality criteria and the accuracy of the disparity and the scene flow.
5.2.1 A Quality Measure for the Disparity

The core Semi-Global Matching stereo method [40] estimates only pixel-accurate disparity maps. One option to obtain sub-pixel accuracy is a parabolic fitting of the obtained costs [40]. In the implementation used here, sub-pixel accuracy is achieved
Fig. 5.3 Frame 215 in the Povray-rendered traffic scene. The RMS is encoded in intensity, white to black as low to high RMS error (saturated at 1 px error); occluded points are shown as zero values. Flow color is encoded as in Fig. 4.5. Error images are encoded in color and intensity (saturated at 1 px error), where red denotes negative error (ground truth value larger than estimate), blue encodes positive error (ground truth smaller than estimate), and black denotes occluded pixels
Fig. 5.4 Frame 58 in the Povray-rendered traffic scene. Encoding as Fig. 5.3
by a subsequent local fit of a symmetric first-order function [82]. Let d be the disparity estimate of the core SGM method for a certain pixel in the left image. The SGM method in [40] is formulated as an energy minimization problem; hence, changing the disparity by ±1 px increases the accumulated costs. The minimum, however, may be located in between pixels,
Fig. 5.5 RMS error evaluation over the entire sequence
Fig. 5.6 The left graphs show the results for the mean error (light colored line) of the scene flow components u, v, and d'. The bounding dark lines are one standard deviation from the mean. The right graphs show the error histograms for the scene flow between frames 122 and 123
motivating a subsequent sub-pixel estimation step. The basic idea of this step is illustrated in Fig. 5.8. The costs for the three disparity assumptions d − 1 px, d px, and d + 1 px are taken, and a symmetric first-order function is fitted to these costs. This fit is unique and yields a specific sub-pixel minimum, located at the minimum of the fitted function.
Fig. 5.7 The left graph shows the results for the mean error (light colored line, μ(d)) of the disparity d. The bounding dark lines are one standard deviation from the mean (i.e. ±σ(d)). The right graph shows the error histogram for frame 122

Fig. 5.8 The slope of the disparity cost function, Δy, can be used for the quality measure of the disparity estimate
Note that this might not be the exact minimum of the underlying energy, but it is a close approximation obtained by evaluating the energy only at pixel positions. The slope of this fitted function is one possible choice for a goodness-of-fit quality measure. If the slope is low, the disparity estimate is not accurate in the sense that other disparity values would also be plausible when evaluating only the gray value differences of corresponding pixels. If, on the other hand, the slope is large, the sub-pixel position of the disparity is expected to be quite accurate, as any deviation from this position increases the gray value difference of corresponding pixels. Hence, the larger the slope, the better the expected quality of the disparity estimate. Note that the costs mentioned here are accumulated costs that also incorporate smoothness terms. Based on this observation, an uncertainty measure is derived for the expected accuracy (variance) of the disparity estimate:

U_D(x, y) = \frac{1}{\Delta y},   (5.1)
where Δy is the larger of the two relative cost differences. This measure can be calculated for every pixel of the image for which a disparity is estimated. It can then be used to estimate the variance of the expected error distribution at this pixel. If the ground truth disparity is known, the disparity error can be calculated for every pixel of the image. Figure 5.9 displays a resulting error image and its quality measure; a low quality value and a large error are displayed in black, while a good quality value or a low error is displayed in white. One cannot expect the two images to be exactly equal; if that were the case, the uncertainty measure would yield a way to estimate the exact ground truth disparity. Note that the uncertainty values are unit-less.
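The sub-pixel refinement and the uncertainty measure (5.1) can be written down directly from the cost values at d − 1, d, and d + 1. The following sketch is only an illustration of the idea; the actual SGM costs and their scaling are implementation specific.

```python
def subpixel_offset_and_uncertainty(c_minus, c_0, c_plus):
    """Fit a symmetric first-order (V-shaped) function to the accumulated SGM
    costs at d-1, d, d+1 (c_0 is the minimum of the three). Returns the
    sub-pixel offset of its minimum and the uncertainty U_D = 1/dy from (5.1),
    where dy is the larger of the two relative cost differences."""
    left, right = c_minus - c_0, c_plus - c_0   # relative cost differences
    dy = max(left, right)                       # slope of the steeper flank
    if dy <= 0:                                 # degenerate / flat cost profile
        return 0.0, float('inf')
    offset = 0.5 * (left - right) / dy          # sub-pixel position in [-0.5, 0.5]
    return offset, 1.0 / dy                     # low slope -> high uncertainty
```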
Fig. 5.9 The images show the disparity uncertainty measure (left) and the disparity estimate error (right, compared to ground truth) for frame 123 of the evaluation sequence. Saturated pixels correspond to large uncertainty values and large scene flow estimate errors
Fig. 5.10 Quality analysis of the uncertainty measure over all 400 frames of the ground truth sequence. Left: 3D plot of the density in the disparity error vs. uncertainty measure domain. The density is normalized such that, for each value of the uncertainty measure, the integral over the error is one. Right: variance vs. uncertainty measure for the disparity error distribution (a slice of the 3D plot on the left). The figure shows that as the uncertainty measure increases, the variance of the error increases, too
The error image is encoded such that errors above 1 px are black. Pixels with zero disparity and occluded pixels are displayed in white (e.g., the sky region). To assess the quality of the presented uncertainty measure, Fig. 5.10 depicts a 3D plot in which the errors for different uncertainty measures are accumulated over all 400 frames of the sequence. From these accumulated data one can calculate a curve of the variance over the uncertainty measure (see Fig. 5.10, right). As expected, the variance increases as the uncertainty measure increases. Furthermore, Fig. 5.11 shows that the disparity error for low uncertainty measures is approximately Laplacian distributed.
Fig. 5.11 The plot shows the disparity error distribution for a slice in Fig. 5.10 (indicated in yellow) and fitted Laplace and Gaussian distributions with the same variance
5.2.2 A Quality Measure for the Scene Flow

Similar to the one-dimensional disparity estimate, a quality measure is now derived for the three-dimensional scene flow estimate. Due to the higher dimensionality, a simple fit of a linear symmetric function and evaluation of its slope is not adequate. More elaborate quality measures for optical flow have been presented and evaluated in [17]. The authors propose to directly evaluate the energy functional of the variational optical flow computation. The same idea can be carried over to the scene flow case. Minimizing the three constraint equations from (4.5) jointly in a variational approach, while enforcing smoothness of the flow field u, v and the disparity change d', yields the following energy functional:

E = \int_{\Omega} \Bigl( |E_{LF}| + occ \bigl( |E_{RF}| + |E_{DF}| \bigr) + \lambda \bigl( |\nabla u| + |\nabla v| + |\nabla d'| \bigr) \Bigr)\, dx\, dy.

As before, the occ-function returns 1 if a disparity value is estimated for a pixel and 0 otherwise. The minimum of the above functional yields the resulting scene flow estimates. Evaluating the sum inside the integral for each pixel yields a goodness-of-fit, or uncertainty, value for the scene flow variables. If the constraint equations are fulfilled and additionally the solution is smooth, the pixel-wise evaluation,

U_{SF}(x, y) = |E_{LF}| + occ \bigl( |E_{RF}| + |E_{DF}| \bigr) + \lambda \bigl( |\nabla u| + |\nabla v| + |\nabla d'| \bigr),

returns a low uncertainty value. If, on the other hand, the constraint equations are not fulfilled, or at discontinuities in the scene flow field, the pixel-wise evaluation yields a large uncertainty value.

Figure 5.12 displays the pixel-wise uncertainty image and the RMS scene flow error for every pixel.
Fig. 5.12 The images show the scene flow uncertainty measure (left) and the scene flow RMS error (right) for frame 205 of the evaluation sequence. Saturated pixels correspond to large uncertainty values and large disparity estimate errors
Fig. 5.13 Quality analysis of the scene flow uncertainty measure over all 400 frames of the ground truth sequence. The figure shows that as the uncertainty measure increases, the variance of the error increases, too. The plots show only the scene flow u component; the v-component and d -component yield similar results. Left: 3D plot of the density in the scene flow u component error vs. uncertainty measure domain. The density is normalized, s.t. the integral for a specific uncertainty measure over the error is one. Right: Variance vs. uncertainty measure for the scene flow u component error distribution
A large uncertainty value and a large error are displayed in black, while a low uncertainty value or a low error is displayed in white; the error image is encoded such that errors above 1 px are black. Again, one cannot expect the two images to be exactly equal; if that were the case, the uncertainty measure would yield a way to estimate the exact ground truth scene flow. The quality of the uncertainty measure is investigated in more detail over the whole test sequence, as done before for the disparity error. Figure 5.13 shows the result, exemplarily for the error in u.
5.2.3 Estimating Scene Flow Standard Deviations

The last two subsections presented quality measures for the disparity and the three scene flow variables for every pixel in the image. A careful analysis of the quality
measure demonstrated that these quality measures are correlated with the variance of the estimation errors. More precisely, the variances of the estimated scene flow values can be approximated by a linear function,

σ_a(x, y) = σ_{0,a} + γ_a · U_a(x, y),

where a is the estimated scene flow variable (u, v, d, or d', respectively), σ_{0,a} and γ_a are constants, and U_a(x, y) is either U_D(x, y) or U_{SF}(x, y), the uncertainty measure for the scene flow variable a at image position (x, y). The parameters σ_{0,a} and γ_a can be found by fitting a line to the uncertainty–variance plot (Figs. 5.10 and 5.13) for the individual scene flow variable. Such a procedure ignores the fact that the computed variances for a given uncertainty measure clearly depend on the amount of outliers among the errors; the more outliers, the larger the variance. This boils down to the question of how much influence should be given to large errors, in particular to outliers. One way to deal with this issue is to scale the parameters σ_{0,a} and γ_a such that the resulting weighted error distribution,

\tilde{e}_a = e_a / σ_a(x, y),

where e_a is the error of the estimated variable, is standard Laplacian distributed (mean 0 and scale 1). The standard Laplacian distribution is given by ½ exp(−|x|). The plots in Fig. 5.14 show, for one frame of the sequence, that the resulting weighted error distributions of the scene flow errors are approximately standard Laplacian distributed, apart from the tails, which are thicker for the observed distribution due to outliers.
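In code, the variance model and the error normalization described above amount to only a few lines. The constants below are placeholders; in the book they are obtained by fitting to the uncertainty–variance plots.

```python
import numpy as np

def sigma_of(U, sigma0, gamma):
    """Linear variance model: standard deviation of a scene flow variable as a
    function of its uncertainty measure U (U_D or U_SF)."""
    return sigma0 + gamma * U

def normalized_errors(err, U, sigma0, gamma):
    """Errors weighted by their pixel-wise standard deviation; after proper
    scaling of sigma0 and gamma these should be roughly standard Laplacian."""
    return err / sigma_of(U, sigma0, gamma)

# example of fitting the line to accumulated (uncertainty, std. deviation) samples
# sigma0, gamma = np.polyfit(u_samples, np.sqrt(var_samples), 1)[::-1]
```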
5.3 Residual Motion Likelihood

Having defined a way to compute the uncertainty of the scene flow measurements, in this section the uncertainty of the reconstructed 3D scene flow vector is derived. Using error propagation with (4.11) yields

Σ_{SF} = J \, diag(σ_d^2, σ_u^2, σ_v^2, σ_p^2) \, J^{T},   (5.2)

where

J = \begin{pmatrix}
\frac{\partial \dot X}{\partial d} & \frac{\partial \dot X}{\partial u} & \frac{\partial \dot X}{\partial v} & \frac{\partial \dot X}{\partial p} \\
\frac{\partial \dot Y}{\partial d} & \frac{\partial \dot Y}{\partial u} & \frac{\partial \dot Y}{\partial v} & \frac{\partial \dot Y}{\partial p} \\
\frac{\partial \dot Z}{\partial d} & \frac{\partial \dot Z}{\partial u} & \frac{\partial \dot Z}{\partial v} & \frac{\partial \dot Z}{\partial p}
\end{pmatrix}
= b \begin{pmatrix}
\frac{x+u-x_0}{(d+p)^2} - \frac{x-x_0}{d^2} & \frac{-1}{d+p} & 0 & \frac{x+u-x_0}{(d+p)^2} \\
\frac{y+v-y_0}{(d+p)^2} - \frac{y-y_0}{d^2} & 0 & \frac{-1}{d+p} & \frac{y+v-y_0}{(d+p)^2} \\
\frac{f_x}{(d+p)^2} - \frac{f_x}{d^2} & 0 & 0 & \frac{f_x}{(d+p)^2}
\end{pmatrix}.   (5.3)
This error propagation holds true, as long as the distribution is zero-mean and scales by the standard deviation (e.g., Gaussian and Laplacian). We have also assumed that the covariances are negligible, although the problem of scene flow estimation is
Fig. 5.14 The plots show the standard Laplacian distribution (mean 0 and scale 1) in blue and the true error distribution, weighted by individual variances derived from the uncertainty measures, in red (frame 123 of the ground truth sequence). Both curves are similar, confirming that the uncertainty measures are a valid approximation of the variances
highly coupled. However, estimating the covariances is not trivial and in our eyes even impossible. From this model one can see that the disparity measurement has the highest influence on the variance (as it enters the equations either directly or quadratically). Furthermore, the larger the disparity, the more precise the measurement; as d → ∞, all σ_α → 0.

The derivation above only holds for stationary cameras. If the rotation and translation of the camera are given, the total residual motion vector M, (4.12), is calculated as

M = \begin{pmatrix} M_X \\ M_Y \\ M_Z \end{pmatrix} = R \begin{pmatrix} X_t \\ Y_t \\ Z_t \end{pmatrix} - \begin{pmatrix} X_{t-1} \\ Y_{t-1} \\ Z_{t-1} \end{pmatrix} + T.   (5.4)

Again, the camera motion is known only with a certain accuracy. For simplicity we assume the rotational parts to be small, which holds for most vehicle applications. This allows the rotation matrix to be approximated by the identity matrix in the error propagation calculation. We denote the standard deviations of the translation as a three-dimensional
Fig. 5.15 Results using the Mahalanobis distance likelihood ξM . (a) Shows a pedestrian running from behind a vehicle. (b) Shows a lead vehicle driving forward. Color encoding is ξM , i.e., the hypothesis that the point is moving, green ↔ red ≡ low ↔ high
covariance matrix Σ_T. The total translation vector M then has the covariance matrix Σ_M = Σ_{SF} + Σ_T.

Now one can compute the likelihood of a flow vector to be moving, hence belonging to a moving object. Assuming a stationary world and Gaussian error propagation, one expects M to follow a normal distribution with mean 0 and covariance matrix Σ_M. Deviations from this assumption are found by testing this null hypothesis, i.e., the goodness of fit. This can be done by evaluating the Mahalanobis distance [57], giving us the residual motion likelihood

ξ_M(x, y) = M^{T} Σ_M^{-1} M.   (5.5)

The squared Mahalanobis distance ξ_M is χ²-distributed, and outliers are found by thresholding, using the quantiles of the χ² distribution. For example, the 95% quantile of a χ² distribution with three degrees of freedom is 7.81, and the 99% quantile lies at 11.34. Hence a point is moving with a probability of 99% if the Mahalanobis distance is above 11.34. This again holds only if the measurement variances are correct. Figure 5.15 demonstrates results using this metric. In both images, it is easy to identify which parts of the scene are static and which parts are moving. Note that this metric computes a value at every reconstructed scene point. A limitation of this approach is that the movement metric ξ_M only identifies whether a point is likely to be stationary or not; it does not provide any speed estimates. Another two examples of results obtained using this metric can be seen in Fig. 5.16.
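Putting (5.2)–(5.5) together, the moving/static test for a single pixel can be sketched as follows (illustrative only; the χ² threshold uses SciPy, and J is the Jacobian from (5.3) evaluated at the pixel):

```python
import numpy as np
from scipy.stats import chi2   # only used for the decision threshold

def residual_motion_test(M, J, sigmas, Sigma_T, quantile=0.99):
    """Squared Mahalanobis distance xi_M from (5.5) and a moving/static
    decision. sigmas = (sigma_d, sigma_u, sigma_v, sigma_p) for this pixel."""
    Sigma_SF = J @ np.diag(np.square(sigmas)) @ J.T   # error propagation (5.2)
    Sigma_M = Sigma_SF + Sigma_T                      # add ego-motion uncertainty
    xi_M = float(M @ np.linalg.solve(Sigma_M, M))     # M^T Sigma_M^{-1} M
    return xi_M, xi_M > chi2.ppf(quantile, df=3)      # 11.34 at the 99% quantile
```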
5.4 Speed Likelihood

The residual motion likelihood ξ_M omits any information about speed. To estimate the speed S, the L²-norm (length) of the displacement vector is calculated:

S = \|M\|.   (5.6)
Fig. 5.16 Two examples, from a single sequence, of the residual motion likelihood defined in Sect. 5.3. Left to right: original image, optical flow result, the residual motion metric results (green ↔ red represents low ↔ high likelihood that the pixel corresponds to a moving 3D point)
Fig. 5.17 Results using speed S and its standard deviation σS . (a) Shows the pedestrian running with a speed of 3.75 m/s. (b) Shows a lead vehicle driving forward with a speed of 12.5 m/s. Color encoding is S, green ↔ red ≡ stationary ↔ maximum speed. σS is encoded using saturation, points in the distance are therefore gray or black (hence the speed information is inaccurate)
The problem is that points at large distances are always estimated as moving. This is because a small disparity change yields large displacements in 3D (see (4.11)). Even if the variances from the residual motion computation are used, one still cannot derive reliable speed information. One way around this problem is to assign a lenient variance σ_S² to the speed measurement. An approach to estimate this variance is to calculate the spectral norm of the covariance matrix Σ_M. This involves computing the eigenvalues of the squared matrix and then taking the square root of the maximum eigenvalue:

σ_S^2 = \|Σ_M\| = \sqrt{ λ_{max}\bigl( Σ_M^{T} Σ_M \bigr) }.   (5.7)

Using this, we now have a likely speed S and an associated variance σ_S². Using these metrics leads to the examples in Fig. 5.17. In this figure, it is easy to identify the speed of moving targets, and also how confident we are of the speed measurement.
The pedestrian in Fig. 5.17(a) had a displacement of 15 cm at a frame rate of 25 Hz, i.e., a speed of 3.75 m/s. The vehicle in Fig. 5.17(b) had a displacement of 50 cm, i.e., 12.5 m/s. In both examples moving objects are easily identified. From the metrics provided in this section, we now have a likelihood ξ_M that an object is moving, the likely speed S of the object, and the uncertainty σ_S² of the speed.
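The speed metric (5.6) and its lenient variance (5.7) follow directly; a short sketch (names illustrative):

```python
import numpy as np

def speed_and_variance(M, Sigma_M):
    """Speed S = ||M|| (5.6) and the speed variance taken as the spectral norm
    of the motion covariance (5.7)."""
    S = np.linalg.norm(M)
    sigma_S2 = np.linalg.norm(Sigma_M, ord=2)   # largest singular value of Sigma_M
    return S, sigma_S2
```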
Chapter 6
Extensions of Scene Flow
Abstract Most hazards in traffic situations include other moving traffic participants. Hence, reliably detecting motion in world coordinates is a crucial step for many driver assistance systems and an important part of machine visual kinesthesia. In this chapter we present two extensions of scene flow, making it further amenable to practical challenges of driver assistance. Firstly, we present a framework for scene-flow-based moving object detection and segmentation. Secondly, we discuss the application of Kalman filters for propagating scene flow estimation over time.
Without geometry life is pointless. Anonymous
6.1 Flow Cut—Moving Object Segmentation

Classically, moving objects are separated from the stationary background by change detection (e.g. [89]). But if the camera is moving in a dynamic scene, motion fields become rather complex. Thus, the classic change detection approach is not suitable, as can be seen in Fig. 6.1 (e.g. second from left). The goal in this chapter is to derive a segmentation of moving objects for the more general dynamic setting. The
Fig. 6.1 From left to right: input image, difference image between two consecutive frames, motion likelihood, and segmentation result. With the motion likelihood derived from scene flow, the segmentation of the moving object becomes possible although the camera itself is moving
motion of the camera itself is not constrained, nor are assumptions imposed on the structure of the scene, such as rigid body motion. If the scene structure were known, the displacements of flow vectors on rigid objects could be constrained to sub-spaces, which yields efficient means to segment dynamic scenes using two views [104]. If the shape of objects were known, the approach employed in [14] demonstrates an efficient way to find and segment this shape in the image. If nothing about object appearance and rigidness is known, the segmentation clearly becomes more challenging. Highly dynamic scenes with a variety of different conceivable motion patterns are especially challenging and reach the limits of many state-of-the-art motion segmentation approaches (e.g. [29, 52]). This is a pity, because the detection of moving objects is essential for capturing such scene dynamics. The FlowCut [111] segmentation algorithm proposed in this chapter handles such scene dynamics implicitly, with multiple independently moving objects and camera motion.

Although the camera motion is not constrained, it is assumed that the camera motion is (approximately) known. In [122] the authors use dense optical flow fields over multiple frames and estimate the camera motion and the segmentation of a moving object by bundle adjustment. The necessity of rather long input sequences, however, limits its practicability; furthermore, the moving object has to cover a large part of the image in order for its motion to be detected. Most closely related to the presented approach is the work in [101], which presents both a monocular and a binocular approach to moving object detection and segmentation in highly dynamic situations, using sparsely tracked features over multiple frames. In our case the focus is set on moving object detection (instead of tracking), and the minimal number of two consecutive stereo pairs is used.

Algorithm Overview  Figure 6.2 illustrates the segmentation pipeline. The segmentation is performed in the image of a reference frame (the left frame of the stereo camera setting) at time t, employing the graph cut segmentation algorithm [10]. The motion cues used are derived from dense scene flow. In Sect. 6.1.1 the core graph cut segmentation algorithm is presented. It minimizes an energy consisting of a motion likelihood for every pixel and a length term favoring segmentation boundaries along intensity gradients. The employed motion likelihoods are derived from the dense scene flow estimates and the motion metrics presented in Chap. 5. In the monocular setting, solely the optical flow component of the scene flow is used; the employed metric is outlined in Sect. 6.1.2. In Sect. 6.1.3 the monocular method and the binocular method for the segmentation
Fig. 6.2 Work flow for the segmentation of independently moving objects. The lower two images show the main two steps: the motion likelihood result where red denotes independent object motion and the segmentation result. In the images sparse features are used for better visualization
of independently moving objects in different scenarios are compared. The experiments show systematically that the consideration of inaccuracies when computing the motion likelihoods for every pixel yields increased robustness for the segmentation task. Furthermore, the limits of the monocular and binocular segmentation methods are demonstrated and ideas to overcome these limitations are provided.
6.1.1 Segmentation Algorithm

The segmentation of the reference frame into parts representing moving and stationary objects can be expressed by a binary labeling of the pixels,

L(x) = \begin{cases} 1, & \text{if the pixel } x \text{ is part of a moving object,} \\ 0, & \text{otherwise.} \end{cases}   (6.1)

The goal is now to determine an optimal assignment of each pixel to moving or not moving. There are two competing constraints: first, a point should be labeled moving if it has a high motion likelihood ξ_motion derived from the scene flow information, and vice versa; secondly, points should favor a labeling which matches that of their neighbors.
6.1.1.1 Energy Functional

Both constraints enter a joint energy of the form

E(L) = E_{data}(L) + λ E_{reg}(L),
(6.2)
where λ weighs the influence of the regularization force. The data term is given by

E_{data} = -\int_{\Omega} \Bigl( L(x)\,\xi_{motion}(x) + \bigl(1 - L(x)\bigr)\,\xi_{static}(x) \Bigr)\, dx   (6.3)
on the image plane Ω, where ξstatic is a fixed prior likelihood of a point to be static. The regularity term favors labelings of neighboring pixels to be identical. This regularity is imposed more strongly for pixels with similar brightness, because it is assumed that neighboring pixels of similar brightness are more likely to represent the same object:
E_{reg} = \int_{\Omega} \sum_{\hat{x} \in N_4(x)} g\bigl(|I(x) - I(\hat{x})|\bigr)\, \bigl|L(\hat{x}) - L(x)\bigr|\, dx,   (6.4)
Ω
where N4 is the 4 neighborhood (upper, lower, left, right) of a pixel and g(·) is a positive, monotonically decreasing function of the brightness difference between 1 with a positive constant α is used. neighboring pixels. Here g(z) = z+α
6.1.1.2 Graph Mapping Summarizing the above equations yields −L(x)ξmotion (x) − 1 − L(x) ξstatic (x) E(L) = Ω
+λ
xˆ ∈N4 (x)
|L(ˆx) − L(x)| . |I (x) − I (ˆx)| + α
(6.5)
Due to the combinatorial nature, finding the minimum of this energy is equivalent to finding the s–t -separating cut with minimum costs of a particular graph G(v, s, t, e), consisting of nodes v(x) for every pixel x in the reference image and two distinct nodes: the source node s and the target node t [51]. Figure 6.3 illustrates this mapping. The edges e in this graph connect each node with the source, target, and its N4 neighbors. The individual edge costs are defined in Table 6.1. The cost of a cut in the graph is computed by summing up the costs of the cut (sometimes also referred to as removed) edges. Removing the edges of an s–t-separating cut from the graph yields a graph where every node v is connected to exactly one terminal node: either to the source s or to the target t . If we define nodes that are connected to the source as static and those connected to the target as moving, it is easy to see that the cost of an s–t -separating cut is equal to the energy in (6.5) with the corresponding labeling, and vice versa. Thus, the minimum s–t -separating cut yields the labeling that minimizes (6.5). The minimum cut is found using the graph cut algorithm from [10]. Clearly, the result depends on the costs of the edges, especially on the regularization parameter λ. If λ is low, the segmentation only contains single pixels whereas a high value of λ results in only one small segment (or no segment at all)
Fig. 6.3 Illustration of the graph mapping. Red connections illustrate graph edges from the source node s to the nodes, green connections illustrate graph edges from nodes to the target node t. Note that the ξ_motion likelihood may be sparse due to occlusion. In the illustration only pixels with yellow spheres contribute to this motion likelihood. Black connections (indicated by the arrow) illustrate edges between neighboring pixels

Table 6.1 Flow cut edge costs

Edge                                   Edge cost
Source link: s → v(x)                  −ξ_motion(x)
Target link: v(x) → t                  −ξ_static(x)
N_4 neighborhood: v(x̂) ↔ v(x)          λ / (|I(x) − I(x̂)| + α)
This is because, for a high λ, removing edges connected to the source or the target becomes less costly than removing the edges connecting image pixels. Both situations can be seen in Fig. 6.4. The flow vector segmentation can be sped up using the Multi-Resolution Graph Cut; see [99]. In the next section, we discuss how the ξ_motion(x) likelihoods are derived from the scene flow estimates.
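To make the graph construction concrete, the following C++ sketch assembles the per-pixel edge costs of Table 6.1; the function and variable names are illustrative and not taken from the implementation described here. In practice, a max-flow solver such as the algorithm of [10] requires non-negative capacities, which can be ensured by adding the same constant to both terminal edges of a pixel without changing the minimizer.

#include <cmath>
#include <vector>

// Per-pixel terminal and neighborhood edge costs for the s-t graph of
// Sect. 6.1.1.2.  The actual minimum cut is computed by an external max-flow
// solver; only the cost assembly is sketched here.
struct FlowCutCosts {
    std::vector<float> sourceCost;  // edge s -> v(x):  -xi_motion(x)
    std::vector<float> sinkCost;    // edge v(x) -> t:  -xi_static(x)
    std::vector<float> rightCost;   // edge v(x) <-> v(x + (1,0))
    std::vector<float> downCost;    // edge v(x) <-> v(x + (0,1))
};

FlowCutCosts buildFlowCutCosts(const std::vector<float>& xiMotion,
                               const std::vector<float>& xiStatic,
                               const std::vector<float>& image,
                               int width, int height,
                               float lambda, float alpha) {
    FlowCutCosts c;
    const int n = width * height;
    c.sourceCost.resize(n);
    c.sinkCost.resize(n);
    c.rightCost.assign(n, 0.0f);
    c.downCost.assign(n, 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int i = y * width + x;
            // Terminal links according to Table 6.1.
            c.sourceCost[i] = -xiMotion[i];
            c.sinkCost[i]   = -xiStatic[i];
            // Brightness-weighted N4 links (each undirected edge stored once).
            if (x + 1 < width) {
                const float diff = std::fabs(image[i] - image[i + 1]);
                c.rightCost[i] = lambda / (diff + alpha);
            }
            if (y + 1 < height) {
                const float diff = std::fabs(image[i] - image[i + width]);
                c.downCost[i] = lambda / (diff + alpha);
            }
        }
    }
    return c;
}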
6.1.2 Deriving the Motion Likelihoods

This section derives motion constraints to detect independently moving objects (IMOs) in image sequences. The key idea to detect moving objects is to evaluate the hypothesis "the object is not stationary." This is done by virtually reconstructing the three-dimensional translation vector for every flow vector within the image. The length of the reconstructed three-dimensional translation vector is then evaluated, resulting in a motion likelihood for the corresponding flow vector. This reconstruction process needs both the flow
Fig. 6.4 The images on the left show the segmentation for a moving pedestrian appearing behind a stationary vehicle. Outliers are rejected and the segmentation border is accurate. The four images on the right show the influence of the edge costs on the segmentation result (later in the sequence). While small edge costs result in many small segments with only a few pixels (left), high edge costs result in very few small regions (such that the number of cut edges is minimized, right). From left to right: λ = {1.5, 50, 500, 1000}
vector and the motion of the camera between the two time instances when the images were taken (recall Fig. 6.2). More technically speaking, one has to distinguish between motion caused by the ego-vehicle and motion caused by dynamic objects in the scene. The motion of the ego-vehicle greatly complicates the problem of motion detection because simple background subtraction of successive images yields no feasible result.
Referring to the presentation of the monocular optical flow in Chap. 2 and the stereoscopic scene flow in Chap. 4, the segmentation algorithm in this section employs motion constraints for both cases. In the monocular case, the distance of world points to the camera is not known in the general setting (non-static scene) and hence a full three-dimensional reconstruction is not possible. However, due to the fundamental matrix geometry, certain motion constraints can be derived. Section 6.1.2.1 presents this in more detail. In the stereo case, image points can be triangulated and the full three-dimensional translation vector can be reconstructed. In Sect. 5.3 the motion constraint ξ_M(x) was derived, which represents the likelihood that this translation vector does not vanish when subtracting the ego-motion (i.e. that the translation vector does belong to a moving object). Therefore, in the stereo case we set ξ_motion(x) = ξ_M(x).

6.1.2.1 Monocular Motion Analysis

There is a fundamental weakness of monocular three-dimensional reconstruction when compared to stereo methods: moving points cannot be correctly reconstructed by monocular vision. This is due to the camera and unknown object movement between the two sequential images. Hence, optical flow vectors are triangulated assuming that every point belongs to a static object. Such a triangulation is only possible if the displacement vector itself does not violate the fundamental matrix constraint. Needless to say, every track violating the fundamental matrix constraint belongs to a moving object, and the distance to the epipolar lines directly serves as a motion likelihood. In that case, flow vectors need to be projected onto the epipolar lines in order to triangulate them.
Fig. 6.5 Side view of virtual triangulations. The camera moves from c1 to c2 . A point on the road surface is being tracked from Z1 to Z2 . The resulting triangulated point Zt lies behind the camera, if the point is moving faster than the camera (overtaking object, left). On the other hand, if the point is slower than the camera, the resulting triangulated point lies under the road surface. Figure courtesy of J. Klappstein
But even if flow vectors are aligned with the epipolar lines, they may belong to moving objects. This is due to the fact that the virtually triangulated point may be located behind one of the two camera positions or below the ground surface (see Fig. 6.5). Certainly such constellations are only virtually possible, assuming that the point is stationary. In reality, such constellations are prohibited by the laws of physics. Hence, such points must be located on moving objects. In summary, a point is detected as moving if its 3D reconstruction is identified as erroneous. To this end, one checks whether the reconstructed 3D point violates the constraints of a static 3D point, which are the following (note: these constraints are necessary, not sufficient [48]):
Epipolar Constraint: This constraint expresses that the viewing rays of a static 3D point in the two cameras (the lines joining the projection centres and the 3D point) must meet. A moving 3D point in general induces skew viewing rays violating this constraint.
Positive Depth Constraint: The fact that all points seen by a camera must lie in front of it is known as the positive depth constraint. It is also called the chirality constraint and is illustrated in Fig. 6.5. If the viewing rays intersect behind the camera, then the actual 3D point must be moving.
Positive Height Constraint: All 3D points must lie above the road plane. This principle is usually true for traffic scenes and is illustrated in Fig. 6.5. If the viewing rays intersect underneath the road, then the actual 3D point must be moving. This constraint requires additional knowledge about the normal vector of the road surface and the camera distance to it.
Evaluating the constraints presented above results in a likelihood representing whether the flow vector is located on a moving object. This likelihood then serves as input for the segmentation step. If a third camera view is available, the trifocal constraint yields an additional observation: a triangulated 3D point utilizing the first two views must triangulate to the same 3D point when the third view comes into
consideration [50]. According to [50] there are no further constraints in the monocular case. In our experiments we use the motion metric obtained in [49], which employs the above constraints. It results in the motion metric ξ_motion(x), representing the extent to which the constraints of a stationary point are violated.
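As a rough illustration of how the three constraints can be checked for a single track, consider the following C++ sketch. The triangulated point, the epipolar residual, the road-plane parameters, the threshold, and the sign conventions are assumptions made for this sketch; the metric of [49] combines the amounts of violation into a continuous likelihood rather than binary decisions.

#include <cmath>

// Necessary constraints of a static 3D point: epipolar, positive depth,
// positive height.  The road plane is assumed to be n^T X = -camHeight in
// camera coordinates, with the unit normal n pointing from the road toward
// the camera (an assumed convention).
struct StaticPointCheck {
    bool epipolarOk;      // flow vector (almost) consistent with the epipolar geometry
    bool positiveDepth;   // virtually triangulated point lies in front of the camera
    bool positiveHeight;  // virtually triangulated point lies above the road plane
};

StaticPointCheck checkStaticPoint(double epipolarDistance,        // residual of the flow w.r.t. the epipolar line [px]
                                  double X, double Y, double Z,   // virtually triangulated point [m]
                                  double nX, double nY, double nZ,// unit road-plane normal
                                  double camHeight,               // camera height above the road [m]
                                  double epiThreshold) {          // hypothetical tolerance [px]
    StaticPointCheck c;
    c.epipolarOk    = std::fabs(epipolarDistance) < epiThreshold;
    c.positiveDepth = (Z > 0.0);
    // Signed distance of the point to the road plane; negative values would
    // place the (hypothetically static) point below the road surface.
    const double heightAboveRoad = nX * X + nY * Y + nZ * Z + camHeight;
    c.positiveHeight = (heightAboveRoad > 0.0);
    return c;
}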
6.1.3 Experimental Results and Discussion

This section presents results which demonstrate the accurate segmentation of moving objects using scene flow as input data. In the first part, it is shown that the use of the reliability measures presented in Chap. 5 greatly improves the segmentation results when compared to a fixed variance for the disparity and scene flow variables. In the second part, the segmentation results using the monocular and binocular motion segmentation approaches are compared.
6.1.3.1 Robust Segmentation

Figure 6.6 illustrates the importance of using the reliability measures to derive individual variances for the scene flow variables. If the propagation of uncertainties is not used at all, the segmentation of moving objects is not possible (top row). Using the same variance for every image pixel, the segmentation is more meaningful, but outliers are still present in both the motion likelihoods and the segmentation results (middle row). Only when the reliability measures are used to derive individual variances for the pixels is the segmentation accurate, so that the influence of outliers does not corrupt the result (bottom row).
6.1.3.2 Comparing Monocular and Binocular Segmentation

A binocular camera system will certainly always outperform a monocular system, simply because more information is available. However, in many situations a monocular system is able to detect independent motion and segment the moving objects in the scene.
In a monocular setting, motion which is aligned with the epipolar lines cannot be detected without prior knowledge about the scene. Amongst other motion patterns, this includes objects moving parallel to the motion direction of the camera. For a camera moving in depth this includes all (directly) preceding objects and (directly) approaching objects. The PreceedingCar and HillSide sequences in Fig. 6.7 show such constellations. However, this is only true in the unconstrained case; using the ground plane assumption in the monocular setting (no virtually triangulated point is allowed to be located below the road surface) facilitates the detection of preceding objects. This can be seen in the PreceedingCar experiment, where the lower parts of the car become visible in the segmentation. If compared to the stereo setting, which does
Fig. 6.6 Results for different error propagation methods. The left images show the motion likelihoods and the right images the segmentation results
not use any information about scene structure, the motion likelihood for the lower part of the preceding car seems to be even more discriminative. However, if parts of the scene are truly located below the ground plane, as the landscape on the right in the HillSide experiment, these will always be detected as moving, too. Additionally, this does not help to detect approaching objects. Both situations are solved using a binocular camera.
If objects do not move parallel to the camera motion, they are essentially detectable in the monocular setting (Bushes and Running sequences in Fig. 6.8). However, the motion likelihood using a binocular system is more discriminative. This is due to the fact that the three-dimensional position of an image point is known from the stereo disparity. Thus, for a pixel in the left image, the corresponding point in the right image on a stationary object is restricted to a single point and not to the complete viewing ray, as in the monocular setting. Note that non-rigid objects (as in the Running sequence) are detected as well as rigid objects and do not limit the detection and segmentation results, be it monocular or binocular.
Fig. 6.7 The figure shows the energy images and the segmentation results for objects moving parallel to the camera movement. This movement cannot be detected monocularly without additional constraints, such as a planar ground assumption. Moreover if this assumption is violated, it leads to errors (as in the HillSide sequence). In a stereo setting prior knowledge is not needed to solve the segmentation task in these two scenes
Fig. 6.8 The figure shows the energy images and the segmentation results for objects which do not move parallel to the camera motion. In such constellations a monocular as well as a binocular segmentation approach is successful. Note that even the non-rigid independently moving objects are segmented. One can see in the energy images and in the more accurate segmentation results (the head of the person in the Running sequence) that stereo is more discriminative
Further research should focus on feedback loops in the whole motion estimation and segmentation process. Certainly, if motion boundaries and the segmentation of moving rigid objects are known, this provides additional cues for the motion estimation step. The work in [29] estimates piecewise parametric motion fields and a meaningful motion segmentation of the image in a joint approach. A similar approach could be used to segment moving objects in the input images.
6.2 Kalman Filters for Scene Flow Vectors

The analysis of the scene flow vectors in Chap. 5 has shown that the motion vector field includes a large amount of noise. The segmentation of moving objects in Sect. 6.1 took into account spatially varying noise, but any correlation of the data and noise over successive frames has been neglected. In this section we use Kalman filters assigned individually to each scene flow vector of the motion field and optimize a linear motion model. This yields much more reliable estimates of the scene flow direction and magnitude, as can be seen in Fig. 6.9. The figure displays (amongst others) the ground truth 3D motion field, the motion estimated from two consecutive stereo frames where correspondences are established using optical flow, and the reconstructed 3D scene flow field. It becomes visible that the reconstructed 3D motion field from scene flow estimates is less noisy than the 3D motion from two consecutive stereo disparity maps. However, the amount of noise is still considerable. Taking into account that in most cases the motion of objects is smooth and can be approximated by a linear motion, this leads to the concept of Kalman filters to estimate this 3D motion.
Here, we present a method to estimate the 3D position and 3D motion of a point using the Kalman filter approach described in [45]. Due to the recursive nature of the Kalman filter, the estimation is improved continuously with each measurement by updating the state vector and its associated covariance matrix. This eliminates the need to save a history of measurements and is computationally highly efficient. Additionally, Kalman filters provide improved measurement uncertainties which can be regarded when the flow field is evaluated for further applications (e.g. segmentation).
6.2.1 Filtered Flow and Stereo: 6D-Vision

The idea of Kalman filters for position and motion, called 6D-Vision, was also pursued in [74], where Kalman filters for 2000 Kanade–Lucas–Tomasi (KLT) features have been employed for temporal integration. The state vector ξ of the Kalman filter includes the 3D position and the 3D velocity vector in camera coordinates, therefore ξ = (X, Y, Z, Ẋ, Ẏ, Ż)^⊤. The propagation of the state vector ξ_{t−1} from the previous time instance t − 1 to the current one at time t is modeled assuming linear motion. Therefore, the system model is given by the
Fig. 6.9 Estimated motion field of the described methods. The color encodes the velocity: green encodes 0.0 m/s, red encodes 8.0 m/s. The vectors show the predicted 3D position in 0.250 s (figures (a), (d), (e)), resp., 0.050 s (figures (b), (c))
linear equation system
\xi_t = \begin{pmatrix} R & \Delta t \cdot R \\ 0 & R \end{pmatrix} \xi_{t-1} + \begin{pmatrix} T \\ 0 \end{pmatrix}   (6.6)
with R and T as the rotation and the translation component of the inverse motion of the observer, and Δt as the time in between both observed frames.
The measurement model of the Kalman filter describes the relation between the measurement vector z and the state vector ξ of the Kalman filter. In [74] the authors used the measurement vector z_t, defined by
z_t = \begin{pmatrix} \mathbf{x}_t \\ d_t(\mathbf{x}_t) \end{pmatrix}, \qquad \mathbf{x}_t = \mathbf{x}_{t-1} + \mathbf{u}(\mathbf{x}_{t-1}).   (6.7)
x_t is the image position at time t, d_t(x_t) its measured disparity value, and u(x_t) the optical flow estimate from the Lucas–Kanade method [54]. Note that x_{t−1} in (6.7) corresponds to the image position of the KLT feature itself at time t − 1 and not to the projection of the filtered state ξ_{t−1}. That means the image position of the features is only determined by the KLT feature tracker, while the filtering only influences the velocity and the disparity estimation. Otherwise, undesired low-pass filtering effects of the Kalman filter would become visible (see [74] for more details). In such a setting, only the position components of the state vector are directly measured, and the relation between the measured projection and the reconstructed 3D point is given by the (inverse) camera projection described in Sects. 4.2.1 and 4.3. Since the measurement model must be formulated in the Euclidean space rather than in the projective space, the measurement model is non-linear:
z = \Pi(\xi) = \frac{1}{Z} \begin{pmatrix} X f_x \\ -Y f_y \\ b f_x \end{pmatrix} + \begin{pmatrix} x_0 \\ y_0 \\ 0 \end{pmatrix}.   (6.8)
Using a multiple-filter approach, the 6D-Vision motion field estimation method provides robust results in real-time for real-world scenarios. However, the information provided by the 6D-Vision approach is only sparse (e.g. 2000 features in [74]). In a follow-on publication [75] the authors therefore proposed two approaches: applying Kalman filters to dense stereo and optical flow, and filtering scene flow measurements. These will be presented in the next sections.
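The following sketch illustrates a per-point filter with the system model (6.6) and a generic update step, using the Eigen library for the linear algebra. The structure and names are our assumptions, not the original implementation; the non-linear measurement (6.8) is handled here by a standard extended-Kalman-style linearization with a Jacobian H supplied by the caller.

#include <Eigen/Dense>

using Vec6 = Eigen::Matrix<double, 6, 1>;
using Mat6 = Eigen::Matrix<double, 6, 6>;

struct PointFilter {
    Vec6 xi;  // state (X, Y, Z, Xdot, Ydot, Zdot)
    Mat6 P;   // state covariance
};

// Prediction with the linear system model (6.6): rotation R and translation T
// of the inverse ego-motion, frame interval dt, process noise Q.
void predict(PointFilter& f, const Eigen::Matrix3d& R,
             const Eigen::Vector3d& T, double dt, const Mat6& Q) {
    Mat6 A = Mat6::Zero();
    A.block<3, 3>(0, 0) = R;
    A.block<3, 3>(0, 3) = dt * R;
    A.block<3, 3>(3, 3) = R;
    Vec6 b = Vec6::Zero();
    b.head<3>() = T;
    f.xi = A * f.xi + b;
    f.P  = A * f.P * A.transpose() + Q;
}

// Update with a measurement z (image position and disparity, cf. (6.7)):
// z_pred is the projection Pi(xi) of the predicted state, H its Jacobian at
// the predicted state, and R_meas the measurement covariance (e.g. from the
// reliability measures of Chap. 5).
void update(PointFilter& f, const Eigen::Vector3d& z,
            const Eigen::Vector3d& z_pred,
            const Eigen::Matrix<double, 3, 6>& H,
            const Eigen::Matrix3d& R_meas) {
    const Eigen::Matrix3d S = H * f.P * H.transpose() + R_meas;   // innovation covariance
    const Eigen::Matrix<double, 6, 3> K = f.P * H.transpose() * S.inverse();  // Kalman gain
    f.xi += K * (z - z_pred);
    f.P   = (Mat6::Identity() - K * H) * f.P;
}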
6.2.2 Filtered Dense Optical Flow and Stereo: Dense-6D

To utilize as much information as possible from a stereo image sequence, the KLT tracker in the measurement step is replaced by a dense optical flow algorithm (Dense-6D). Modern parallel hardware (in our implementation an NVidia graphics adapter with CUDA capability), together with sophisticated numerical computation schemes in the filtering process, enables us to assign a Kalman filter to every single pixel of the input image sequence (of size 640 px × 480 px) and to apply them in real-time (at 10 Hz with reasonable results in our example). At the beginning of the computation step from image I_{t−1} to I_t, every pixel x_{t−1} on the discrete pixel grid is associated with one Kalman filter K_{t−1}(x_{t−1})
and one sub-pixel component s_{t−1}(x_{t−1}). After the optical flow field from I_{t−1} to I_t has been determined and the filters have been updated during the filtering step, K_{t−1}(x_{t−1}) → K_t(x_{t−1}), the updated Kalman filter field K_t(x_{t−1}) must be warped (or resampled) along the sub-pixel accurate optical flow u(x_{t−1}) to obtain the filter field K_t(x_t) on the new discrete pixel positions x_t. The updates of the positions and the sub-pixel components are given by
x_t = \bigl\lfloor x_{t-1} + s_{t-1}(x_{t-1}) + u(x_{t-1}) + 0.5 \bigr\rfloor,
s_t(x_t) = \bigl( \bigl( s_{t-1}(x_{t-1}) + u(x_{t-1}) + 0.5 \bigr) \bmod 1 \bigr) - 0.5.
At every time step the sub-pixel component is updated using the sub-pixel accurate optical flow, which is always taken from the discrete position of the pixel grid, since exact optical flow information is only available at these points. During the resampling step it is possible that not every pixel x_t of the current image is referenced by a flow vector u(x_{t−1}). In this case a new filter has to be created with initial values and attached to the empty pixel. One could also think of an initialization based on the states and the covariances of the surrounding pixels; this is not implemented in our experiments for performance reasons. If, on the other hand, one pixel x_t of the current image is referenced by more than one flow vector u(x_{t−1}), one either has to decide which of the filters will be used with the corresponding pixel for the next frame, or has to combine them into a new one. In this case, the covariances of the competing filters can be used to weight between them. It is also reasonable to use the depth information, so that the filter with the smallest Z value in the filter state survives while the others are deleted. In our implementation, the filter which is assigned last remains alive; a more complex solution would compromise the real-time capability significantly, which outweighs the benefit in accuracy.
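A minimal sketch of this position and sub-pixel bookkeeping, with illustrative names, could look as follows.

#include <cmath>

// New discrete pixel position and sub-pixel remainder after warping a filter
// along the sub-pixel accurate optical flow.
struct ResampledPosition {
    int   x, y;     // new discrete pixel position x_t
    float sx, sy;   // new sub-pixel component s_t(x_t) in [-0.5, 0.5)
};

ResampledPosition resample(int x, int y,        // discrete position x_{t-1}
                           float sx, float sy,  // sub-pixel component s_{t-1}
                           float u, float v) {  // optical flow u(x_{t-1})
    ResampledPosition r;
    const float fx = x + sx + u + 0.5f;
    const float fy = y + sy + v + 0.5f;
    // x_t = floor(x_{t-1} + s_{t-1} + u + 0.5): round to the nearest pixel.
    r.x = static_cast<int>(std::floor(fx));
    r.y = static_cast<int>(std::floor(fy));
    // s_t = ((s_{t-1} + u + 0.5) mod 1) - 0.5; since x_{t-1} is an integer,
    // taking the fractional part of fx/fy is equivalent.
    r.sx = (fx - std::floor(fx)) - 0.5f;
    r.sy = (fy - std::floor(fy)) - 0.5f;
    return r;
}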
6.2.3 Filtered Variational Scene Flow: Variational-6D

If the dense optical flow method is replaced by a variational scene flow scheme as proposed in Chap. 4, the direct estimate of the disparity change ḋ(x) can be used as an additional measurement. In this case, the measurement vector for the Kalman filter reads
z_t = \begin{pmatrix} \mathbf{x}_t \\ d_t(\mathbf{x}_t) \\ d_{t-1}(\mathbf{x}_{t-1}) + \dot{d}(\mathbf{x}_{t-1}) \end{pmatrix},   (6.9)
and the extended projection Π in (6.8) is replaced by
z = \Pi(\xi) = \frac{1}{Z} \begin{pmatrix} X f_x \\ -Y f_y \\ b f_x \\ b f_x \end{pmatrix} + \begin{pmatrix} x_0 \\ y_0 \\ 0 \\ 0 \end{pmatrix}.   (6.10)
The Kalman filter weights the two disparity measurements against each other according to the covariance matrix. The additional measurement of the disparity change from the variational scene flow algorithm increases both robustness and accuracy of the motion field estimation without losing real-time capability.
6.2.4 Evaluation with Ground Truth Information

In [75] the authors compared the following motion field estimation techniques described in this chapter on synthetic stereo image data:
1. Differential motion field estimation from optical flow (Chap. 2) and stereo
2. Variational scene flow from two frames (Chap. 4)
3. Dense-6D: dense optical flow and stereo (Sect. 6.2.2)
4. Variational-6D: Kalman filtered variational scene flow (Sect. 6.2.3).
The ground truth images used in the experiments were rendered using Povray. The experiments are conducted on a stereo sequence with an image resolution of 640 px × 480 px × 12 bit and 250 frames. The experiments were conducted on an Intel Quad-Core Extreme Edition at 3 GHz with an NVidia GeForce 285 GTX graphics adapter. On this configuration, the dense optical flow calculation is performed in 24 ms, whereas the dense scene flow computation takes 65 ms. The 640 × 480 Kalman filters are processed in 12 ms. This enables us to achieve a frame rate of 25 Hz for the Dense-6D algorithm and about 10 Hz for the Variational-6D approach.
The artificial traffic sequence consists of a scene with crossing and turning vehicles and a moving camera. The three-dimensional ground truth position and motion field is available and is used for the determination of the error distributions ρ[χ], with χ = Z, Ẋ, Ẏ, Ż as the quantities to analyze and χ* as the corresponding ground truth. The error distributions, accumulated over the whole image Ω and the whole sequence [0, T], are shown in Fig. 6.10. In addition, the median (ME) of the error distribution of χ and the root mean squared (RMS) error are computed as
E_{\text{RMS}}[\chi] = \sqrt{ \frac{1}{N} \sum_{t=0}^{T} \int_\Omega \bigl( \chi_t(\mathbf{x}) - \chi_t^*(\mathbf{x}) \bigr)^2 \, d\mathbf{x} }.   (6.11)
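A straightforward way to compute these two statistics from the collected per-pixel errors is sketched below (names are illustrative).

#include <algorithm>
#include <cmath>
#include <vector>

// Median error (ME) and root mean squared (RMS) error of a quantity chi
// against its ground truth chi*, accumulated over all pixels and frames.
double medianError(std::vector<double> errors) {          // errors[i] = chi - chi*
    std::nth_element(errors.begin(),
                     errors.begin() + errors.size() / 2, errors.end());
    return errors[errors.size() / 2];
}

double rmsError(const std::vector<double>& errors) {
    double sumSq = 0.0;
    for (double e : errors) sumSq += e * e;
    return std::sqrt(sumSq / static_cast<double>(errors.size()));
}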
One can clearly see from Fig. 6.10 and Table 6.2 that Dense-6D outperforms the scene flow computation method known from the literature with respect to accuracy and robustness. To demonstrate that this is not only the case when considering relatively linear, synthetic motion, we will show results from real-world traffic scenes in the next section. In Fig. 6.9 the crossing vehicle from the synthetic scene is shown in detail for the four evaluated approaches. It becomes clear that the common variational scene flow algorithm is not able to estimate the direction of the shown vehicle correctly due to the relatively large distance. However, this is not the case for the Dense-6D and Variational-6D algorithms.
Fig. 6.10 Error distributions of the Z position and the velocity components calculated from the direct combination of optical flow and stereo (gray), the scene flow (black), the Dense-6D method (red), and the Variational-6D method (blue)

Table 6.2 Median error (ME) and root mean square error (RMS) of the Z position and velocity components of the four evaluated methods

Method           Z [m]               Ẋ [m/s]              Ẏ [m/s]              Ż [m/s]
                 ME       RMS        ME        RMS        ME        RMS        ME        RMS
Direct           0.0010   2.749      0.0462    42.0093    0.0004    15.370     0.4374    141.442
Scene flow       0.0080   2.807      0.0179    22.7186    0.0172    11.470     −0.1173   67.520
Dense-6D         0.0104   1.068      −0.0065   0.3623     −0.0044   0.339      0.0107    2.538
Variational-6D   0.0085   1.282      −0.0007   0.3712     −0.0040   0.319      −0.0044   2.537
Overall, we are not able to detect significant advantages in accuracy of one of the dense algorithms presented in this section over the other. Both approaches outperform dense real-time methods known from the literature by orders of magnitude. In terms of computational complexity, the Dense-6D method is favorable. Especially
Fig. 6.11 Left: Typical traffic scenes. Right: Motion field, estimated by the Dense-6D algorithm. The color encodes the velocity (from green to red) of the observed points
in traffic scenarios, where motion is mainly linear (in particular the motion in depth), we therefore propose to use this method. However, further research is needed to evaluate the different scene flow methods in situations where the scene motion is mainly non-linear. Due to the additional disparity rate measurement, the Variational-6D method might then be the preferred choice.
6.2.5 Real-World Results

The two newly proposed estimation methods are directly applicable in real-world scenarios and can form the basis for robust and reliable object detection and segmentation. Figure 6.11 shows the result of the Dense-6D method for typical dynamic traffic scenes. We believe that this methodology will become a key element in future safety driver assistance systems.
Chapter 7
Conclusion and Outlook
Teaching machines to see is a challenge that has motivated numerous researchers over the last decades. In the 1970s, pioneers like Marvin Minsky believed that artificial intelligence would mostly be solved within a generation and that the vision problem could be solved in a summer project. Since then, substantial research efforts have been invested, numerous journals have been established, and large international conferences with many hundreds of participants take place each year, all dedicated to solving the "vision problem". These developments have shown that the vision problem bears numerous intricate challenges which will likely occupy many generations to come.
Among the central challenges in low-level vision is the estimation of geometry and motion of the world around us. Reliable and fast estimates of geometry and motion are a central prerequisite for solving higher-level challenges such as scene analysis and driver assistance. The goal of this book was to review some of the central advances in three decades of research on motion estimation and motion analysis from image sequences: from the early variational approach of Horn and Schunck for 2D optic flow estimation to recent variational approaches for 3D scene flow estimation and extensions. In particular, we discussed several algorithmic strategies for minimizing the respective functionals, with particular emphasis on an alternation of correspondence optimization and vector field smoothing. This provided profound insights into the process of apparent motion analysis in image sequences.
Based on stereo depth estimation and on extending the 2D optical flow to three dimensions, we presented a real-time capable approach to dense scene flow estimation. We discussed metrics to evaluate whether a scene flow vector belongs to a moving object and to derive the accuracy of a scene flow vector's velocity estimate. Two extensions to scene flow were presented in more detail: the segmentation of independently moving objects in image sequences and Kalman filters for scene flow vectors which allow a stabilization of scene flow estimation in a predictor-estimator framework.
We believe that the methods described in this book will advance many higher-level vision challenges, including the problem of driver assistance. Accurate and robust estimates of geometry and motion from an on-board stereo camera system will
allow one to estimate which objects are moving where, thus enabling predictions of what will happen in the next fraction of a second. Understanding the relationship between objects and figures, predicting what is going to happen, and analyzing potential hazards are the main parts of ongoing research. While there remain many open challenges, we hope that this book will provide one step toward the goal of teaching machines to see.
Chapter 8
Appendix: Data Terms and Quadratic Optimization
8.1 Optical Flow Constraint Data Term

For the optical flow constraint (2.2) the solution space is two-dimensional, hence u = (u_1, u_2)^⊤ ∈ ℝ². The optical flow constraint itself computes as p(u) = I_t + ∇I^⊤u, hence p_0 = λI_t and p = λ∇I. The minimization problem then becomes
\min_{u_1, u_2} \Bigl\{ \frac{1}{2}(u_1 - u_1')^2 + \frac{1}{2}(u_2 - u_2')^2 + \underbrace{\lambda\,\bigl|I_t + \nabla I^\top \mathbf{u}\bigr|}_{|p_0 + \mathbf{p}^\top \mathbf{u}|} \Bigr\},   (8.1)
where u' = (u_1', u_2')^⊤ is a given approximate solution. A closer look reveals that (8.1) is equivalent to the minimization problem in (2.7) (up to a scale factor).
8.2 Adaptive Fundamental Matrix Constraint

Within the spectrum of conceivable optic flow patterns in Computer Vision, flow patterns that correspond to 3D rigid body motion play a central role. This is not surprising, since they invariably arise for static scenes filmed by a moving camera or for objects moving rigidly. It is well known that in the case of rigid body motion the two-dimensional optic flow estimation problem is reduced to a one-dimensional search along the epipolar lines which can actually be solved quite efficiently. The challenge is that the epipolar lines are usually unknown and need to be estimated from established point correspondences themselves. This bootstrapping problem is usually solved iteratively and has one major drawback: If the scene is not stationary, both the epipolar lines and the optical flow suffer from the biasing prior inflicted by the other.
Variational optical flow techniques with prior knowledge of the fundamental matrix geometry were presented using hard constraints [83] and soft constraints [97], the latter estimating simultaneously the optical flow and the fundamental matrix. In both approaches, the algebraic distance to the epipolar rays was evaluated.
In the following, an adaptive regularization is proposed which favors rigid body motion only if this is supported by the image data. In particular, an adaptive weighting γ(u) aims at engaging the rigid body constraint based on the amount of independent motion found within the scene. It is given by
\gamma(\mathbf{u}) = \begin{cases} \lambda_F, & \text{if } \int_\Omega \rho_F(\mathbf{u}, \mathbf{x}) / \|\mathbf{u}\| \, d^2x < \delta_F, \\ 0, & \text{otherwise.} \end{cases}   (8.2)
In the formula, ρ_F(u, x) is the symmetric distance of the flow vector to the two epipolar lines,
\tilde{\mathbf{x}} = F \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} \quad\text{and}\quad \tilde{\mathbf{u}} = F^\top \begin{pmatrix} \mathbf{x} + \mathbf{u} \\ 1 \end{pmatrix}.
With the 3 × 3 fundamental matrix F it is defined as
\rho_F(\mathbf{u}, \mathbf{x}) = \frac{1}{\sqrt{\tilde{x}_1^2 + \tilde{x}_2^2 + \tilde{u}_1^2 + \tilde{u}_2^2}}\, \bigl| c + \tilde{x}_1 u_1 + \tilde{x}_2 u_2 \bigr|,   (8.3)
where a sub-index i of a vector denotes its ith component and the constant c is computed as
c = \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix}^{\!\top} F \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix}.   (8.4)
This formulation is symmetric w.r.t. the two input images and normalized in the sense that it does not depend on the scale factor used to compute F [56]. To this end, first an optic flow field is computed and a fundamental matrix is estimated by minimizing the non-linear criterion
\min_F \sum \frac{1}{\tilde{x}_1^2 + \tilde{x}_2^2 + \tilde{u}_1^2 + \tilde{u}_2^2} \left( \begin{pmatrix} \mathbf{x} + \mathbf{u} \\ 1 \end{pmatrix}^{\!\top} F \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} \right)^{\!2}.   (8.5)
Then, the estimated fundamental matrix itself is used to drive the optical flow toward the epipolar lines. The crucial part is how to weight this data term in order to maintain robustness in dynamic scenes and increase accuracy in static scenes. This is where the adaptive weighting γ(u), which analyzes the amount of independent motion, becomes important. Essentially, a violation of the epipolar constraint indicates other moving objects in the scene, while the counter-hypothesis does not generally hold (e.g. the motion of objects might coincide with the epipolar lines). However, simply
computing the average symmetric distance to the epipolar lines, ∫_Ω ρ_F(u, x) d²x, does not yield useful results, as flow vectors in a dynamic scene might be relatively small in magnitude, yielding only small errors although the complete scene is dynamic. Here the relative symmetric distance, weighted by the inverse length of the computed optical flow, is used. Such a measure yields a rather robust estimate of the relative motion contained in the scene depicted by the two images. In summary, the adaptive rigid body regularization has the following effect:
• If the average relative deviation ∫_Ω ρ_F(u, x)/‖u‖ d²x is above a predefined threshold δ_F, the fundamental matrix regularization will be switched off so it does not bias the estimation of motion in dynamic scenes.
• If on the other hand the relative deviation is smaller than δ_F, then the fundamental matrix regularization is imposed so as to favor the estimation of optic flow fields that are consistent with rigid body motion. Note that even in this case the fundamental matrix constraint is not enforced to be exactly fulfilled. Instead, deviations of the optic flow from rigid body motion are allowed wherever this is supported by the image data.
The resulting linearized data term becomes
p(\mathbf{u}) = \gamma(\mathbf{u})\,\rho_F(\mathbf{u}, \mathbf{x}) = \frac{\gamma(\mathbf{u})}{\sqrt{\tilde{x}_1^2 + \tilde{x}_2^2 + \tilde{u}_1^2 + \tilde{u}_2^2}} \bigl( c + \tilde{x}_1 u_1 + \tilde{x}_2 u_2 \bigr), \quad\text{with}\quad \lambda_p = \frac{\gamma(\mathbf{u})}{\sqrt{\tilde{x}_1^2 + \tilde{x}_2^2 + \tilde{u}_1^2 + \tilde{u}_2^2}},\ \ p_0 = \lambda_p\, c,\ \ \mathbf{p} = \lambda_p\,(\tilde{x}_1, \tilde{x}_2)^\top.   (8.6)
Note that the constant p0 and the scalar value λp involve the optical flow vector itself. Both values are calculated using flow vectors from the last iteration (hence lagged feedback is used here).
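The switch (8.2) itself is simple to evaluate once the epipolar distances of the current flow field are available. The following sketch, with illustrative names, averages the relative deviation over all pixels as a discrete approximation of the integral over Ω.

#include <algorithm>
#include <cstddef>
#include <vector>

// Adaptive weight of the fundamental matrix prior, cf. (8.2): returns lambdaF
// if the mean of rho_F(u,x)/||u(x)|| stays below deltaF, and 0 otherwise.
double adaptiveWeight(const std::vector<double>& rhoF,     // rho_F(u, x) per pixel, cf. (8.3)
                      const std::vector<double>& flowMag,  // ||u(x)|| per pixel
                      double lambdaF, double deltaF) {
    double relDeviation = 0.0;
    for (std::size_t i = 0; i < rhoF.size(); ++i) {
        const double mag = std::max(flowMag[i], 1e-6);     // guard against division by zero
        relDeviation += rhoF[i] / mag;
    }
    relDeviation /= static_cast<double>(rhoF.size());      // average relative deviation
    return (relDeviation < deltaF) ? lambdaF : 0.0;
}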
8.3 Quadratic Optimization via Thresholding

Equation (2.11) is a combination of convex functions and hence is convex itself. If (2.11) were differentiable, the minimum could be found by setting its derivative with respect to u equal to zero. However, the absolute function |x| is degenerate at x = 0 and not differentiable. More sophisticated quadratic optimization techniques have to be used to find the true minimum of (2.11). In this chapter the Karush–Kuhn–Tucker (KKT) conditions [47] are used to derive the thresholding steps for a single data term and two data terms in (2.11). For completeness, let us first reproduce the KKT conditions.
8.3.1 Karush–Kuhn–Tucker (KKT) Conditions

For a convex quadratic minimization problem min_x {f(x)} under N linear inequality constraints g_i(x) ≤ 0, where 0 ≤ i ≤ N, a solution x* is a global optimum if there exist constants μ_i such that the Karush–Kuhn–Tucker (KKT) conditions [47] are fulfilled:
Stationarity: ∇f(x*) + Σ_i μ_i ∇g_i(x*) = 0.
Primal feasibility: g_i(x*) ≤ 0.
Dual feasibility: μ_i ≥ 0.
Complementary slackness: μ_i g_i(x*) = 0 ∀i.
8.3.2 Single Data Term

For a single data term, the task is to find the vector u* which solves the minimization problem
\mathbf{u}^* = \arg\min_{\mathbf{u}} \Bigl\{ \frac{1}{2}\|\mathbf{u} - \mathbf{u}'\|^2 + \bigl|p_0 + \mathbf{p}^\top \mathbf{u}\bigr| \Bigr\},   (8.7)
or its dual formulation
\mathbf{u}^* = \arg\min_{\mathbf{u},\, p'} \Bigl\{ \frac{1}{2}\|\mathbf{u} - \mathbf{u}'\|^2 + p' \Bigr\}   (8.8)
with linear side conditions
p_0 + \mathbf{p}^\top \mathbf{u} - p' \le 0 \quad\text{and}\quad -p_0 - \mathbf{p}^\top \mathbf{u} - p' \le 0.   (8.9)
Taking the above equations into the KKT conditions for the global optimum of the dual formulation yields:
Stationarity:
\begin{pmatrix} \mathbf{u}^* - \mathbf{u}' \\ 1 \end{pmatrix} + \mu_1 \begin{pmatrix} \mathbf{p} \\ -1 \end{pmatrix} + \mu_2 \begin{pmatrix} -\mathbf{p} \\ -1 \end{pmatrix} = 0.   (8.10)
Primal feasibility:
p_0 + \mathbf{p}^\top \mathbf{u}^* - p' \le 0 \quad\text{and}\quad -p_0 - \mathbf{p}^\top \mathbf{u}^* - p' \le 0.
Dual feasibility:
\mu_1 \ge 0 \quad\text{and}\quad \mu_2 \ge 0.
Complementary slackness:
\mu_1 \bigl(p_0 + \mathbf{p}^\top \mathbf{u}^* - p'\bigr) = 0 \quad\text{and}\quad \mu_2 \bigl(-p_0 - \mathbf{p}^\top \mathbf{u}^* - p'\bigr) = 0.
The minimum can be found by analyzing the three possible cases for the single data term in the primal formulation: p(u*) < 0, p(u*) = 0, and p(u*) > 0. The KKT conditions yield a simple and efficient thresholding scheme to ensure the global optimum of the solution.
In the first case (p(u*) < 0), the minimum of (8.7) is given by taking the derivative w.r.t. u, yielding u* = u' + p. This implies for the dual variable
p' = -p(\mathbf{u}^*) = -\bigl( p_0 + \mathbf{p}^\top (\mathbf{u}' + \mathbf{p}) \bigr) = -p(\mathbf{u}') - \mathbf{p}^\top \mathbf{p}.
Due to construction, the second side condition in (8.9) is binding (its left side is zero), directly yielding primal feasibility for it. Primal feasibility for the first side condition in (8.9) is fulfilled iff
p_0 + \mathbf{p}^\top \mathbf{u}^* - p' \le 0
\iff p_0 + \mathbf{p}^\top(\mathbf{u}' + \mathbf{p}) - \bigl( -p(\mathbf{u}') - \mathbf{p}^\top\mathbf{p} \bigr) \le 0
\iff \underbrace{p_0 + \mathbf{p}^\top\mathbf{u}'}_{p(\mathbf{u}')} + \mathbf{p}^\top\mathbf{p} + p(\mathbf{u}') + \mathbf{p}^\top\mathbf{p} \le 0
\iff 2\bigl( p(\mathbf{u}') + \mathbf{p}^\top\mathbf{p} \bigr) \le 0.
If this holds, the following holds also: setting μ_1 = 1 and μ_2 = 0 directly yields dual feasibility, complementary slackness, and stationarity of the solution u*. Hence, if the primal feasibility check returns true, the optimum for the point is found. This can be checked by evaluating p(u') < −p^⊤p.
Equivalently, if p(u*) > 0, u* = u' − p is the global minimum iff the primal feasibility check for (8.9) holds, yielding the thresholding check −2(p(u') − p^⊤p) ≤ 0. This can be evaluated by checking for p(u') > p^⊤p.
If the data term vanishes (i.e. p(u*) = p' = 0), both side conditions are binding. The minimum of min_u { ½‖u − u'‖² } under the side condition p_0 + p^⊤u = 0 yields the solution
\mathbf{u}^* = \mathbf{u}' - \frac{p(\mathbf{u}')}{\mathbf{p}^\top\mathbf{p}}\,\mathbf{p}.
Primal feasibility and complementary slackness are directly fulfilled for both side conditions, because they are both binding. A final thresholding check has to ensure dual feasibility for μ_1 and μ_2 fulfilling the stationarity constraint. Plugging u* into (8.10) yields
0 = \begin{pmatrix} \mathbf{u}^* - \mathbf{u}' \\ 1 \end{pmatrix} + \mu_1 \begin{pmatrix} \mathbf{p} \\ -1 \end{pmatrix} + \mu_2 \begin{pmatrix} -\mathbf{p} \\ -1 \end{pmatrix},   (8.11)
0 = \begin{pmatrix} -\frac{p(\mathbf{u}')}{\mathbf{p}^\top\mathbf{p}}\,\mathbf{p} \\ 1 \end{pmatrix} + \mu_1 \begin{pmatrix} \mathbf{p} \\ -1 \end{pmatrix} + \mu_2 \begin{pmatrix} -\mathbf{p} \\ -1 \end{pmatrix},   (8.12)
\begin{pmatrix} \frac{p(\mathbf{u}')}{\mathbf{p}^\top\mathbf{p}}\,\mathbf{p}^\top\mathbf{p} \\ 1 \end{pmatrix} = \mu_1 \begin{pmatrix} \mathbf{p}^\top\mathbf{p} \\ 1 \end{pmatrix} + \mu_2 \begin{pmatrix} -\mathbf{p}^\top\mathbf{p} \\ 1 \end{pmatrix},   (8.13)
\begin{pmatrix} \frac{p(\mathbf{u}')}{\mathbf{p}^\top\mathbf{p}} \\ 1 \end{pmatrix} = \mu_1 \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \mu_2 \begin{pmatrix} -1 \\ 1 \end{pmatrix}.   (8.14)
The solution is given by μ_1 = ½(1 + p(u')/(p^⊤p)) and μ_2 = ½(1 − p(u')/(p^⊤p)), and dual feasibility holds if |p(u')| ≤ p^⊤p.
Analyzing the three cases p(u*) < 0, p(u*) = 0, and p(u*) > 0 directly yields three distinct and complete thresholding steps. This yields a very efficient algorithm for evaluating the minimum of (8.7), first presented by [119]. Summarized, the solution computes as shown in Table 8.1.
Table 8.1 Computation of the minimum u* for (8.7)

Assume p(u*)     Thresholding check     Solution u* =
p(u*) < 0        p(u') < −p^⊤p          u' + p
p(u*) = 0        |p(u')| ≤ p^⊤p         u' − (p(u')/(p^⊤p)) p
p(u*) > 0        p(u') > p^⊤p           u' − p
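The three cases of Table 8.1 translate directly into a few comparisons per pixel. The following C++ sketch (illustrative names, assuming p ≠ 0) implements exactly these thresholding steps.

// Thresholding solution of (8.7) for one data term p(u) = p0 + p^T u,
// given the current estimate u' = (u1, u2), cf. Table 8.1.
struct Vec2 { double x, y; };

Vec2 solveSingleDataTerm(Vec2 u, double p0, Vec2 p) {
    const double pu = p0 + p.x * u.x + p.y * u.y;   // p(u')
    const double pp = p.x * p.x + p.y * p.y;        // p^T p, assumed > 0
    if (pu < -pp) {                                 // case p(u*) < 0
        return { u.x + p.x, u.y + p.y };
    } else if (pu > pp) {                           // case p(u*) > 0
        return { u.x - p.x, u.y - p.y };
    } else {                                        // case p(u*) = 0
        const double s = pu / pp;
        return { u.x - s * p.x, u.y - s * p.y };
    }
}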
8.3.3 Two Data Terms

In this section an efficient solution scheme to minimize the objective function (2.11) or, equivalently, its dual quadratic optimization problem (2.13) with inequality constraints (2.14) for two data terms is proposed and verified. To this end, we have two data terms, λ_1 p_1(u) = a_0 + a^⊤u and λ_2 p_2(u) = b_0 + b^⊤u, where a_0 and b_0 are constant and a and b represent the linear parts of the data terms. The minimization problem states
\min_{\mathbf{u}} \Bigl\{ \frac{1}{2}\|\mathbf{u} - \mathbf{u}'\|^2 + \underbrace{\bigl|a_0 + \mathbf{a}^\top\mathbf{u}\bigr|}_{\lambda_1 p_1(\mathbf{u})} + \underbrace{\bigl|b_0 + \mathbf{b}^\top\mathbf{u}\bigr|}_{\lambda_2 p_2(\mathbf{u})} \Bigr\}.   (8.15)
Its dual formulation with dual variables a' and b' is
\min_{\mathbf{u},\, a',\, b'} \Bigl\{ \frac{1}{2}\|\mathbf{u} - \mathbf{u}'\|^2 + a' + b' \Bigr\}   (8.16)
such that
(i) a_0 + \mathbf{a}^\top\mathbf{u} - a' \le 0,   (ii) -a_0 - \mathbf{a}^\top\mathbf{u} - a' \le 0,
(iii) b_0 + \mathbf{b}^\top\mathbf{u} - b' \le 0,   (iv) -b_0 - \mathbf{b}^\top\mathbf{u} - b' \le 0.
Differentiating this and taking it into the KKT conditions for the global optimum of the dual formulation yield additional dual variables μ_(i), μ_(ii), μ_(iii), and μ_(iv) and the following constraints.
Stationarity:
\begin{pmatrix} \mathbf{u}^* - \mathbf{u}' \\ 1 \\ 1 \end{pmatrix} + \mu_{(i)} \begin{pmatrix} \mathbf{a} \\ -1 \\ 0 \end{pmatrix} + \mu_{(ii)} \begin{pmatrix} -\mathbf{a} \\ -1 \\ 0 \end{pmatrix} + \mu_{(iii)} \begin{pmatrix} \mathbf{b} \\ 0 \\ -1 \end{pmatrix} + \mu_{(iv)} \begin{pmatrix} -\mathbf{b} \\ 0 \\ -1 \end{pmatrix} = 0.
Primal feasibility:
(i) a_0 + \mathbf{a}^\top\mathbf{u}^* - a' \le 0,   (ii) -a_0 - \mathbf{a}^\top\mathbf{u}^* - a' \le 0,   (iii) b_0 + \mathbf{b}^\top\mathbf{u}^* - b' \le 0,   (iv) -b_0 - \mathbf{b}^\top\mathbf{u}^* - b' \le 0.
Dual feasibility:
\mu_{(i)} \ge 0, \quad \mu_{(ii)} \ge 0, \quad \mu_{(iii)} \ge 0, \quad \mu_{(iv)} \ge 0.
Complementary slackness:
\mu_{(i)} \bigl(a_0 + \mathbf{a}^\top\mathbf{u}^* - a'\bigr) = 0, \quad \mu_{(ii)} \bigl(-a_0 - \mathbf{a}^\top\mathbf{u}^* - a'\bigr) = 0, \quad \mu_{(iii)} \bigl(b_0 + \mathbf{b}^\top\mathbf{u}^* - b'\bigr) = 0, \quad \mu_{(iv)} \bigl(-b_0 - \mathbf{b}^\top\mathbf{u}^* - b'\bigr) = 0.
Looking at all possible combinations of the data terms |p_i(u*)|, namely p_i(u*) ≤ 0, p_i(u*) ≥ 0, and p_i(u*) = 0, directly yields a thresholding scheme to minimize the objective function. If both p_1(u*) and p_2(u*) are strictly positive or negative (i.e. p_i(u*) ≠ 0), the optimal solution u* is found iff the thresholding checks in Table 8.2 apply.

Table 8.2 Minimum u* if p_1(u*) ≠ 0 and p_2(u*) ≠ 0

Assume p_{1,2}(u*)     Thresholding checks                                                  Solution u* =
p_1 ≥ 0, p_2 ≥ 0       a_0 + a^⊤(u' − a − b) ≥ 0  and  b_0 + b^⊤(u' − a − b) ≥ 0            u' − a − b
p_1 ≥ 0, p_2 ≤ 0       a_0 + a^⊤(u' − a + b) ≥ 0  and  b_0 + b^⊤(u' − a + b) ≤ 0            u' − a + b
p_1 ≤ 0, p_2 ≥ 0       a_0 + a^⊤(u' + a − b) ≤ 0  and  b_0 + b^⊤(u' + a − b) ≥ 0            u' + a − b
p_1 ≤ 0, p_2 ≤ 0       a_0 + a^⊤(u' + a + b) ≤ 0  and  b_0 + b^⊤(u' + a + b) ≤ 0            u' + a + b

Proof For p_1(u*) ≠ 0 and p_2(u*) ≠ 0 the solution can be found by setting the derivative of the unconstrained objective function equal to zero. Let us prove using the KKT conditions that this yields a global minimum iff the above thresholding steps succeed. Due to construction, exactly two inequality constraints are binding (their left side is 0). We set the μ_i for these constraints to 1 and the μ_i for the other two constraints to 0. This implies that the point is stationary because the solution is constructed to yield a derivative of zero. It directly follows that complementary slackness and dual feasibility hold. The thresholding check ensures primal feasibility of the solution and hence, iff the thresholding check succeeds, a global minimal solution is found.
Table 8.3 Minimum u* if p_1(u*) = 0 and p_2(u*) ≠ 0

Assume p_1(u*) = 0     Solution (checks follow)
p_2(u*) ≥ 0            u* = u' − b − (a/(a^⊤a)) (a_0 + a^⊤u' − a^⊤b)
p_2(u*) ≤ 0            u* = u' + b − (a/(a^⊤a)) (a_0 + a^⊤u' + a^⊤b)

Vanishing Data Terms If (at least) one of the data terms is binding, the solution space is restricted to yield either p_1(u*) = 0 or p_2(u*) = 0. In this subsection
thresholding checks to verify the necessary and sufficient conditions of a global minimum for the local solution in these restricted cases are derived.
For the following analysis, let us assume that the first data term at the global minimum vanishes, i.e. p_1(u*) = 0. The case p_2(u*) = 0 is equivalent (simply exchange the two data terms) and not handled explicitly. For a vanishing p_1(u*), three cases need to be examined: p_2(u*) < 0, p_2(u*) = 0, and p_2(u*) > 0. The case p_2(u*) = 0 is left out in the analysis. If all other cases do not yield a global minimum, it directly follows that both data terms must vanish; the unknown parameter vector u* can then be calculated from the two data term equations. For now we stick with p_1(u*) = 0 and assume either p_2(u*) > 0 or p_2(u*) < 0. The (possible) global minimum u* is computed by setting the derivative of the objective function (8.15) equal to zero using the assumptions made on the p_i(u*) (see Table 8.3).
Again, the KKT conditions have to be checked to verify a global optimum. Due to construction we have a' = 0, and the first two inequality constraints are binding (hence primal feasible). Out of the remaining two inequality constraints, one is primal feasible due to construction, as p_2(u*) ≤ 0 or p_2(u*) ≥ 0 directly yields one inequality constraint that is binding. The last constraint is checked by the thresholding step in Proposition 8.1.

Proposition 8.1 Check for primal feasibility using the following thresholding for the case p_1(u*) = 0:
for p_2(u*) ≥ 0 check
b_0 + \mathbf{b}^\top \Bigl( \mathbf{u}' - \mathbf{b} - \frac{\mathbf{a}}{\mathbf{a}^\top\mathbf{a}} \bigl( a_0 + \mathbf{a}^\top\mathbf{u}' - \mathbf{a}^\top\mathbf{b} \bigr) \Bigr) \ge 0,
for p_2(u*) ≤ 0 check
b_0 + \mathbf{b}^\top \Bigl( \mathbf{u}' + \mathbf{b} - \frac{\mathbf{a}}{\mathbf{a}^\top\mathbf{a}} \bigl( a_0 + \mathbf{a}^\top\mathbf{u}' + \mathbf{a}^\top\mathbf{b} \bigr) \Bigr) \le 0.
The complementary slackness condition states that the μ_i corresponding to the only non-binding inequality constraint has to be zero. This leaves us with three more μ_i, which have to be positive to fulfill the dual feasibility condition.
Example Let us assume p_2(u*) ≥ 0. It follows that the inequality constraint (iii) is binding and the inequality constraint (iv) in (8.16) is fulfilled. This directly implies that μ_(iv) has to be zero (as stated above).
Using the stationarity condition, an equation system to solve for the remaining μ_i is derived:
\begin{pmatrix} \mathbf{a} & -\mathbf{a} & \mathbf{b} \\ -1 & -1 & 0 \\ 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} \mu_{(i)} \\ \mu_{(ii)} \\ \mu_{(iii)} \end{pmatrix} = \begin{pmatrix} \mathbf{u}' - \mathbf{u}^* \\ -1 \\ -1 \end{pmatrix}.
From this it follows that μ_(iii) = 1 and the remaining system of equations yields a unique solution for μ_(i) and μ_(ii). This can be seen when plugging back in the possible solutions for u*, yielding
\mathbf{a}\,\bigl(\mu_{(i)} - \mu_{(ii)}\bigr) = \frac{\mathbf{a}}{\mathbf{a}^\top\mathbf{a}} \bigl( a_0 + \mathbf{a}^\top\mathbf{u}' - \mathbf{a}^\top\mathbf{b} \bigr), \qquad \mu_{(i)} + \mu_{(ii)} = 1.
The solution is given by
\mu_{(i)} = \frac{1}{2\,\mathbf{a}^\top\mathbf{a}} \bigl( a_0 + \mathbf{a}^\top\mathbf{u}' - \mathbf{a}^\top\mathbf{b} \bigr) + \frac{1}{2},   (8.17)
\mu_{(ii)} = \frac{-1}{2\,\mathbf{a}^\top\mathbf{a}} \bigl( a_0 + \mathbf{a}^\top\mathbf{u}' - \mathbf{a}^\top\mathbf{b} \bigr) + \frac{1}{2}.   (8.18)
The KKT dual feasibility constraint states that both μ_(i) and μ_(ii) have to be positive. This can be checked very efficiently:

Proposition 8.2 Check for dual feasibility using the following thresholding for the case p_1(u*) = 0:
for p_2(u*) ≥ 0 check
\frac{1}{\mathbf{a}^\top\mathbf{a}} \bigl| a_0 + \mathbf{a}^\top\mathbf{u}' - \mathbf{a}^\top\mathbf{b} \bigr| \le 1,
for p_2(u*) ≤ 0 check
\frac{1}{\mathbf{a}^\top\mathbf{a}} \bigl| a_0 + \mathbf{a}^\top\mathbf{u}' + \mathbf{a}^\top\mathbf{b} \bigr| \le 1.
Iff all KKT conditions hold, u∗ yields a global minimum of the objective function. Following the arguments above, the check for p2 (u∗ ) = 0 is straightforward. The presented thresholding scheme yields a very efficient and exact total variation scheme for the optical flow with two (linear) data terms.
Chapter 9
Appendix: Scene Flow Implementation Using Euler–Lagrange Equations
Image flow estimates according to the constraints formulated in Sect. 4.2 can be computed in a variational framework by minimizing an energy functional consisting of a data term (E_D, derived from the constraints) and a smoothness term, E_S, that enforces smooth and dense image flow. By integrating over the image domain Ω we obtain
E_{SF} = \int_\Omega (E_D + E_S)\, dx\, dy.
Using the constraints from (4.5) in Sect. 4.2 yields the following data term:
E_D = \Psi\bigl(E_{LF}^2\bigr) + occ(x, y)\,\Psi\bigl(E_{RF}^2\bigr) + occ(x, y)\,\Psi\bigl(E_{DF}^2\bigr),
where \Psi(s^2) = \sqrt{s^2 + \varepsilon} (ε = 0.0001) denotes a robust function that compensates for outliers [13] and the function occ(x, y) returns 0 if there is no disparity known at (x, y) (due to occlusion or a sparse stereo method), or 1 otherwise. The smoothness term penalizes the local deviations in the image flow components and employs the same robust function as the data term in order to deal with existing discontinuities in the velocity field:
E_S = \lambda\,\Psi\bigl(|\nabla u|^2\bigr) + \lambda\,\Psi\bigl(|\nabla v|^2\bigr) + \gamma\,\Psi\bigl(|\nabla d'|^2\bigr)
with \nabla = (\partial/\partial x, \partial/\partial y)^\top. The parameters λ and γ regulate the importance of the smoothness constraint, weighting for optic flow and disparity change, respectively. Interestingly, due to the fill-in effect of the above regularization, the proposed variational formulation provides dense image flow estimates [u, v, d']^⊤, even if the disparity d is non-dense.
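For reference, the robust function and its derivative with respect to s², which is needed for the Euler–Lagrange equations in the next section, can be written as follows (a trivial sketch; the constant ε follows the text).

#include <cmath>

// Robust penalty Psi(s^2) = sqrt(s^2 + eps) and its derivative
// Psi'(s^2) = 1 / (2 sqrt(s^2 + eps)), used as data and smoothness weights.
inline double psi(double s2, double eps = 1e-4) {
    return std::sqrt(s2 + eps);
}

inline double psiDeriv(double s2, double eps = 1e-4) {
    return 0.5 / std::sqrt(s2 + eps);
}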
9.1 Minimization of the Scene Flow Energy

For minimizing the above energy, its Euler–Lagrange equations are computed:
\Psi'\bigl(E_{LF}^2\bigr)\,E_{LF}\,I_x^L + c\,\Psi'\bigl(E_{RF}^2\bigr)\,E_{RF}\,I_x^R + c\,\Psi'\bigl(E_{DF}^2\bigr)\,E_{DF}\,I_x^R - \lambda\,\mathrm{div}\bigl(\Psi'(E_S)\,\nabla u\bigr) = 0,
\Psi'\bigl(E_{LF}^2\bigr)\,E_{LF}\,I_y^L + c\,\Psi'\bigl(E_{RF}^2\bigr)\,E_{RF}\,I_y^R + c\,\Psi'\bigl(E_{DF}^2\bigr)\,E_{DF}\,I_y^R - \lambda\,\mathrm{div}\bigl(\Psi'(E_S)\,\nabla v\bigr) = 0,
c\,\Psi'\bigl(E_{RF}^2\bigr)\,E_{RF}\,I_x^R + c\,\Psi'\bigl(E_{DF}^2\bigr)\,E_{DF}\,I_x^R - \gamma\,\mathrm{div}\bigl(\Psi'(E_S)\,\nabla d'\bigr) = 0,
where \Psi'(s^2) denotes the derivative of \Psi with respect to s^2. Partial derivatives of I(x + u, y + v, t)^L and I(x + d + d' + u, y + v, t)^R are denoted by subscripts x and y. For simplicity, c = occ(x, y).
These equations are non-linear in the unknowns, so we stick to the strategy of two nested fixed point iteration loops as suggested in [13]. This boils down to a warping scheme as also employed in [59]. The basic idea is to have an outer fixed point iteration loop that contains the linearization of E_{LF}, E_{RF}, and E_{DF}. In each iteration, an increment of the unknowns is estimated and the second image is then warped according to the new estimate. The warping is combined with a coarse-to-fine strategy, where the equations are evaluated on down-sampled images that are successively refined with the number of iterations. For real-time estimates of the image flow variables, four iterations with two outer fixed point iterations at each scale are used.
The image intensity functions for the left and right images are linearized according to the following equations, where k denotes the iteration index. The iterations are started with [u^0, v^0, d'^0]^\top = (0, 0, 0)^\top for all [x, y]^\top:
I\bigl(x + u^k + \delta u^k,\; y + v^k + \delta v^k,\; t\bigr)^L \approx I\bigl(x + u^k,\; y + v^k,\; t\bigr)^L + \delta u^k\, I_x^L + \delta v^k\, I_y^L,
I\bigl(x + d + d'^k + \delta d'^k + u^k + \delta u^k,\; y + v^k + \delta v^k,\; t\bigr)^R \approx I\bigl(x + d + d'^k + u^k,\; y + v^k,\; t\bigr)^R + \delta u^k\, I_{(x+d)}^R + \delta d'^k\, I_{(x+d)}^R + \delta v^k\, I_y^R.
From these expressions the linearized versions of E_{LF}, E_{RF}, and E_{DF} are derived. The remaining non-linearity in the Euler–Lagrange equations is due to the robust function. In the inner fixed point iteration loop the \Psi' expressions are kept constant and are recomputed after each iteration. This finally leads to the following linear equations:
\Psi'\bigl((E_{LF}^{k+1})^2\bigr)\bigl(E_{LF}^k + I_x^{L,k}\delta u^k + I_y^{L,k}\delta v^k\bigr)\, I_x^{L,k} + c\,\Psi'\bigl((E_{RF}^{k+1})^2\bigr)\bigl(E_{RF}^k + I_x^{R,k}(\delta u^k + \delta d'^k) + I_y^{R,k}\delta v^k\bigr)\, I_x^{R,k} - \lambda\,\mathrm{div}\bigl((E_S')^{k+1}\,\nabla(u^k + \delta u^k)\bigr) = 0,
\Psi'\bigl((E_{LF}^{k+1})^2\bigr)\bigl(E_{LF}^k + I_x^{L,k}\delta u^k + I_y^{L,k}\delta v^k\bigr)\, I_y^{L,k} + c\,\Psi'\bigl((E_{RF}^{k+1})^2\bigr)\bigl(E_{RF}^k + I_x^{R,k}(\delta u^k + \delta d'^k) + I_y^{R,k}\delta v^k\bigr)\, I_y^{R,k} - \lambda\,\mathrm{div}\bigl((E_S')^{k+1}\,\nabla(v^k + \delta v^k)\bigr) = 0,
c\,\Psi'\bigl((E_{RF}^{k+1})^2\bigr)\bigl(E_{RF}^k + I_x^{R,k}(\delta u^k + \delta d'^k) + I_y^{R,k}\delta v^k\bigr)\, I_x^{R,k} + c\,\Psi'\bigl((E_{DF}^{k+1})^2\bigr)\bigl(E_{DF}^k + I_x^{R,k}\delta d'^k\bigr)\, I_x^{R,k} - \gamma\,\mathrm{div}\bigl((E_S')^{k+1}\,\nabla(d'^k + \delta d'^k)\bigr) = 0
Fig. 9.1 The table indicates the real-time applicability of our algorithm if implemented on a modern GPU. The input images used (on different resolution scales) are shown below the table
with
(E_S')^{k+1} = \lambda\,\Psi'\bigl(|\nabla u^{k+1}|^2\bigr) + \lambda\,\Psi'\bigl(|\nabla v^{k+1}|^2\bigr) + \gamma\,\Psi'\bigl(|\nabla d'^{k+1}|^2\bigr).
We omitted the iteration index of the inner fixed point iteration loop to keep the notation uncluttered. Expressions with the iteration index k + 1 are computed using the current increments δu^k, δv^k, δd'^k. We see that some terms from the original Euler–Lagrange equations have vanished. This is due to the use of I(x + d, y, t)^R = I(x, y, t)^L from the linearized third constraint, (4.4). After discretization, the corresponding linear system is solved via successive over-relaxation. It is worth noting that, for efficiency reasons, it is advantageous to update the \Psi' after a few iterations of SOR.
9.2 Implementation of Scene Flow

The scene flow algorithm was implemented in C++, obtaining a speed of 5 Hz on a 3.0 GHz Intel® Core™2 CPU for QVGA images of 320 × 240 pixels. The implementation in CUDA for the GPU (NVidia® GeForce GTX 480) allows for a frame rate of 20 Hz as indicated in Fig. 9.1. The settings for the computational times were two outer iterations (warps), 15 inner iterations, and three SOR iterations at each pyramid level. The parameters used are λ = 0.06, γ = 0.6, and ω = 1.99 for the over-relaxation. Since we are interested in real-time estimates, we use four levels in the pyramid, with a down-sampling rate of 0.5, i.e., each image is half the dimensions in both the x and y directions so |Ω| is cut down by 75% at each level. Although the energy on a smaller pyramid level is not exactly the same, it is a close approximation of the energy on the higher resolved images.
Fig. 9.2 Break down of computational time for our algorithm (3.0 GHz Intel®Core™2 and NVidia®GeForce GTX 480) on 640 × 480 px images
1: for all levels do
2:   for all outer iterations do
3:     Compute Structure (Algorithm 9.2)
4:     Compute Diffusivity (Algorithm 9.3)
5:     u_tmp = u
6:     v_tmp = v
7:     p_tmp = p
8:     for all inner iterations do
9:       Build Equation System (Algorithm 9.4)
10:      for all SOR iterations do
11:        SOR Step (Algorithm 9.5)
12:      end for
13:    end for
14:    Warp L(x, y, t) and R(x + d, y, t) using u, v and p.
15:  end for
16:  Warp u, v and p to upper level (double size and interpolate).
17: end for
Algorithm 9.1: Scene flow pseudo-code
Figure 9.2 shows the breakdown of the computational time for our scene flow algorithm. Note that the CPU processing includes memory management and the computation of the image pyramids, which offers some potential for optimization.
The overview in pseudo-code for implementing the scene flow algorithm is shown in Algorithm 9.1. The scene flow variables in the algorithm are referred to as (u, v, p) instead of (u, v, d') to avoid cluttered notation. It calls the subroutines described in Algorithms 9.2 to 9.5. Algorithm 9.2 computes the spatial and temporal derivatives of the warped input images for the energy equations. The spatial derivatives are the average of the two contributing images. ∇ is computed using central differences and any reference out-
1: for all pixels do
2:   [Lx Ly] = ½ (∇L(x + u, y + v, t + 1) + ∇L(x, y, t))
3:   Lt = L(x + u, y + v, t + 1) − L(x, y, t)
4:   if disparity d known then
5:     [Rx Ry] = ½ (∇R(x + u + d + p, y + v, t + 1) + ∇R(x + d, y, t))
6:     Rt = R(x + u + d + p, y + v, t + 1) − R(x + d, y, t)
7:     Dx = ½ (∂/∂x R(x + u + d + p, y + v, t + 1) + ∂/∂x L(x + u, y + v, t + 1))
8:     Dt = R(x + u + d + p, y + v, t + 1) − L(x + u, y + v, t + 1)
9:   else
10:    [Rx Ry] = 0
11:    Dx = 0
12:  end if
13: end for

Algorithm 9.2: Compute structure
1: for all pixels do
2:   for all α ∈ {u, v, p} do
3:     R_{α,north} = (α(x, y) − α(x, y − 1))² + 1/16 (α(x + 1, y) − α(x − 1, y) + α(x + 1, y − 1) − α(x − 1, y − 1))²
4:     R_{α,east} = (α(x, y) − α(x − 1, y))² + 1/16 (α(x, y + 1) − α(x, y − 1) + α(x − 1, y + 1) − α(x − 1, y − 1))²
5:     R_{α,south} = (α(x, y + 1) − α(x, y))² + 1/16 (α(x + 1, y + 1) − α(x − 1, y + 1) + α(x + 1, y) − α(x − 1, y))²
6:     R_{α,west} = (α(x + 1, y) − α(x, y))² + 1/16 (α(x + 1, y + 1) − α(x + 1, y − 1) + α(x, y + 1) − α(x, y − 1))²
7:   end for
8:   for all dir ∈ {north, east, south, west} do
9:     R_dir = λ / (2 √(R_{u,dir} + R_{v,dir} + (λ²/γ²) R_{p,dir}))
10:  end for
11: end for
Algorithm 9.3: Compute diffusivity
side Ω is clamped to the boundary (reflecting boundary conditions for the central difference operator). The second algorithm within the outer loops, Algorithm 9.3, computes the diffusivities of the three-dimensional scene flow field. It consists of a combination of forward differences and central differences as proposed by Brox in [11]. All values outside Ω are clamped to the boundary (reflecting boundary conditions for the central difference operator).
1: for all pixels do
2:   E_LF = Lt + (u_tmp − u) Lx + (v_tmp − v) Ly
3:   Ψ_LF = 1 / √(E_LF² + ε²)
4:   E_RF = Rt + (u_tmp + p_tmp − u − p) Rx + (v_tmp − v) Ry
5:   Ψ_RF = 1 / √(E_RF² + ε²)
6:   E_DF = Dt + (p_tmp − p) Dx
7:   Ψ_DF = 1 / √(E_DF² + ε²)
8:   A_uu = Ψ_LF Lx Lx + Ψ_RF Rx Rx
9:   A_uv = Ψ_LF Lx Ly + Ψ_RF Rx Ry
10:  A_vv = Ψ_LF Ly Ly + Ψ_RF Ry Ry
11:  A_up = Ψ_RF Rx Rx
12:  A_vp = Ψ_RF Rx Ry
13:  A_pp = Ψ_RF Rx Rx + Ψ_DF Dx Dx
14:  b_u = Ψ_LF Lx (Lt + u_tmp Lx + v_tmp Ly) + Ψ_RF Rx (Rt + (u_tmp + p_tmp) Rx + v_tmp Ry)
15:  b_v = Ψ_LF Ly (Lt + u_tmp Lx + v_tmp Ly) + Ψ_RF Ry (Rt + (u_tmp + p_tmp) Rx + v_tmp Ry)
16:  b_p = Ψ_RF Rx (Rt + (u_tmp + p_tmp) Rx + v_tmp Ry) + Ψ_DF Dx (Dt + p_tmp Dx)
17: end for
Algorithm 9.4: Build equation system
Note that this is computed once per warp before the current scene flow variables (u, v, p) are copied into the temporary variables (u_tmp, v_tmp, p_tmp) to solve for the updates (δu, δv, δp). However, instead of solving for the update and updating the original flow variables, the temporary variables are used inside Algorithm 9.4 to directly solve for the resulting flow field. Note the instances of u_tmp − u where the delta-updates are actually needed. We found that this trick speeds up the implementation of the flow field considerably. Last, Algorithm 9.5 executes the inner iterations of the successive over-relaxation. On the GPU, this is implemented using the over-relaxed red-black Gauss–Seidel approach; see [88].
1: for all pixels do
2:   R_sum = R_north + R_south + R_west + R_east + ε²
3:   u_R = R_north u(x, y − 1) + R_south u(x, y + 1) + R_west u(x − 1, y) + R_east u(x + 1, y)
4:   u(x, y) = (1 − ω) u(x, y) + ω / (A_uu + R_sum) (u_R − b_u − A_uv v(x, y) − A_up p(x, y))
5:   v_R = R_north v(x, y − 1) + R_south v(x, y + 1) + R_west v(x − 1, y) + R_east v(x + 1, y)
6:   v(x, y) = (1 − ω) v(x, y) + ω / (A_vv + R_sum) (v_R − b_v − A_uv u(x, y) − A_vp p(x, y))
7:   p_R = R_north p(x, y − 1) + R_south p(x, y + 1) + R_west p(x − 1, y) + R_east p(x + 1, y)
8:   p(x, y) = (1 − ω) p(x, y) + ω / (A_pp + R_sum) (p_R − b_p − A_up u(x, y) − A_vp v(x, y))
9: end for

Algorithm 9.5: SOR step
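The essence of the red-black scheme is that pixels of equal parity do not read each other's values within one sweep and can therefore be processed in parallel. The following CPU-side sketch shows the ordering; the callback stands for the per-pixel update of Algorithm 9.5 and is an assumption of this sketch.

#include <functional>

// One over-relaxed red-black Gauss-Seidel sweep: first all "red" pixels
// ((x + y) even), then all "black" pixels ((x + y) odd).  On the GPU each
// parity can be processed fully in parallel.
void redBlackSweep(int width, int height,
                   const std::function<void(int, int)>& sorUpdate) {
    for (int parity = 0; parity < 2; ++parity) {
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                if (((x + y) & 1) == parity)
                    sorUpdate(x, y);   // per-pixel update of Algorithm 9.5
    }
}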
Glossary
Aperture problem The aperture problem arises as a consequence of motion ambiguity when an object is viewed through a small aperture. For optical flow estimation the two-dimensional optical flow vector (two unknowns) cannot be uniquely determined from the gray value consistency assumption of a single pixel (only one observation). In this case, if the subspace of observations for a set of pixels is one-dimensional, this is known as the aperture problem. The aperture problem arises for instance if an untextured object moves within the image. Then the motion inside the object cannot be recovered without additional information.

Eigenvalue If A is a square matrix and a non-zero vector v exists with Av = λv, then λ is called an eigenvalue of A. If A is symmetric, it can be decomposed into A = UΣU^⊤, where U is an orthogonal rotation matrix and Σ is a diagonal matrix with the eigenvalues on the diagonal.

Epipolar lines An epipolar line in the right image is the projection of a viewing ray of the left camera image. Therefore, for each pixel in the left camera image, there exists one distinct epipolar line in the right camera image (and vice versa).

Fundamental matrix The matrix describing the relative orientation (outer orientation) and camera parameters (inner orientation) of the two cameras of a stereo camera system.

Gaussian distribution The normal distribution is also called the Gaussian distribution, with the Gaussian density function f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\bigl(-\frac{(x-\mu)^2}{2\sigma^2}\bigr), where μ is the mean and σ² is the variance.

Graph cut Computes the minimal solution to an energy functional by maximizing the flow over the corresponding network (determined by the bottleneck of the graph). The network consists of a directed graph, where edges are assigned weights, and distinct source and sink (terminal) nodes. See also s–t-separating cut.

Image function The image sequence is considered as a function I(x, y, t) of the gray values at pixel position x = (x, y) and at time t.

Kalman filter The Kalman filter is a sensor fusion algorithm. It uses a physically based dynamic model together with control inputs and measurements to obtain an estimate of a system's state. It is optimal if the dynamic system is linear.
Laplace distribution  The Laplace distribution is also called the double exponential distribution, defined as f(x) = 1/(2b) · exp(−|x − μ|/b) with location parameter μ (the mean) and scale b > 0. The variance is σ² = 2b².

Mahalanobis distance  A distance measure based on correlations between variables. The Mahalanobis distance of two random vectors x and y from the same distribution with covariance matrix S is defined as d(x, y) = sqrt((x − y)^T S^(−1) (x − y)). It becomes the Euclidean distance if the covariance matrix is the identity matrix.

Optical flow  The two-dimensional apparent motion field between two consecutive images of an image sequence.

Scene flow  The three-dimensional motion field in image coordinates (image scene flow) or world coordinates (also world flow). In image coordinates it describes the change in image position (x, y) as well as the change in disparity d. In world coordinates it describes the three components of the translation vector of a world point (X, Y, Z) between two frames.

s–t-separating cut  A cut in the network of a directed graph consisting of nodes, weighted edges, and two distinct nodes, the source node (s) and the sink node (terminal node, t). An s–t-separating cut is a cut such that no connection remains between the source and terminal nodes. See also graph cut.

Triangulation  The reconstruction of a world point from two corresponding pixels of the two stereo camera images.

Warping  The resampling of an image according to a non-linear pixel access function (e.g. the stereo disparity or the optic flow offset).
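As a small illustration of the last entry, the following C++ sketch warps an image with a dense optical flow field using bilinear interpolation; the buffer layout and function names are purely illustrative.

#include <vector>
#include <algorithm>
#include <cstddef>

// Sample image I at the non-integer position (x, y) with bilinear interpolation.
inline float sampleBilinear(const std::vector<float>& I, int w, int h, float x, float y)
{
    x = std::min(std::max(x, 0.0f), static_cast<float>(w - 1));
    y = std::min(std::max(y, 0.0f), static_cast<float>(h - 1));
    const int x0 = static_cast<int>(x), y0 = static_cast<int>(y);
    const int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
    const float fx = x - x0, fy = y - y0;
    return (1 - fx) * (1 - fy) * I[y0 * w + x0] + fx * (1 - fy) * I[y0 * w + x1]
         + (1 - fx) * fy * I[y1 * w + x0] + fx * fy * I[y1 * w + x1];
}

// Warp the second image towards the first one using a dense flow field (u, v).
std::vector<float> warpImage(const std::vector<float>& I2,
                             const std::vector<float>& u, const std::vector<float>& v,
                             int w, int h)
{
    std::vector<float> warped(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            warped[y * w + x] = sampleBilinear(I2, w, h, x + u[y * w + x], y + v[y * w + x]);
    return warped;
}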
References
1. Alvarez, L., Esclarín, J., Lefébure, M., Sánchez, J.: A PDE model for computing the optical flow. In: Proc. XVI Congreso de Ecuaciones Diferenciales y Aplicaciones, Gran Canaria, Spain, pp. 1349–1356 (1999) 2. Aubert, G., Deriche, R., Kornprobst, P.: Computing optical flow via variational techniques. SIAM J. Appl. Math. 60(1), 156–182 (1999) 3. Aujol, J.F., Gilboa, G., Chan, T.F., Osher, S.J.: Structure-texture image decomposition: modeling, algorithms, and parameter selection. Int. J. Comput. Vis. 67(1) (2006). doi:10.1007/s11263-006-4331-z 4. Badino, H.: A robust approach for ego-motion estimation using a mobile stereo platform. In: Proc. International Workshop on Complex Motion, Günzburg, Germany, pp. 198–208 (2004) 5. Baker, S., Matthews, I.: Lucas-Kanade 20 years on: a unifying framework. Int. J. Comput. Vis. 56(3), 221–255 (2004) 6. Baker, S., Roth, S., Scharstein, D., Black, M.J., Lewis, J.P., Szeliski, R.: A database and evaluation methodology for optical flow. In: Online-Proc. International Conference on Computer Vision, Rio de Janeiro, Brazil, October 2007 7. Barni, M., Cappellini, V., Mecocci, A.: Fast vector median filter based on Euclidean norm approximation. IEEE Signal Process. Lett. 1(6), 92–94 (1994) 8. Barth, A., Franke, U.: Where will the oncoming vehicle be the next second? In: Proc. IEEE Intelligent Vehicles Symposium, Eindhoven, Netherlands, pp. 1068–1073 (2008) 9. Black, M.J., Anandan, P.: A framework for the robust estimation of optical flow. In: Proc. International Conference on Computer Vision, Nice, France, pp. 231–236 (1993) 10. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1124–1137 (2004) 11. Brox, T.: From pixels to regions: partial differential equations in image analysis. Ph.D. thesis, Faculty of Mathematics and Computer Science, Saarland University, Germany (2005) 12. Brox, T., Malik, J.: Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. Pattern Anal. Mach. Intell. (2010). doi:10.1109/TPAMI.2010.143 13. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Proc. European Conference on Computer Vision, Prague, Czech Republic, pp. 25–36 (2004) 14. Brox, T., Rosenhahn, B., Cremers, D., Seidel, H.-P.: High accuracy optical flow serves 3-D pose tracking: exploiting contour and flow based constraints. In: Proc. European Conference on Computer Vision, Graz, Austria, pp. 98–111 (2006) 15. Brox, T., Bregler, C., Malik, J.: Large displacement optical flow. In: Proc. International Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, June 2009
16. Bruhn, A., Weickert, J.: Towards ultimate motion estimation: combining highest accuracy with real-time performance. In: Proc. Tenth International Conference on Computer Vision, vol. 1, pp. 749–755 (2005) 17. Bruhn, A., Weickert, J.: A confidence measure for variational optic flow methods. In: Geometric Properties for Incomplete Data, pp. 283–298 (2006) 18. Bruhn, A., Weickert, J., Feddern, C., Kohlberger, T., Schnörr, C.: Variational optic flow computation in real-time. IEEE Trans. Image Process. 14(5), 608–615 (2005) 19. Bruhn, A., Weickert, J., Kohlberger, T., Schnörr, C.: A multigrid platform for real-time motion computation with discontinuity-preserving variational methods. Int. J. Comput. Vis. 70(3), 257–277 (2006) 20. Chambolle, A.: An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 20(1), 89–97 (2004) 21. Chambolle, A.: Total variation minimization and a class of binary MRF models. In: Proc. International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, St. Augustine, FL, USA, pp. 136–152 (2005) 22. Chambolle, A., Caselles, V., Cremers, D., Novaga, M., Pock, T.: An introduction to total variation for image analysis. In: Theoretical Foundations and Numerical Methods for Sparse Recovery. De Gruyter, Berlin (2010) 23. Chan, T.F., Golub, G.H., Mulet, P.: A nonlinear primal-dual method for total variation-based image restoration. SIAM J. Sci. Comput. 20(6), 1964–1977 (1999) 24. Cohen, I.: Nonlinear variational method for optical flow computation. In: Scandinavian Conf. on Image Analysis, pp. 523–523 (1993) 25. Coleman, T.F., Hulbert, L.A.: A direct active set algorithm for large sparse quadratic programs with simple bounds. Math. Program., Sers. A, B 45(3), 373–406 (1989) 26. Cooke, T.: Two applications of graph-cuts to image processing. In: Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, pp. 498–504 (2008) 27. Corpetti, T., Memin, E., Perez, P.: Dense estimation of fluid flows. IEEE Trans. Pattern Anal. Mach. Intell. 24(3), 365–380 (2002) 28. Costeira, J., Kanade, T.: A multi-body factorization method for motion analysis. In: Proc. International Conference on Computer Vision, pp. 1071–1076 (1995) 29. Cremers, D., Soatto, S.: Motion competition: a variational framework for piecewise parametric motion segmentation. Int. J. Comput. Vis. 62(3), 249–265 (2005) 30. Deriche, R., Kornprobst, P., Aubert, G.: Optical flow estimation while preserving its discontinuities: a variational approach. In: Proc. Asian Conference on Computer Vision, Singapore, pp. 290–295 (1995) 31. Devillard, N.: Fast median search: an ANSI C implementation (1998). http://ndevilla.free.fr/median/median/index.html 32. Felsberg, M.: On the relation between anisotropic diffusion and iterated adaptive filtering. In: Pattern Recognition (Proc. DAGM), Munich, Germany, pp. 436–445 (2008) 33. Franke, U., Joos, A.: Real-time stereo vision for urban traffic scene understanding. In: Proc. IEEE Intelligent Vehicles Symposium, Dearborn, pp. 273–278 (2000) 34. Goldluecke, B., Cremers, D.: Convex relaxation for multilabel problems with product label spaces. In: Proc. European Conference on Computer Vision (2010) 35. Gong, M.: Real-time joint disparity and disparity flow estimation on programmable graphics hardware. Comput. Vis. Image Underst. 113(1), 90–100 (2009) 36. Gong, M., Yang, Y.-H.: Disparity flow estimation using orthogonal reliability-based dynamic programming. In: Proc.
International Conference on Pattern Recognition, pp. 70–73. IEEE Computer Society, Los Alamitos (2006) 37. Hadamard, J.: Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin, pp. 49–52 (1902) 38. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004) 39. Haussecker, H., Fleet, D.: Estimating optical flow with physical models of brightness variation. IEEE Trans. Pattern Anal. Mach. Intell. 23(6), 661–673 (2001)
40. Hirschmüller, H.: Stereo vision in structured environments by consistent semi-global matching. In: Proc. International Conference on Computer Vision and Pattern Recognition, New York, NY, USA, pp. 2386–2393 (2006) 41. Hirschmüller, H.: Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 328–341 (2008) 42. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artif. Intell. 17(1–3), 185–203 (1981) 43. Huguet, F., Devernay, F.: A variational method for scene flow estimation from stereo sequences. In: Online-Proc. International Conference on Computer Vision, Rio de Janeiro, Brazil, October 2007 44. Isard, M., MacCormick, J.: Dense motion and disparity estimation via loopy belief propagation. In: Proc. Asian Conference on Computer Vision, Hyderabad, India, pp. 32–41 (2006) 45. Kalman, R.E.: A new approach to linear filtering and prediction problems. J. Basic Eng. 82(D), 35–45 (1960) 46. Kanatani, K., Sugaya, Y.: Multi-stage optimization for multi-body motion segmentation. IEICE Trans. Inf. Syst. E87-D(7), 1935–1942 (2004) 47. Karush, W.: Minima of functions of several variables with inequalities as side constraints. Ph.D. thesis, Dept. of Mathematics, University of Chicago (1939) 48. Klappstein, J.: Optical-flow based detection of moving objects in traffic scenes. Ph.D. thesis, University of Heidelberg, Heidelberg, Germany (2008) 49. Klappstein, J., Stein, F., Franke, U.: Monocular motion detection using spatial constraints in a unified manner. In: Proc. IEEE Intelligent Vehicles Symposium, Tokyo, Japan, pp. 261–267 (2006) 50. Klappstein, J., Stein, F., Franke, U.: Applying Kalman filtering to road homography estimation. In: Workshop on Planning, Perception and Navigation for Intelligent Vehicles in Conjunction with IEEE International Conference on Robotics and Automation (ICRA), Rome, Italy, April 2007 51. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? In: Proc. European Conference on Computer Vision, Copenhagen, Denmark, pp. 65–81 (2002) 52. Kolmogorov, V., Criminisi, A., Blake, A., Cross, G., Rother, C.: Bi-layer segmentation of binocular stereo video. In: Proc. International Conference on Computer Vision and Pattern Recognition, pp. 1186–1192 (2005) 53. Lempitsky, V., Roth, S., Rother, C.: Fusionflow: discrete-continuous optimization for optical flow estimation. In: Online-Proc. International Conference on Computer Vision and Pattern Recognition, Anchorage, USA, June 2008 54. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proc. of the 7th International Joint Conference on Artificial Intelligence (IJCAI), Vancouver, British Columbia, Canada, pp. 674–679 (1981) 55. Luenberger, D.G.: Linear and Nonlinear Programming. Addison-Wesley, Reading (1984) 56. Luong, Q., Faugeras, O.: The fundamental matrix: theory, algorithms, and stability analysis. Int. J. Comput. Vis. 17(1), 43–75 (1996) 57. Mahalanobis, P.C.: On the generalised distance in statistics. In: Proc. of the National Institute of Science of India, India, vol. 12, pp. 49–55 (1936) 58. Masoomzadeh-Fard, A., Venetsanopoulos, A.N.: An efficient vector ranking filter for colour image restoration. In: Proc. Canadian Conference on Electrical and Computer Engineering, Vancouver, BC, Canada, pp. 1025–1028 (1993) 59. Mémin, E., Pérez, P.: Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Trans. Image Process. 7(5), 703–719 (1998) 60. 
Mémin, E., Pérez, P.: A multigrid approach for hierarchical motion estimation. In: Proc. International Conference on Computer Vision, Bombay, India, pp. 933–938 (1998) 61. Meyer, Y.: Oscillating patterns in image processing and in some nonlinear evolution equations. In: The Fifteenth Dean Jacqueline B. Lewis Memorial Lectures, March 2001 62. Mileva, Y., Bruhn, A., Weickert, J.: Illumination-robust variational optical flow with photometric invariants. In: Pattern Recognition (Proc. DAGM), pp. 152–162 (2007)
63. Min, D., Sohn, K.: Edge-preserving simultaneous joint motion-disparity estimation. In: Proc. International Conference on Pattern Recognition, Hong Kong, China, pp. 74–77 (2006) 64. Nagel, H.H., Enkelmann, W.: An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Trans. Pattern Anal. Mach. Intell. 8(5), 565–593 (1986) 65. Nagel, H.H.: Constraints for the estimation of displacement vector fields from image sequences. In: Proc. Eighth Int. Conf. Artif. Intell., Karlsruhe, Germany, pp. 945–951 (1983) 66. Nagel, H.-H., Enkelmann, W.: An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Trans. Pattern Anal. Mach. Intell. 8(5), 565–593 (1986) 67. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic flow computation with theoretically justified warping. Int. J. Comput. Vis. 67(2), 141–158 (2006) 68. Patras, I., Alvertos, N., Tziritas, G.: Joint disparity and motion field estimation in stereoscopic image sequences. In: Proc. International Conference on Pattern Recognition, Vienna, Austria, pp. 359–363 (1996) 69. Patras, I., Hendriks, E.A., Tziritas, G.: A joint motion/disparity estimation method for the construction of stereo interpolated images in stereoscopic image sequences. In: Proc. Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, June 1997 70. Pock, T.: Fast total variation for computer vision. Ph.D. thesis, Institute for Computer Graphics and Vision, University of Graz, Graz, Austria (2008) 71. Pock, T., Cremers, D., Bischof, H., Chambolle, A.: Global solutions of variational models with convex regularization. SIAM J. Imag. Sci. 3(4), 1122–1145 (2010) 72. Pons, J.-P., Keriven, R., Faugeras, O.: Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. Int. J. Comput. Vis. 72(2), 179–193 (2007) 73. Rabe, C., Volmer, C., Franke, U.: Kalman filter based detection of obstacles and lane boundary. Autonome Mobile Systeme 19(1), 51–58 (2005) 74. Rabe, C., Franke, U., Gehrig, S.: Fast detection of moving objects in complex scenarios. In: Proc. IEEE Intelligent Vehicles Symposium, Istanbul, Turkey, pp. 398–403 (2007) 75. Rabe, C., Müller, T., Wedel, A., Franke, U.: Dense, robust, and accurate motion field estimation from stereo image sequences in real-time. In: Proc. European Conference on Computer Vision, pp. 582–595. Springer, Berlin (2010) 76. Rao, S.R., Tron, R., Vidal, R., Ma, Y.: Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In: Proc. International Conference on Computer Vision and Pattern Recognition (2008) 77. Rosenhahn, B., Brox, T., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose tracking. Int. J. Comput. Vis. 73(3), 243–262 (2007) 78. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992) 79. Ruhnau, P., Schnoerr, C.: Variational estimation of experimental fluid flows with physicsbased spatio-temporal regularization. Meas. Sci. Technol. 18, 755–763 (2007) 80. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In: Proc. International Conference on Computer Vision, pp. 7–42. IEEE Computer Society, Los Alamitos (2002) 81. Schnörr, C.: Segmentation of visual motion by minimizing convex non-quadratic functionals. 
In: 12th Int. Conf. on Pattern Recognition, Jerusalem, Israel, pp. 661–663 (1994) 82. Shimizu, M., Okutomi, M.: Precise sub-pixel estimation on area-based matching. In: Proc. International Conference on Computer Vision, Vancouver, Canada, pp. 90–97 (2001) 83. Slesareva, N., Bruhn, A., Weickert, J.: Optic flow goes stereo: a variational method for estimating discontinuity-preserving dense disparity maps. In: Pattern Recognition (Proc. DAGM), Vienna, Austria, pp. 33–40 (2005) 84. Spies, H., Kirchgeßner, N., Scharr, H., Jähne, B.: Dense structure estimation via regularised optical flow. In: Proc. Vision, Modeling, and Visualization, Saarbrücken, Germany, pp. 57–64 (2000)
85. Stein, F.: Efficient computation of optical flow using the Census transform. In: Pattern Recognition (Proc. DAGM), Tübingen, Germany, pp. 79–86 (2004) 86. Steinbruecker, F., Pock, T., Cremers, D.: Large displacement optical flow computation without warping. In: IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan (2009) 87. Stewart, E.: Intel Integrated Performance Primitives: How to Optimize Software Applications Using Intel IPP. Intel Press, Santa Clara (2004) 88. Stüben, K., Trottenberg, U.: Multigrid Methods: Fundamental Algorithms, Model Problem Analysis and Applications. Lecture Notes in Mathematics, vol. 960. Springer, Berlin (1982) 89. Sun, J., Zhang, W., Tang, X., Shum, H.: Background cut. In: Proc. European Conference on Computer Vision, Graz, Austria, pp. 628–641 (2006) 90. Sun, D., Roth, S., Lewis, J.P., Black, M.J.: Learning optical flow. In: Proc. European Conference on Computer Vision, Marseille, France, pp. 83–91 (2008) 91. Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: Proc. International Conference on Computer Vision and Pattern Recognition, pp. 2432–2439 (2010) 92. Tomasi, C., Kanade, T.: Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon University, April 1991 93. Torr, P.H.S., Zisserman, A.: Concerning Bayesian motion segmentation, model averaging, matching and the trifocal tensor. In: Proc. European Conference on Computer Vision, Freiburg, Germany, pp. 511–528 (1998) 94. Trobin, W., Pock, T., Cremers, D., Bischof, H.: Continuous energy minimization via repeated binary fusion. In: European Conference on Computer Vision (ECCV), Marseille, France, October 2008 95. Trobin, W., Pock, T., Cremers, D., Bischof, H.: An unbiased second-order prior for high-accuracy motion estimation. In: Pattern Recognition (Proc. DAGM), Munich, Germany, pp. 396–405 (2008) 96. Unger, M., Pock, T., Bischof, H.: Continuous globally optimal image segmentation with local constraints. In: Computer Vision Winter Workshop, February 2008 97. Valgaerts, L., Bruhn, A., Weickert, J.: A variational approach for the joint recovery of the optical flow and the fundamental matrix. In: Pattern Recognition (Proc. DAGM), Munich, Germany, pp. 314–324 (2008) 98. van de Weijer, J., Gevers, T.: Robust optical flow from photometric invariants. In: Int. Conf. on Image Processing, Singapore, pp. 1835–1838 (2004) 99. Vaudrey, T., Gruber, D., Wedel, A., Klappstein, J.: Space-time multi-resolution banded graph-cut for fast segmentation. In: Pattern Recognition (Proc. DAGM), Munich, Germany, pp. 203–213 (2008) 100. Vaudrey, T., Rabe, C., Klette, R., Milburn, J.: Differences between stereo and motion behaviour on synthetic and real-world stereo sequences. In: Online-Proc. Image and Vision Computing New Zealand, Christchurch, NZ, November 2008 101. Vaudrey, T., Wedel, A., Rabe, C., Klappstein, J., Klette, R.: Evaluation of moving object segmentation comparing 6D-Vision and monocular motion constraints. In: Online-Proc. Image and Vision Computing New Zealand, Christchurch, New Zealand, November 2008 102. Vaudrey, T., Wedel, A., Klette, R.: A methodology for evaluating illumination artifact removal for corresponding images. In: Proc. International Conference on Computer Analysis of Images and Patterns, Münster, Germany, September 2009 103. Vedula, S., Baker, S., Rander, P., Collins, R., Kanade, T.: Three-dimensional scene flow. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 475–480 (2005) 104.
Vidal, R., Sastry, S.: Optimal segmentation of dynamic scenes from two perspective views. In: Proc. International Conference on Computer Vision and Pattern Recognition, Madison, WI, USA, vol. 2, pp. 281–285 (2003) 105. Wedel, A.: 3d motion analysis via energy minimization. Ph.D. thesis, Faculty of Mathematics and Natural Sciences, Bonn University, Germany, December 2009
106. Wedel, A., Pock, T., Braun, J., Franke, U., Cremers, D.: Duality TV-L1 flow with fundamental matrix prior. In: Online-Proc. Image and Vision Computing New Zealand, Christchurch, New Zealand, November 2008 107. Wedel, A., Pock, T., Zach, C., Cremers, D., Bischof, H.: An improved algorithm for TV-L1 optical flow. In: Workshop on Statistical and Geometrical Approaches to Visual Motion Analysis, Schloss Dagstuhl, Germany, September 2008 108. Wedel, A., Rabe, C., Vaudrey, T., Brox, T., Franke, U., Cremers, D.: Efficient dense scene flow from sparse or dense stereo data. In: Proc. European Conference on Computer Vision, Marseille, France, pp. 739–751 (2008) 109. Wedel, A., Vaudrey, T., Meissner, A., Rabe, C., Brox, T., Franke, U., Cremers, D.: Decoupling motion and position of image flow with an evaluation approach to scene flow. In: Workshop on Statistical and Geometrical Approaches to Visual Motion Analysis, Schloss Dagstuhl, Germany, September 2008 110. Wedel, A., Pock, T., Cremers, D.: Structure- and motion-adaptive regularization for high accuracy optic flow. In: Proc. International Conference on Computer Vision, Kyoto, Japan, August 2009 111. Wedel, A., Rabe, C., Meissner, A., Franke, U., Cremers, D.: Detection and segmentation of independently moving objects from dense scene flow. In: Proc. International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, Bonn, Germany, August 2009 112. Wedel, A., Brox, T., Vaudrey, T., Rabe, C., Franke, U., Cremers, D.: Stereoscopic scene flow computation for 3d motion understanding. Int. J. Comput. Vis. 1–23 (2010). doi:10.1007/s11263-010-0404-0 113. Weickert, J., Brox, T.: Diffusion and regularization of vector- and matrix-valued images. In: Inverse Problems, Image Analysis and Medical Imaging. Contemporary Mathematics, vol. 313, pp. 251–268. AMS, Providence (2002) 114. Weickert, J., Schnörr, C.: A theoretical framework for convex regularizers in PDE-based computation of image motion. Int. J. Comput. Vis. 45(3), 245–264 (2001) 115. Yan, J., Pollefeys, M.: A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In: Proc. European Conference on Computer Vision, vol. 3954, pp. 94–106. Springer, Berlin (2006) 116. Yang, Z., Fox, M.D.: Speckle reduction and structure enhancement by multichannel median boosted anisotropic diffusion. EURASIP J. Appl. Signal Process. 2492–2502 (2004). doi:10.1155/S1110865704402091 117. Yin, W., Goldfarb, D., Osher, S.: Image cartoon-texture decomposition and feature selection using the total variation regularized L1 functional. In: Variational, Geometric, and Level Set Methods in Computer Vision, Beijing, China, pp. 73–80 (2005) 118. Zabih, R., Woodfill, J.: Non-parametric local transforms for computing visual correspondence. In: Proc. European Conference on Computer Vision, Stockholm, Sweden, pp. 151–158 (1994) 119. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Pattern Recognition (Proc. DAGM), Heidelberg, Germany, pp. 214–223 (2007) 120. Zach, C., Gallup, D., Frahm, J.M.: Fast gain-adaptive KLT tracking on the GPU. In: Online-Proc. International Conference on Computer Vision and Pattern Recognition, Anchorage, AK, June 2008 121. Zhang, Y., Kambhamettu, C.: On 3d scene flow and structure estimation. In: Proc. International Conference on Computer Vision and Pattern Recognition, Kauai Marriott, Hawaii, pp. 778–785 (2001)
122. Zhang, G., Jia, J., Xiong, W., Wong, T.T., Heng, P.A., Bao, H.: Moving object extraction with a hand-held camera. In: Online-Proc. International Conference on Computer Vision, Rio de Janeiro, Brazil, October 2007
Index
Symbols
  6D-Vision, 92

A
  Aperture problem, 7

C
  Census flow, 8
  Color images, 6

D
  Data term
    multiple terms, 22
    optimization, 18
    single term, 21
    two terms, 22
  Dense-6D, 94

E
  Epipolar constraint, 87

F
  FlowCut, 82

G
  Graph cut, 82
  Gray value consistency, 6

H
  Horn Schunck method, 11

I
  Image
    derivative, 31
    function, 6
    pyramid, 10
    warping, 31
  IMO, 85

K
  Kalman filter, 81, 92
    6D-Vision, 92
    Dense-6D, 94
    Variational-6D, 95
  Karush–Kuhn–Tucker conditions, 21
  KKT conditions, 103

L
  Lagged feedback, 19
  Lucas Kanade method, 10

O
  Optical flow
    color scheme, 6
    optical flow constraint, 9
    optical flow equation, 9

R
  Residual images, 35
  ROF denoising, 14

S
  Smoothness term
    anisotropic diffusion, 28
    Felsberg denoising, 28
    median filter, 25
    structure-adaptive smoothing, 28
    TV-L1 denoising, 26
    TV-L2 denoising, 27
  Stair-casing effect, 43
  Strictly convex approximation, 14
  Structure–texture decomposition, 35

T
  Texture image
    blended version, 37
  Toy example
    data term, 24
    denoising, 29
    pyramids, 32
    warping, 32

V
  Variances
    pixel-wise, 88
  Variational-6D, 95