Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
7082
Daniel Cremers Marcus Magnor Martin R. Oswald Lihi Zelnik-Manor (Eds.)
Video Processing and Computational Video International Seminar Dagstuhl Castle, Germany, October 10-15, 2010 Revised Papers
Volume Editors

Daniel Cremers
Technische Universität München, Germany
E-mail: [email protected]

Marcus Magnor
Technische Universität Braunschweig, Germany
E-mail: [email protected]

Martin R. Oswald
Technische Universität München, Germany
E-mail: [email protected]

Lihi Zelnik-Manor
The Technion, Israel Institute of Technology, Haifa, Israel
E-mail: [email protected]
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-24869-6 e-ISBN 978-3-642-24870-2 DOI 10.1007/978-3-642-24870-2 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011938798 CR Subject Classification (1998): I.4, I.2.10, I.5.4-5, F.2.2, I.3.5 LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
With the swift development of video imaging technology and the drastic improvements in CPU speed and memory, both video processing and computational video are becoming more and more popular. Similar to the digital revolution in photography of fifteen years ago, today digital methods are revolutionizing the way television and movies are being made. With the advent of professional digital movie cameras, digital projector technology for movie theaters, and 3D movies, the movie and television production pipeline is turning all-digital, opening up numerous new opportunities for the way dynamic scenes are acquired, video footage can be edited, and visual media may be experienced. This book provides a compilation of selected articles resulting from a workshop on “Video Processing and Computational Video”, held at Dagstuhl Castle, Germany, in October 2010. During this workshop, 43 researchers from all over the world discussed the state of the art, contemporary challenges, and future research in imaging, processing, analyzing, modeling, and rendering of real-world, dynamic scenes. The seminar was organized into 11 sessions of presentations, discussions, and special-topic meetings. The seminar brought together junior and senior researchers from computer vision, computer graphics, and image communication, both from academia and industry, to address the challenges in computational video. For five days, workshop participants discussed the impact of, as well as the opportunities arising from, digital video acquisition, processing, representation, and display. Over the course of the seminar, the participants addressed contemporary challenges in digital TV and movie production; pointed at new opportunities in an all-digital production pipeline; discussed novel ways to acquire, represent, and experience dynamic content; accrued a wish-list for future video equipment; proposed new ways to interact with visual content; and debated possible future mass-market applications for computational video. Viable research areas in computational video identified during the seminar included motion capture of faces, non-rigid surfaces, and entire performances; reconstruction and modeling of non-rigid objects; acquisition of scene illumination; time-of-flight cameras; motion field and segmentation estimation for video editing; as well as free-viewpoint navigation and video-based rendering. With regard to technological challenges, seminar participants agreed that the “rolling shutter” effect of CMOS-based video imagers currently poses a serious problem for existing computer vision algorithms. It is expected, however, that this problem will be overcome by future video imaging technology. Another item on the seminar participants’ wish list for future camera hardware concerned high frame-rate acquisition to enable more robust motion field estimation or time-multiplexed acquisition. Finally, it was expected that plenoptic cameras will hit
the commercial market within the next few years, allowing for advanced post-processing features such as variable depth-of-field, stereopsis, or motion parallax. The papers presented in these post-workshop proceedings were carefully selected through a blind peer-review process with three independent reviewers for each paper. We are grateful to the people at Dagstuhl Castle for supporting this seminar. We thank all participants for their talks and contributions to discussions and all authors who contributed to this book. Moreover, we thank all reviewers for their elaborate assessment and constructive criticism, which helped to further improve the quality of the presented articles. August 2011
Daniel Cremers Marcus Magnor Martin R. Oswald Lihi Zelnik-Manor
Table of Contents
Video Processing and Computational Video

Towards Plenoptic Raumzeit Reconstruction ................................... 1
   Martin Eisemann, Felix Klose, and Marcus Magnor

Two Algorithms for Motion Estimation from Alternate Exposure Images ....... 25
   Anita Sellent, Martin Eisemann, and Marcus Magnor

Understanding What We Cannot See: Automatic Analysis of 4D Digital
In-Line Holographic Microscopy Data ....................................... 52
   Laura Leal-Taixé, Matthias Heydt, Axel Rosenhahn, and Bodo Rosenhahn

3D Reconstruction and Video-Based Rendering of Casually Captured Videos ... 77
   Aparna Taneja, Luca Ballan, Jens Puwein, Gabriel J. Brostow,
   and Marc Pollefeys

Silhouette-Based Variational Methods for Single View Reconstruction ...... 104
   Eno Töppe, Martin R. Oswald, Daniel Cremers, and Carsten Rother

Single Image Blind Deconvolution with Higher-Order Texture Statistics .... 124
   Manuel Martinello and Paolo Favaro

Compressive Rendering of Multidimensional Scenes ......................... 152
   Pradeep Sen, Soheil Darabi, and Lei Xiao

Efficient Rendering of Light Field Images ................................ 184
   Daniel Jung and Reinhard Koch

Author Index ............................................................. 213
Towards Plenoptic Raumzeit Reconstruction

Martin Eisemann, Felix Klose, and Marcus Magnor

Computer Graphics Lab, TU Braunschweig, Germany
Abstract. The goal of image-based rendering is to evoke a visceral sense of presence in a scene using only photographs or videos. A huge variety of approaches has been developed during the last decade. Examining the underlying models, we find three main categories: view interpolation based on geometry proxies, pure image interpolation techniques, and complete scene flow reconstruction. In this paper we present three approaches for free-viewpoint video, one for each of these categories, and discuss their individual benefits and drawbacks. We hope that studying the different approaches will help others in making important design decisions when planning a free-viewpoint video system. Keywords: Free-Viewpoint Video, Image-Based Rendering, Dynamic Scene Reconstruction.
1 Introduction
As humans we perceive most of our surroundings through our eyes; visual stimuli affect all of our senses, drive emotions, arouse memories, and much more. That is one of the reasons why we like to look at pictures. A major revolution occurred with the introduction of moving images, or videos. The dimension of time was suddenly added, which gave incredible freedom to film and movie makers to tell their story to the audience. With more powerful hardware and clever algorithms we are now able to add a new dimension to videos, namely the third spatial dimension. This gives users or producers the possibility to change the camera viewpoint on the fly. But there is a difference between the spatial dimension and time. While free-viewpoint video allows the viewpoint to be changed not only to positions captured by the input cameras but also to any other position in-between, the dimension of time is usually only captured at discrete time steps, determined by the recording framerate of the input cameras. For a complete scene flow representation, not only space but also time needs to be reconstructed faithfully. In this paper we present three different approaches for free-viewpoint video and space-time interpolation. After reviewing related work in the next section, we continue with our Floating Textures [1] in Section 3 as an example of high-quality free-viewpoint video with discrete time steps. For each discrete time step a geometry of the scene is reconstructed and textured by multiview projective texture mapping; as this process is error-prone, we present a warping-based refinement to correct for the resulting artifacts. In Section 4 we will describe
the transition from discrete to continuous space-time interpolation: by discarding the geometry and working on image correspondences alone, we can create perceptually plausible image interpolations [2,3,4]. As purely image-based approaches place restrictions on the viewing position, we introduce an algorithm towards complete scene flow reconstruction in Section 5 [5]. All three approaches have their benefits and drawbacks, and the choice should always be based on the requirements of the application at hand.
2 Related Work
In a slightly simplified version, the plenoptic function P(x, y, z, θ, φ, t) describes radiance as a function of 3-D position in space (x, y, z), direction (θ, φ) and time t [6]. With sufficiently many input views, a direct reconstruction of this function is possible. Initially developed for static scenes, light field rendering [7] is possibly the purest and closest variant of direct resampling. Light field rendering can be directly extended to incorporate discrete [8,9] or even continuous time steps [10]. To cover a larger range of viewing angles at acceptable image quality, however, a large number of densely packed images is necessary [11,12,13]. By employing a prefiltering step the number of necessary samples can be reduced, but at the cost of more blurry output images [14,15,16]. Wider camera spacings require more sophisticated interpolation techniques. One possibility is the incorporation of a geometry proxy: given the input images, a 3D representation for each discrete time step is reconstructed and used for depth-guided resampling of the plenoptic function [17,18,19,20,21,22]. For restricted scene setups the incorporation of template models proves beneficial [23,24,25]. Only few approaches reconstruct a temporally consistent mesh, which allows for continuous time interpolation [26,27]. To deal with insufficient reconstruction accuracy, Aganj et al. [28] and Takai et al. [29] deform the input images, or both the mesh and the input images, respectively, to diminish rendering artifacts. Unfortunately, none of these approaches allows for real-time rendering without a time-intensive preprocessing phase. Instead of reconstructing a geometry proxy, purely image-based interpolation techniques rely only on image correspondences. If complete correspondences between image pixels can be established, accurate image warping becomes possible [30]. Mark et al. [31] follow the seminal approach of Chen et al. [30] but also handle occlusion and discontinuities during rendering. While useful to speed up rendering performance, their approaches are only applicable to synthetic scenes. Beier et al. [32] propose a manually guided line-based warping method to interpolate between two images, known from its use in Michael Jackson’s music video “Black or White”. A physically valid view synthesis by image interpolation is proposed by Seitz et al. [33,34]. For very similar images, optical flow techniques have proven useful [35,36]. Highly precise approaches exist which can be computed at real-time or (almost) interactive rates [37,38,39]. Einarsson et al. [40] created a complete acquisition system, the so-called Light Stage 6, for acquiring and relighting human locomotion. Due to the high number of images acquired, they could directly incorporate
optical flow techniques to create virtual camera views in a light field renderer by directly warping the input images. Correspondence estimation is only one part of an image-based renderer; the image interpolation itself is another critical part. Fitzgibbon et al. [41] use image-based priors, i.e., they enforce similarity to the input images, to remove any ghosting artifacts. Drawbacks are very long computation times, and the input images must be relatively similar in order to achieve good results. Mahajan et al. [42] proposed a method for plausible image interpolation that searches for the optimal path of a pixel transitioning from one image to the other in the gradient domain. As each output pixel in the interpolated view is taken from only a single source image, ghosting or blurring artifacts are avoided, but if wrong correspondences are estimated, unaesthetic deformations may occur. Linz et al. [43] extend the approach of Mahajan et al. [42] to space-time interpolation with multi-image interpolation based on graph cuts and symmetric optical flow. In the unstructured video rendering of Ballan et al. [44], the static background of a scene is reconstructed directly, while the actor in the foreground is projected onto a billboard, and the view switches between the cameras at the specific point where the transition is least visible.
3 Floating Textures
Image-based rendering (IBR) systems using a geometry proxy have the benefit of free camera movement for each reconstructed time step. The drawback is that any reconstruction method with an insufficient number of input images is imprecise. While this may be no big problem when looking at the mesh alone, it becomes rather obvious when the mesh is to be textured again. The challenge is therefore to generate a perceptually plausible rendering from only a sparse camera setup and a possibly imperfect geometry proxy. Commonly in IBR the full bidirectional reflectance distribution function, i.e., how a point on a surface appears depending on viewpoint and lighting, is approximated by projective texture mapping [45] and image blending. Typically the blending factors are based on the angular deviation of the view vector to capture view-dependent effects. If too few input images are available or the geometry is too imprecise, ghosting artifacts will appear as the projected textures do not match on the surface. In this section we assume that a set of input images, the corresponding (possibly imprecise) calibration data, and a geometry proxy are given. The task is to texture this proxy without noticeable artifacts, hiding the imprecision of the underlying geometry. Without occlusion, any novel viewpoint can, in theory, be rendered directly from the input images by warping, i.e., by simply deforming the images, so that the following property holds:

$$I_j = W_{I_i \to I_j} \circ I_i, \qquad (1)$$
where WIi →Ij ◦ Ii warps an image Ii towards Ij according to the warp field WIi →Ij . The problem of determining the warp field WIi →Ij between two images
Fig. 1. Rendering with Floating Textures [1]. The input photos are projected from camera positions Ci onto the approximate geometry GA and onto the desired image plane of viewpoint V. The resulting intermediate images Ivi exhibit mismatch, which is compensated by warping all Ivi based on the optical flow to obtain the final image IvFloat.
Ii, Ij is a heavily researched area in computer graphics and vision. If pixel distances between corresponding image features are not too large, algorithms to robustly estimate per-pixel optical flow are available [37,46]. The issue here is that in most cases these distances will be too large. In order to relax the correspondence-finding problem, the problem can literally be projected into another space, namely the output image domain. By first projecting the photographs from cameras Ci onto the approximate geometry surface GA and rendering the scene from the desired viewpoint Cv, creating the intermediate images Ivi, the corresponding image features are brought much closer together than they were in the original input images, Figure 1. This opens up the possibility of applying optical flow estimation to the intermediate images Ivi to robustly determine the pairwise flow fields WIvi→Ivj. To accommodate more than two input images, a linear combination of the flow fields according to (3) can be applied to all intermediate images Ivi, which can then be blended together to obtain the final rendering result IvFloat. To reduce computational cost, instead of establishing (n − 1)n flow fields for n input photos, it often suffices to consider only the 3 closest input images to the current viewpoint. If more than 3 input images are needed, the quadratic effort can be reduced to linear complexity by using intermediate results. The processing steps are summarized in the following functions and visualized in Figure 1:

$$I_v^{\mathrm{Float}} = \sum_{i=1}^{n} \left( W_{I_i^v} \circ I_i^v \right) \omega_i \qquad (2)$$

$$W_{I_i^v} = \sum_{j=1}^{n} \omega_j\, W_{I_i^v \to I_j^v} \qquad (3)$$
$W_{I_i^v}$ is the combined flow field which is used for warping image $I_i^v$. The weight map $\omega_i$ contains the weights of $I_i^v$ for each output pixel, with $\sum_{i=1}^{n} \omega_i = 1$, based on the camera position [47].
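To make the two equations concrete, the sketch below combines pairwise flow fields into the per-image field of Eq. (3) and blends the warped projections as in Eq. (2). It is a minimal CPU sketch using NumPy and OpenCV, not the GPU implementation of [1]; the function names (`float_texture`, `backward_warp`), the choice of Farnebäck flow, and the backward-sampling approximation of the forward warp are our own illustrative assumptions.

```python
import cv2
import numpy as np

def backward_warp(img, displacement):
    """Approximate warping of img by a dense displacement field (H x W x 2).

    The paper warps forward on the GPU; sampling at x - d(x) is a common
    first-order backward approximation of the same deformation.
    """
    h, w = displacement.shape[:2]
    grid = np.dstack(np.meshgrid(np.arange(w), np.arange(h))).astype(np.float32)
    return cv2.remap(img, grid - displacement, None, cv2.INTER_LINEAR)

def float_texture(intermediate, weights):
    """Combine the projected views I_i^v into the output image (Eqs. 2 and 3).

    intermediate: list of projected input images I_i^v (uint8 BGR, same size)
    weights:      list of per-pixel weight maps omega_i (float32, summing to 1)
    """
    n = len(intermediate)
    gray = [cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) for im in intermediate]
    out = np.zeros(intermediate[0].shape, np.float32)
    for i in range(n):
        combined = np.zeros(gray[0].shape + (2,), np.float32)
        for j in range(n):
            if i == j:
                continue
            # pairwise optical flow between the projected views (Eq. 3)
            flow_ij = cv2.calcOpticalFlowFarneback(gray[i], gray[j], None,
                                                   0.5, 3, 15, 3, 5, 1.2, 0)
            combined += weights[j][..., None] * flow_ij
        # Eq. (2): warp each projected view by its combined field and blend
        out += weights[i][..., None] * backward_warp(
            intermediate[i].astype(np.float32), combined)
    return np.clip(out, 0, 255).astype(np.uint8)
```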
3.1 Soft Visibility
Up to now, only occlusion-free situations can be handled precisely, which is seldom the case in real-world scenarios. Simple projection of imprecisely calibrated photos onto an approximate 3-D geometry model typically causes unsatisfactory results in the vicinity of occlusion boundaries, Figure 2(a): texture information from occluding parts of the mesh projects incorrectly onto other geometry parts. With respect to Floating Textures, this not only affects rendering quality but also the reliability of the flow field estimation.
Fig. 2. (a) Projection errors occur if occlusion is ignored. (b) Optical flow estimation goes astray if occluded image regions are not properly filled. (c) Final result after texture projection using a weight map with binary visibility. (d) Final result after texture projection using a weight map with soft visibility. Note that most visible seams and false projections have been effectively removed.
A common approach to handle the occlusion problem is to establish a binary visibility map for each camera, multiply it with the weight map ωi, and normalize the weights afterwards so that they sum up to one. In theory, this discards occluded pixels in the input cameras for texture generation. One drawback of such an approach is that the underlying geometry must be assumed to be precise and the cameras precisely calibrated. In the presence of coarse geometry, the use of such binary visibility maps can create occlusion boundary artifacts at pixels where the value of the visibility map suddenly changes, Figure 2(c). To counter these effects, a “soft” visibility map Ω for the current viewpoint and every input camera can be generated using a distance filter on the binary map:

$$\Omega(x, y) =
\begin{cases}
0 & \text{if the scene point is occluded}\\[2pt]
\dfrac{\mathrm{occDist}(x, y)}{r} & \text{if } \mathrm{occDist}(x, y) \le r\\[2pt]
1 & \text{else}
\end{cases}
\qquad (4)$$

Here r is a user-defined radius, and occDist(x, y) is the distance to the next occluded pixel. If Ω is multiplied with the weight map ω, (4) makes sure that occluded regions stay occluded, while hard edges in the final weight map are
removed. Using this “soft” visibility map, the above-mentioned occlusion artifacts effectively disappear, Figure 2(d). To improve optical flow estimation, occluded areas in the projected input images Ivi need to be filled with the corresponding color values from the camera whose weight ωi is highest for this pixel, as this camera most probably provides the correct color. Otherwise, the erroneously projected parts could seriously influence the result of the Floating Texture output, as wrong correspondences could be established, Figure 2(b). Applying the described filling procedure noticeably improves the quality of the flow calculation, Figure 2(d).
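As a concrete illustration of Eq. (4), the soft visibility map can be obtained from a binary visibility map with a Euclidean distance transform. The sketch below is a CPU approximation of the two-pass fragment shader described in the next section; the function name soft_visibility and the use of SciPy are our own assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def soft_visibility(visible, r):
    """Soft visibility map (Eq. 4) from a binary visibility map.

    visible: boolean array, True where the scene point is visible in the camera
    r:       user-defined falloff radius in pixels
    """
    # Distance of every pixel to the nearest occluded pixel.
    occ_dist = distance_transform_edt(visible)
    omega = np.clip(occ_dist / r, 0.0, 1.0)   # ramp from 0 to 1 within radius r
    omega[~visible] = 0.0                     # occluded regions stay occluded
    return omega

# The soft map multiplies the blending weights, which are then renormalized
# per pixel: w_i = omega_i * weight_i; w_i /= sum_j w_j (guarding against zero).
```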
3.2 GPU Implementation
The non-linear optimization before the blending step, namely the optical flow calculation, is computationally very intensive and cannot be precomputed, since it depends on the chosen viewpoint. Therefore, for immediate feedback it is important to compute the whole rendering part on-the-fly. The geometry representation can be of almost arbitrary type, e.g., a triangle mesh, a voxel representation, or a depth map (even though correct occlusion handling with a single depth map is not always possible due to the 2.5D scene representation). First, given a novel viewpoint, the closest camera positions are queried. For sparse camera arrangements, typically the two or three closest input images are chosen. The geometry model is rendered from these cameras’ viewpoints into separate depth buffers. The resulting depth maps are then used to establish, for each camera, a binary visibility map for the current viewpoint. These visibility maps are used as input to the soft visibility computation, which can be efficiently implemented in a two-pass fragment shader. Next, a weight map is established by calculating the camera weights per output pixel, based on the Unstructured Lumigraph weighting scheme [47]. The final camera weights for each pixel in the output image are obtained by multiplying the weight map with the visibility map and normalizing the result. To create the input images for the flow field calculation, the geometry proxy is rendered from the desired viewpoint into multiple render targets, projecting each input photo onto the geometry. If the weight for a specific camera is 0 at a pixel, the color from the input camera with the highest weight at this position is used instead. To compute the optical flow between two images, efficient GPU implementations are needed [1,46]. Even though this processing step is computationally expensive and takes approximately 90% of the rendering time, interactive to real-time speeds are possible with modern GPUs. Once all needed computations have been carried out, the results can be combined in a final render pass, which warps and blends the projected images according to the weight map and flow fields. The benefits of the Floating Textures approach are best visible in the images in Figure 3, where a comparison of different image-based rendering approaches is given.
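The per-pixel camera weights follow the angular-deviation idea mentioned above. The following sketch is a simplified per-point version in the spirit of the Unstructured Lumigraph weighting [47], not the exact penalty formulation of that paper; all names and the falloff choice are illustrative.

```python
import numpy as np

def angular_blend_weights(point, view_pos, cam_positions, k=3):
    """Simplified angle-based blending weights: cameras whose direction towards
    the surface point deviates least from the desired viewing direction get the
    largest weight; only the k closest cameras contribute, and the weight falls
    to zero at the k-th one so that camera transitions stay smooth."""
    p = np.asarray(point, dtype=float)
    v = np.asarray(view_pos, dtype=float) - p
    v /= np.linalg.norm(v)
    ang = []
    for c in cam_positions:
        d = np.asarray(c, dtype=float) - p
        d /= np.linalg.norm(d)
        ang.append(np.arccos(np.clip(np.dot(v, d), -1.0, 1.0)))
    ang = np.asarray(ang)
    order = np.argsort(ang)
    k = min(k, len(ang))
    thresh = ang[order[k - 1]] + 1e-9          # angle of the k-th closest camera
    w = np.zeros(len(ang))
    w[order[:k]] = np.maximum(0.0, 1.0 - ang[order[:k]] / thresh)
    total = w.sum()
    return w / total if total > 0 else w
```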
4 View and Time Interpolation in Image Space
Up to now we considered the case where at least an approximate geometry could be reconstructed. In some cases, however, it is beneficial not to reconstruct any geometry at all but instead to work solely in image space. 3-D reconstruction poses several constraints on the acquisition setup. First of all, many methods only reconstruct foreground objects, which can be easily segmented from the rest of the image. Second, the scene to be reconstructed must either be static or the recording cameras must be synchronized so that frames are captured at exactly the same time instant; otherwise reconstruction will fail for fast-moving parts. Even though it is possible to trigger synchronized capturing for modern state-of-the-art cameras, it still poses a problem in outdoor environments or for moving cameras, due to the amount of cables and connectors. Third, if automatic reconstruction fails, laborious modelling by hand might be necessary. Additionally, sometimes even this approach seems infeasible due to fine, complicated structures in the image, e.g., hair. Working directly in image space can solve or at least ease most of the aforementioned problems, as the task is transformed from a 3-D reconstruction problem into a 2-D correspondence problem. If perfect correspondences are found between every pixel of two or more images, morphing techniques can create the impression of a real moving camera for the human observer, and time and space can be treated equally in a common framework. While this enforces some constraints, e.g., limiting the possible camera movement to the camera hull, it also opens up new possibilities: easier acquisition and rendering of more complex scenes. Because a perceptually plausible motion is interpreted as a physically correct motion by a human observer, we can rely on the capabilities of the human visual system to interpret the visual input correctly. It is thus sufficient to focus on the aspects that are important to human motion perception in order to solve the interpolation problem.
4.1 Image Morphing and Spatial Transformations
Image morphing aims at creating smooth transitions between pairs or arbitrary numbers of images. For simplicity of explanation we first examine the case of two images. The basic procedure is to warp, i.e., to deform, the input images I1 and I2 towards each other depending on some warp functions WI1→I2, WI2→I1 and a time step α ∈ [0, 1], such that in the best case αWI1→I2 ◦ I1 = (1 − α)WI2→I1 ◦ I2 and vice versa. This optimal warp function can usually only be approximated, so to achieve more convincing results when warping image I1 towards I2, one usually also computes the corresponding warp from I2 towards I1 and blends the results together. The mathematical formulation is the same as in Eq. 2, only ω is now a global parameter for the whole image.
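A minimal sketch of this two-image morph is given below, assuming the two dense warp fields have already been estimated (e.g., with an optical flow method). The helper names are ours, and the backward-sampling step is a common first-order approximation of the forward warp used later in Section 4.4.

```python
import cv2
import numpy as np

def warp_towards(img, flow, fraction):
    """Backward approximation of moving img a given fraction along its flow:
    the output pixel y samples img at y - fraction * flow(y)."""
    h, w = flow.shape[:2]
    grid = np.dstack(np.meshgrid(np.arange(w), np.arange(h))).astype(np.float32)
    return cv2.remap(img, grid - fraction * flow, None, cv2.INTER_LINEAR)

def morph_two_images(img1, img2, flow_1to2, flow_2to1, alpha):
    """Interpolate between img1 and img2 at alpha in [0, 1]: both images are
    warped part of the way towards each other and then cross-dissolved,
    mirroring Eq. (2) with a single global weight."""
    a = float(alpha)
    w1 = warp_towards(img1.astype(np.float32), flow_1to2, a)        # img1 moves alpha of the way
    w2 = warp_towards(img2.astype(np.float32), flow_2to1, 1.0 - a)  # img2 moves the remaining way
    return ((1.0 - a) * w1 + a * w2).astype(img1.dtype)
```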
4.2 Image Deformation Model for Time and View Interpolation
Analyzing properties of the human visual system shows that it is sensitive to three main aspects [49,50,51,52]. These are:
1. Edges
2. Coherent motion for parts belonging to the same object
3. Motion discontinuities at object borders

It is therefore important to pay special attention to these aspects for high-quality interpolation.

Fig. 3. Comparison of different texturing schemes in conjunction with a number of image-based modelling and rendering (IBMR) approaches. From left to right: Ground truth image (where available), bandlimited reconstruction [11], Filtered Blending [16], Unstructured Lumigraph Rendering [47], and Floating Textures. The different IBMR methods are (from top to bottom): Synthetic data set, Polyhedral Visual Hull Rendering [48], Free-Viewpoint Video [23], SurfCap [19], and Light Field Rendering [7].
Observing our surroundings, we notice that objects in the real world are seldom completely flat, although many man-made objects are nearly so. However, they can be approximated quite well by flat structures, like planes or triangles, as long as these are small enough. Usually this limit is given by the amount of detail the eye can actually perceive; in computer graphics it is usually set by the screen resolution. If it is assumed that the world consists of such planes, then the relation between two projections of such a 3-D plane can be directly described via a homography in image space. Such homographies for example describe the relation between a 3-D plane seen from two different cameras, the 3-D rigid motion of a plane between two points in time seen from a single camera, or a combination of both. Thus, the interpolation between images depicting a dynamic 3-D plane can be achieved by a per-pixel deformation according to the homography directly in image space, without the need to reconstruct the underlying 3-D plane, motion and camera parameters explicitly. Only the assumption that natural images can be decomposed into regions for which the deformation of each element is sufficiently well described by a homography has to be made, which is surprisingly often the case. Stich et al. [2,3] introduced translets, which are spatially restricted homographies. A translet is therefore described by a 3 × 3 matrix H and a corresponding image segment. To obtain a dense deformation, they enforce that the set of all translets is a complete partitioning of the image and thus each pixel is part of exactly one translet; an example can be seen in Figure 4.
Fig. 4. An image (left) and its decomposition into its homogeneous regions (middle). Since the transformation estimation is based on the matched edgelets, only superpixels that contain actual edgelets (right) are of interest.
The first step in estimating the parameters of the deformation model is to find a set of point correspondences between the images from which the translet transformations can be derived. This may sound contradictory, as we stated earlier that this is the overall goal. However, at this stage we are not yet interested in a complete correspondence field for every pixel. Rather, we are looking for a subset for which the transformation can be more reliably established and which already conveys most of the important information concerning the apparent motion in the image. As it turns out, classic point features such as edges and corners, which have a long history of research in computer vision, are best suited for this task. This is in accordance with human vision, which measures edge and corner features early on.
Using the Compass operator [53], a set of edge pixels, called edgelets, is obtained in both images. The task is now to find for each edgelet in image I1 a corresponding edgelet in image I2, and this matching should be as complete a 1-1 matching as possible. This problem can be posed as a maximum weighted bipartite graph matching problem; in other words, one does not simply assign the best match to each edgelet, but instead minimizes an energy function to find the best overall solution. Therefore, descriptors for each edgelet need to be established. The shape context descriptor [54] has been shown to perform very well at capturing the spatial context Cshape of edgelets and is robust against the expected deformations. To reduce computational effort and increase robustness of the matching process, only the k nearest neighbor edgelets are considered as potential matches for each edgelet. Also, one can assume that edgelets will not move from one end of image I1 to the other in image I2, as considerable overlap is always needed to establish a reliable matching. Therefore, an additional distance term Cdist can be added. One prerequisite for the reformulation is that for each edgelet in the first set a match in the second set exists, otherwise completeness cannot be achieved. While this is true for most edgelets, some will not have a correspondence in the other set due to occlusion or small instabilities of the edge detector at faint edges. However, this is easily addressed by inserting a virtual occluder edgelet for each edgelet in the first edgelet set. Each edge pixel of the first image is connected by a weighted edge to its possibly corresponding edge pixels in the second image and additionally to its virtual occluder edgelet. The weight or cost function for edgelet ei in I1 and ej in I2 is then defined as

$$C(e_i, e_j) = C_{\mathrm{dist}}(e_i, e_j) + C_{\mathrm{shape}}(e_i, e_j) \qquad (5)$$

where the cost for the shape is the χ2-test between the two shape contexts and the cost for the distance is defined as

$$C_{\mathrm{dist}}(e_i, e_j) = \frac{a}{1 + e^{-b\,\lVert e_i - e_j \rVert}} \qquad (6)$$
with a, b > 0 such that the maximal cost for the Euclidean distance is limited by a. The cost Coccluded to assign an edgelet to its occluder edgelet is user-defined and controls how aggressively the algorithm tries to find a match with an edgelet of the second image. The lower Coccluded, the more conservative the resulting matching will be, as more edges will be matched to their virtual occluder edgelets. Now that the first reliable matches have been found, this information can be used to find good homographies for the translets of both images. But first the spatial support for these translets needs to be established, i.e., the image needs to be segmented into coherent, disjoint regions. From Gestalt theory [55] it is known that for natural scenes these regions share not only a common motion but in general also other properties such as similar color and texture. Superpixel segmentation [56] can be exploited to find an initial partitioning of the image into regions that become translets, based on neighboring pixel similarities. Then, from the matching between the edge pixels of the input images, local homographies for each set of edge pixels in the source image that are within one
superpixel are robustly estimated using 4-point correspondences and RANSAC [57]. Usually between 20% and 40% of the computed matches are still outliers, and thus some translets will have wrongly estimated transformations. Using a greedy iterative approach, the most similarly transformed neighboring translets are merged into one, as depicted in Figure 5, until the ratio of outliers to inliers is lower than a user-defined threshold. When two translets are merged, the resulting translet contains both edgelet sets and has the combined spatial support. The homographies are re-estimated based on the new edgelet set, and the influence of the outliers is again reduced by the RANSAC filtering.
Fig. 5. During optimization, similarly transformed neighboring translets are merged into a single translet. After merging, the resulting translet consists of the combined spatial support of both initial translets (light and dark blue) and their edgelets (light and dark red).
Basically, in this last step a transformation for each pixel in the input images towards the other image has been established. Assuming linear motion only, the deformation vector d(x) for a pixel x is thus computed as

$$d(x) = H_t\, x - x. \qquad (7)$$
Ht is the homography matrix of the translet t, with x being part of the spatial support of t. However, when only a part of a translet boundary lies on a true motion discontinuity, incorrect discontinuities still produce noticeable artifacts along the rest of the boundary. Imagine, for example, the motion of an arm in front of the body: it is discontinuous along the silhouette of the arm, while the motion at the shoulder changes continuously. We can then resolve the per-pixel smoothing by an anisotropic diffusion [58] on this vector field using the diffusion equation

$$\partial I / \partial t = \mathrm{div}\!\left( g\!\left( \min(|\nabla d|, |\nabla I|) \right) \nabla I \right) \qquad (8)$$

which depends on the image gradient ∇I and the gradient of the deformation vector field ∇d. The function g is a simple mapping function as defined in [58]. Thus, the deformation vector field is smoothed in regions that have similar color or similar deformation, while discontinuities that are present in both the color image and the vector field are preserved. During the anisotropic diffusion, edgelets that have an inlier match, meaning they deviate only slightly from the planar model, are considered as boundary conditions of the diffusion process. This results in exact edge transformations, handles also non-linear deformations for each translet, and significantly improves the achieved quality.
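The per-pixel deformation of Eq. (7) can be evaluated directly from a translet labeling and the estimated homographies, as in the following sketch; the subsequent anisotropic diffusion of Eq. (8) is omitted here, and the data layout (a label map plus a dictionary of 3 × 3 matrices) is an assumption of ours.

```python
import numpy as np

def translet_deformation_field(labels, homographies):
    """Dense deformation field d(x) = H_t x - x (Eq. 7).

    labels:       H x W integer map assigning every pixel to a translet t
    homographies: dict mapping translet id t -> 3x3 homography matrix H_t
    Returns an H x W x 2 displacement field (before anisotropic smoothing).
    """
    h, w = labels.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # homogeneous coords
    d = np.zeros((h, w, 2), np.float32)
    for t, H in homographies.items():
        mask = labels == t
        p = pts[mask] @ H.T                  # apply H_t to all pixels of translet t
        p = p[:, :2] / p[:, 2:3]             # dehomogenize
        d[mask] = p - np.stack([xs[mask], ys[mask]], axis=-1)
    return d
```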
4.3 Optimizing the Image Deformation Model
There are three ways to further optimize the image deformation model from the previous section: using motion priors, using coarse-to-fine translet estimation, and using a scale-space hierarchy. Since the matching energy function (Eq. 5) is based on spatial proximity and local geometric similarity, a motion prior can be introduced by pre-warping the edgelets with a given deformation field. The estimated dense correspondences described above can be used as such a prior, so the algorithm described in Section 4.2 can be iterated using the result of the i-th iteration as input to the (i + 1)-th iteration. To overcome local matching minima, a coarse-to-fine iterative approach on the translets can be applied. In the first iteration, the number of translets is reduced until the coarsest possible deformation model with only one translet is obtained; the underlying motion is thus approximated by a single perspective transformation. During consecutive iterations, the threshold is decreased to allow for more accurate deformations as the number of final translets increases. Additionally, solving on different image resolutions similar to scale-space [59] further improves robustness: a first matching solution is found on the coarse-resolution images and is then propagated to higher resolutions. Using previous solutions as motion priors significantly reduces the risk of getting stuck in local matching minima. In rare cases, some scenes still cannot be matched automatically sufficiently well. For example, when similar structures appear multiple times in the images, the matching can become ambiguous and can only be resolved by high-level reasoning. In this case, a fallback to manual intervention is necessary: regions can be selected in both images by the user, and the automatic matching is computed again only for the selected subset of edgelets. Due to this restriction of the matching, the correct match is found and used to correct the solution.
4.4 Rendering
Given the pixel-wise displacements from the previous sections, the rendering can be efficiently implemented on graphics hardware to allow for real-time image interpolation. Two problems arise with simple warping at motion discontinuities: fold-overs and missing regions. Fold-overs occur when two or more pixels in the image end up in the same position during warping. This is the case when the foreground occludes parts of the background. Consistent with motion parallax, it is assumed that the pixel moving faster in x-direction is closer to the viewpoint, which resolves this conflict. When, on the other hand, regions become disoccluded during warping, the information for these regions is missing and must be filled in from the other image. During rendering we place a regular triangle mesh over the image plane and use a connectedness criterion to decide whether a triangle should be drawn or not. If the motion of a vertex differs by more than a given threshold from the neighboring vertices, the triangles spanned between these vertices are simply discarded. An example is given in Figure 6.
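The connectedness test can be sketched as follows for a pixel-aligned warping mesh; in the actual renderer this is a per-vertex test on the GPU, and the cell-based formulation and names below are our own simplification.

```python
import numpy as np

def triangle_connectedness(flow, threshold):
    """Decide which cells of a pixel-aligned warping mesh to keep.

    flow:      H x W x 2 per-vertex displacement field (one vertex per pixel)
    threshold: maximum allowed motion difference between neighboring vertices
    Returns an (H-1) x (W-1) boolean mask; False marks cells whose triangles
    would stretch across a motion discontinuity and are therefore discarded.
    """
    # Motion differences to the right and bottom neighbors of each vertex.
    dx = np.linalg.norm(flow[:, 1:] - flow[:, :-1], axis=-1)   # H x (W-1)
    dy = np.linalg.norm(flow[1:, :] - flow[:-1, :], axis=-1)   # (H-1) x W
    # A mesh cell is connected only if all four of its edges move coherently.
    connected = ((dx[:-1, :] < threshold) & (dx[1:, :] < threshold) &
                 (dy[:, :-1] < threshold) & (dy[:, 1:] < threshold))
    return connected
```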
Fig. 6. Left: Per-vertex mesh deformation is used to compute the forward warping of the image, where each pixel corresponds to a vertex in the mesh. The depicted mesh is at a coarser resolution for visualization purposes. Right: The connectedness of each pixel that is used during blending to avoid a possibly incorrect influence of missing regions.
As opposed to camera recordings, rendered pixels at motion boundaries are no longer a mixture of background and foreground color but are either pure foreground or pure background. In a second rendering pass, the color mixing of foreground and background at boundaries can be modelled using a small selective lowpass filter applied only to the detected motion boundary pixels. This effectively removes the artifacts with minimal impact on rendering speed and without affecting rendering quality in the non-discontinuous regions. As can be seen in Table 1, the proposed algorithm produces high-quality results, e.g., on the Middlebury examples [60].
Table 1. Interpolation, normalized interpolation, and angular errors computed on the Middlebury optical flow examples by comparison to ground truth, with results obtained by our method and by other methods taken from Baker et al. [60]

Venus            Int.   Norm. Int.   Ang.
Stich et al.     2.88   0.55         16.24
Pyramid LK       3.67   0.64         14.61
Bruhn et al.     3.73   0.63          8.73
Black et al.     3.93   0.64          7.64
Mediaplayer      4.54   0.74         15.48
Zitnick et al.   5.33   0.76         11.42

Dimetrodon       Int.   Norm. Int.   Ang.
Stich et al.     1.78   0.62         26.36
Pyramid LK       2.49   0.62         10.27
Bruhn et al.     2.59   0.63         10.99
Black et al.     2.56   0.62          9.26
Mediaplayer      2.68   0.63         15.82
Zitnick et al.   3.06   0.67         30.10
The results have been obtained without user interaction. The approach performs best in terms of the interpolation error and is best or on par in terms of the normalized interpolation error. It is important to point out that from a perception point of view the normalized error is less expressive than the unnormalized error, since discrepancies at edges in the image (i.e., large gradients) are dampened. Interestingly, relatively large angular errors are observed with the presented method, emphasizing that the requirements of optical flow estimation and image interpolation are different.
5 Plenoptic Raumzeit Reconstruction
The two methods proposed so far are different approaches towards a semi-complete reconstruction of the plenoptic function; however, neither allows for the full degrees of freedom in space and time. In this section we present a complete space-time reconstruction. The idea is to represent the 4D space-time by a 6D scene representation: each point in the scene is characterized not only by a 3D position but also by a 3D velocity vector, assuming linear motion between two discrete time steps. As we treated space and time similarly in the last section, we treat time and scene motion similarly here.
5.1 Overview
We assume that the input video streams show multiple views of the same scene. Since we aim to reconstruct a geometric model, we expect the scene to consist of opaque objects with mostly diffuse reflective properties, and the camera parameters to be given, e.g., by [61]. In a preprocessing step, the sub-frame time offsets between the cameras are determined automatically [62,63]. Our scene model represents the scene geometry as a set of small tangent plane patches. The goal is to reconstruct a tangent patch for every point of the entire visible surface. Each patch is described by its position, normal, and velocity vector. The presented algorithm processes one image group at a time, which consists of images chosen by their respective temporal and spatial parameters. The image group timespan is the time interval ranging from the acquisition of the first image of the group to the time the last selected image was recorded. The result of our processing pipeline is a patch cloud. While it is unordered in scene space, we store for each pixel in the input cameras a list of patches intersected by the corresponding viewing ray. A visualization of our quasi-dense scene reconstruction is shown in Fig. 7.
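As a rough illustration of this scene representation, a patch could be stored as follows. The Python dataclass and field names are our own and only mirror the quantities introduced in this and the following subsections (position, normal, velocity, reference image, and the visibility sets V(P) and V^t(P)).

```python
from dataclasses import dataclass, field
from typing import Set
import numpy as np

@dataclass
class Patch:
    """Small tangent-plane patch of the dynamic scene (names are illustrative).

    Position and normal describe the tangent plane at the patch center at the
    reference time; velocity models linear motion within the image group.
    """
    center: np.ndarray          # 3D position c
    normal: np.ndarray          # 3D surface normal
    velocity: np.ndarray        # 3D velocity v (position at time t is c + t*v)
    ref_image: int              # index of the reference image
    visible: Set[int] = field(default_factory=set)        # V(P): possibly visible
    truly_visible: Set[int] = field(default_factory=set)  # V^t(P), subset of V(P)

    def position_at(self, t: float) -> np.ndarray:
        return self.center + t * self.velocity
```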
Fig. 7. Visualization of reconstructed scenes. The patches are textured according to their reference image. Motion is visualized by red arrows.
5.2 Image Selection and Processing Order
To reconstruct the scene for a given time t, a group of images is selected from the input images. The image group G contains three consecutive images from each camera, where the second image is the one taken closest to t by that camera.
The acquisition time of an image from camera Cj is given by

$$\mathrm{time}(I_j) = c_{\mathrm{offset}} + \frac{m}{c_{\mathrm{fps}}}$$

where c_offset is the camera time offset, c_fps the camera framerate, and m the frame number. During the initialization step of the algorithm the processing order of the images is important, and it is favorable to use the center images first. For camera setups where the cameras roughly point at a common scene center, the following heuristic is used to sort the image group in ascending order:

$$s(I_j) = \sum_{I_i \in G} |C_j - C_i| \qquad (9)$$

where Ci is the position of the camera that acquired image Ii. When at least one camera is static, s(Ij) can evaluate to identical values for different images; images with identical values s(Ij) are then ordered by the distance of their acquisition time from t.
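A small sketch of the image-group ordering follows: the acquisition time and the heuristic of Eq. (9) with its tie-breaking rule. The dictionary-based image records are an assumption made purely for illustration.

```python
import numpy as np

def acquisition_time(c_offset, c_fps, frame_number):
    """time(I_j) = c_offset + frame_number / c_fps"""
    return c_offset + frame_number / c_fps

def sort_image_group(images, t_star):
    """Order an image group for initialization (Eq. 9).

    images: list of dicts with keys 'cam_pos' (3D camera center) and 'time'
            (acquisition time); this record layout is illustrative.
    Center-most cameras come first; ties are broken by temporal distance to t_star.
    """
    def key(img):
        s = sum(np.linalg.norm(img['cam_pos'] - other['cam_pos']) for other in images)
        return (s, abs(img['time'] - t_star))
    return sorted(images, key=key)
```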
5.3 Initialization
To reconstruct an initial set of patches, it is necessary to find pixel correspondences within the image group. Because the cameras are assumed to be asynchronous, no epipolar geometry constraints can be used to reduce the search region for the pixel correspondence search. We compute a list of interest points for each image Ii ∈ G using Harris’s corner detector. The intention is to select points which can be identified across multiple images. A local maximum suppression is performed, i.e., only the strongest response within a small radius is kept. Every interest point is then described by a SURF [64] descriptor. In the following, an interest point together with its descriptor is referred to as a feature. For each image Ii, every feature f extracted from that image is serially processed and potentially transformed into a space-time patch. A given feature f0 is matched against all features of all Ii ∈ G. The best match for each image is added to a candidate set C. To find a robust subset of C without outliers, a RANSAC-based method is used: First, a set S of m features is randomly sampled from C. Then the currently processed feature f0 is added to the set S. The value of |S| = m + 1 can be varied depending on the input data; for all our experiments we chose m = 6. The sampled features in S are assumed to be evidence of a single surface. Using the constraints from the feature positions and camera parameters and assuming a linear motion model, a center position c and a velocity v are calculated. The details of the geometric reconstruction are given below. The vectors c and v represent the first two parameters of a new patch P. The next RANSAC step is to determine which features from the original candidate set C consent to the reconstructed patch P. The patch center is reprojected into the images Ii ∈ G, and the distance from the projected position to the feature position in Ii is evaluated. After multiple RANSAC iterations the largest set
T ⊂ C of consenting features found is selected. For further robustness, additional features f ∈ C \ T are added to T if the patch reconstructed from T ∪ {f} is consenting on all features in T. After the enrichment of T, the final set of consenting features is used to calculate the position and velocity of the patch P. As reference image, the image the original feature was taken from is chosen. The surface orientation of P is coarsely approximated by the vector pointing from c to the camera center of the reference image. When the patch has been fully initialized, it is added to the initial patch set. After all image features have been processed, the initial patch generation is optimized and filtered once before the expand and filter iterations start.

Geometric Patch Reconstruction. Input to the geometric patch reconstruction is a list of corresponding pixel positions in multiple images, combined with the temporal and spatial positions of the cameras. The result is a patch center c and velocity v. Assuming a linear movement of the scene point, its position x(t) at time t is specified by a line

$$x(t) = c + t \cdot v. \qquad (10)$$

To determine c and v, a linear equation system is formulated. The line of movement, Eq. 10, must intersect the viewing rays q_i that originate from the camera center C_i and are cast through the image plane at the pixel position where the patch was observed in image I_i at time t_i = time(I_i):

$$
\begin{pmatrix}
\mathrm{Id}_{3\times 3} & \mathrm{Id}_{3\times 3} \cdot t_0 & -q_0 & & 0\\
\vdots & \vdots & & \ddots & \\
\mathrm{Id}_{3\times 3} & \mathrm{Id}_{3\times 3} \cdot t_i & 0 & & -q_i
\end{pmatrix}
\cdot
\begin{pmatrix}
c^{T}\\ v^{T}\\ a_0\\ \vdots\\ a_j
\end{pmatrix}
=
\begin{pmatrix}
C_0^{T}\\ \vdots\\ C_i^{T}
\end{pmatrix}
\qquad (11)
$$
The variables a0 to aj give the scene depth with respect to the camera centers C0 to Cj and are not needed further. The overdetermined linear system is solved with an SVD solver.

Patch Visibility Model. There are two sets of visibilities associated with every patch P: the set of images V(P) where P might be visible, and the set of images V^t(P) ⊆ V(P) where P is considered truly visible. The two different sets exist to deal with specular highlights or not yet reconstructed occluders. During the initialization process the visibilities are determined by thresholding a normalized cross correlation. Let ν(P, I) be the normalized cross correlation calculated from the reference image of P to the image I within the patch region; then V(P) = {I | ν(P, I) > α} and V^t(P) = {I | ν(P, I) > β} are determined. The threshold parameters used in all our experiments are α = 0.45 and β = 0.8. The correlation function ν takes the patch normal into account when determining the correlation windows. In order to have an efficient lookup structure for patches later on, we overlay a grid of cells over every image. In every grid cell, all patches are listed that, when
projected to the image plane, fall into the given cell and are considered possibly or truly visible in the given image. The size of the grid cells λ determines the final resolution of our scene reconstruction, as only one truly visible patch is computed per cell in every image. After the initialization and during the algorithm iterations, the visibility of P is estimated by a depth comparison: all images for which P is closer to the camera than the currently closest patch are added to V(P). The images I ∈ V^t(P), where the patch is considered truly visible, are determined using a similar thresholding as before, V^t(P) = {I | I ∈ V(P) ∧ ν(P, I) > β}. The β in this comparison is lowered with increasing expansion iteration count to cover poorly textured regions.

Fig. 8. Computing cross correlation of moving patches. (a) A patch P is described by its position c, orientation, recording time tr and its reference image Ir. (b) Positions of sampling points are obtained by casting rays through the image plane (red) of Ir and intersecting with plane P. (c) According to the difference in recording times (t − tr) and the motion v of the patch, the sampling points are translated before they are projected back to the image plane of I. Cross correlation is computed using the obtained coordinates in image space of I.

5.4 Expansion Phase
The initial set of patches is usually very sparse. To incrementally cover the entire visible surface, the existing patches are expanded along the object surfaces. The expansion algorithm processes each patch of the current generation. In order to verify whether a given patch P should be expanded, all images I ∈ V^t(P) in which P is truly visible are considered. Given the patch P and a single image I, the patch is projected into the image plane and the surrounding grid cells are inspected. If a cell is found in which no truly visible patch exists yet, a surface expansion of P to this cell is calculated: a viewing ray is cast through the center of the empty cell and intersected with the plane defined by the patch’s position at time(I) and its normal. The intersection point is the center position of the newly created patch P'. The velocity and normal of the new patch are initialized with the values from
the source patch P. At this stage, P' is compared to all other patches listed in its grid cell and is discarded if another similar patch is found. To determine whether two patches are similar in a given image, their positions x0, x1 and normals n0, n1 are used to evaluate the inequality

$$(x_0 - x_1) \cdot n_0 + (x_1 - x_0) \cdot n_1 < \kappa. \qquad (12)$$
The comparison value κ is calculated from the pixel displacement of λ pixels in image I and corresponds to the depth displacement which can arise within one grid cell. We usually start with λ ≥ 2 and approach λ = 1 in successive iterations of the algorithm. If the inequality holds, the two patches are similar. Patches that are not discarded are processed further. The reference image of the new patch P' is set to be the image I in which the empty grid cell was found. The visibility of P' is estimated by a depth comparison as described in Section 5.3. Because the presence of outliers may result in a too conservative estimate of V(P'), the visibility information from the original patch is added, V(P') = V(P') ∪ V(P), before calculating V^t(P'). After the new patch has been fully initialized, it is handed to the optimization process. Finally, the new patch is accepted into the current patch generation if |V^t(P')| ≥ φ. The least number of images required to accept a patch depends on the camera setup and image type. With increasing φ, less surface can be covered with patches from the outer cameras, since each surface has to be observed multiple times; choosing φ too small may result in unreliable reconstruction results.
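A minimal sketch of the similarity test of Eq. (12) is given below; the function name and the caller-supplied κ are illustrative.

```python
import numpy as np

def patches_similar(x0, n0, x1, n1, kappa):
    """Similarity test of Eq. (12): two patches are considered to represent the
    same surface if their mutual out-of-plane offsets are small relative to kappa."""
    d = np.asarray(x0, float) - np.asarray(x1, float)
    return float(np.dot(d, n0) + np.dot(-d, n1)) < kappa
```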
5.5 Patch Optimization
The patch parameters calculated from the initial reconstruction or the expansion are the starting point for a conjugate-gradient-based optimization. The function maximized is a visibility score ρ of the patch. To determine the visibility score, a normalized cross correlation ν(P, I) is calculated from the reference image of P to all images I ∈ V(P) where P is expected to be visible:

$$\rho(P) = \frac{1}{|V(P)| + a \cdot |V^t(P)|} \left( \sum_{I \in V(P)} \nu(P, I) \; + \sum_{I \in V^t(P)} a \cdot \nu(P, I) \right) \qquad (13)$$
The weighting factor a accounts for the fact that images from V^t(P) are considered reliable information, while images from V(P) \ V^t(P) might not actually show the scene point corresponding to P. The visibility score ρ(P) is then maximized with a conjugate gradient method. To constrain the optimization, the position of P is not varied in three dimensions but along a single dimension representing the depth of P in the reference image; the variation of the normal is specified by two rotation angles; and the velocity is left as a three-dimensional vector.
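Eq. (13) translates into a few lines once the normalized cross correlation is treated as a black box. The sketch below reuses the visibility sets of the patch structure sketched in Section 5.1; the value of the weighting factor a is illustrative, as it is not specified here.

```python
def patch_score(patch, ncc, a=2.0):
    """Visibility score rho(P) of Eq. (13).

    ncc(patch, image_id) is assumed to return the normalized cross correlation
    between the patch's reference image and the given image; the weighting
    factor a (value illustrative) emphasizes the truly visible images V^t(P).
    """
    v, vt = patch.visible, patch.truly_visible
    if not v:
        return 0.0
    total = sum(ncc(patch, i) for i in v) + sum(a * ncc(patch, i) for i in vt)
    return total / (len(v) + a * len(vt))
```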
5.6 Filtering
After the expansion step the set of surface patches possibly contains visual inconsistencies. These inconsistencies can be put in three groups. The outliers outside
the surface, outliers that lie inside the actual surface, and patches that do not satisfy a regularization criterion. Three distinct filters are used to eliminate the different types of inconsistencies. The first filter deals with outliers outside the surface. To detect an outlier, a support value s and a doubt value d are computed for each patch P. The support is the patch score of Eq. (13) multiplied by the number of images in which P is truly visible, s = ρ(P) · |V^t(P)|. Summing the scores of all patches P' that are occluded by P gives a measure of the visual inconsistency introduced by P; this is the doubt d. If the doubt outweighs the support, d > s, the patch is considered an outlier and removed. Patches lying inside the surface will be occluded by the patches representing the real surface; therefore the visibilities of all patches are recalculated as described in Section 5.3. Afterwards, all patches that are not visible in at least φ images are discarded as outliers. The regularization is done with the help of the patch similarity defined in Eq. (12). In the images where a patch P is visible, all surrounding patches are evaluated. The quotient of the number c' of patches similar to P in relation to the total number c of surrounding patches is the regularization criterion: c'/c < z. The threshold for this quotient of similarly aligned patches was z = 0.25 in all our experiments. Fig. 9 shows some results of the presented approach for both a synthetic and a real-world scenario. The most important characteristics of the scene, like continuous depth changes on the floor plane or walls as well as depth discontinuities, are well preserved, as can be seen in the depth maps.
Fig. 9. (a) Input views, (b) quasi-dense depth reconstruction and (c) optical flow to the next frame. For the synthetic windmill scene, high-quality results are obtained. When applied to a more challenging real-world scene, the results are still robust and accurate. The conservative filtering prevents the expansion to ambiguous regions. E.g., most pixels in the asphalt region in the skateboarder scene are not recovered.
Fig. 10. Reconstruction results from the Middlebury MVS evaluation datasets. (a) Input views. (b) Closed meshes from reconstructed patch clouds. (c) Textured patches. While allowing the reconstruction of all six degrees of freedom (including 3D motion), the dynamic patch reconstruction approach still reconstructs the static geometry faithfully.
Fig. 9 shows results of the presented approach for both a synthetic and a real-world scenario. The most important characteristics of the scene, such as continuous depth changes on the floor plane or walls as well as depth discontinuities, are well preserved, as can be seen in the depth maps. The small irregularities where no patch was created stem from the conservative filtering step. Plausible motion is recovered even for smoothly changing areas such as the wings of the windmill, where the motion decreases towards the center. To demonstrate the static reconstruction capabilities, results obtained from the Middlebury "ring" datasets [65] are shown in Fig. 10. Poisson surface reconstruction [66] was used to create the closed meshes. The static object is retrieved although no prior knowledge about the dynamics of the scene was given, i.e., all six degrees of freedom for reconstruction were used.
6 Discussion and Conclusion
In this paper we presented three different approaches towards complete free-viewpoint video. Our first method, Floating Textures [1], deals with the weaknesses of the classic technique of reconstructing a proxy geometry for discrete timesteps and retexturing it with the input images. To deal with geometric uncertainties, camera calibration errors and visibility problems, Floating Textures proposes to warp the input images into the output domain and to reweight the warping and blending parameters based on their angular and occlusion distances. This approach is especially appealing for real-time scenarios, such as sports events, where only fast and therefore imprecise 3D reconstruction methods can be applied. Our second approach is based on a perceptually plausible image interpolation method [2,3,4]. Though sacrificing full viewpoint control, as the
camera is restricted to move on the camera manifold, this approach has the benefit that space and time can be treated equally, as there is no difference between the two in the image correspondence estimation, and high-quality results are possible. If the complete degrees of freedom for space and time interpolation are needed, our third approach is the method of choice [5]. The scene flow plus dynamic surface reconstruction, or, as we would like to call it, plenoptic Raumzeit, is described by small geometric patches representing the scene geometry and corresponding velocity vectors representing the movement over time. The additional degrees of freedom, however, come at the cost of lower rendering quality. Improving the renderings is a fruitful direction for further research. To conclude, image-based rendering systems vary largely in the way they approach the problem of image interpolation for free-viewpoint video, and the decision on which approach to base an application needs to be made early in the development process. Do you have synchronized cameras? Are discrete time steps enough or do you need a continuous representation of time? Do you need full view control or is a restricted movement enough? We hope that the knowledge provided in this paper helps to make the important design decisions necessary when building a free-viewpoint video system. Acknowledgements. We would like to thank Jonathan Starck for providing us with the SurfCap test data (www.ee.surrey.ac.uk/CVSSP/VMRG/surfcap.htm) and the Stanford Computer Graphics lab for the buddha light field data set. The authors gratefully acknowledge funding by the German Science Foundation under project DFG MA2555/4-2.
References

1. Eisemann, M., De Decker, B., Magnor, M., Bekaert, P., de Aguiar, E., Ahmed, N., Theobalt, C., Sellent, A.: Floating Textures. Computer Graphics Forum 27, 409–418 (2008)
2. Stich, T., Linz, C., Wallraven, C., Cunningham, D., Magnor, M.: Perception-motivated Interpolation of Image Sequences. In: Symposium on Applied Perception in Graphics and Visualization, pp. 97–106 (2008)
3. Stich, T., Linz, C., Albuquerque, G., Magnor, M.: View and Time Interpolation in Image Space. Computer Graphics Forum 27, 1781–1787 (2008)
4. Stich, T.: Space-Time Interpolation Techniques. PhD thesis, Computer Graphics Lab, TU Braunschweig, Germany (2009)
5. Klose, F., Lipski, C., Magnor, M.: Reconstructing Shape and Motion from Asynchronous Cameras. In: Proceedings of Vision, Modeling, and Visualization (VMV 2010), Siegen, Germany, pp. 171–177 (2010)
6. Adelson, E.H., Bergen, J.R.: The Plenoptic Function and the Elements of Early Vision. In: Landy, M., Movshon, J.A. (eds.) Computational Models of Visual Processing, pp. 3–20 (1991)
7. Levoy, M., Hanrahan, P.: Light Field Rendering. In: SIGGRAPH, pp. 31–42 (1996)
8. Fujii, T., Tanimoto, M.: Free viewpoint TV system based on ray-space representation. In: SPIE, vol. 4864, pp. 175–189 (2002)
9. Matusik, W., Pfister, H.: 3D TV: A scalable system for real-time acquisition, transmission, and autostereoscopic display of dynamic scenes. ACM Transactions on Graphics 23, 814–824 (2004) 10. Wilburn, B., Joshi, N., Vaish, V., Talvala, E.V., Antunez, E., Barth, A., Adams, A., Horowitz, M., Levoy, M.: High performance imaging using large camera arrays. ACM Transactions on Graphics 24, 765–776 (2005) 11. Chai, J.X., Chan, S.C., Shum, H.Y., Tong, X.: Plenoptic Sampling. In: SIGGRAPH, pp. 307–318 (2000) 12. Lin, Z., Shum, H.Y.: On the Number of Samples Needed in Light Field Rendering with constant-depth assumption. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 579–588 (2000) 13. Lin, Z., Shum, H.Y.: A Geometric Analysis of Light Field Rendering. International Journal of Computer Vision 58, 121–138 (2004) 14. Stewart, J., Yu, J., Gortler, S.J., McMillan, L.: A New Reconstruction Filter for Undersampled Light Fields. In: Eurographics Workshop on Rendering, pp. 150–156 (2003) 15. Zwicker, M., Matusik, W., Durand, F., Pfister, H.: Antialiasing for Automultiscopic 3D Displays. In: Eurographics Symposium on Rendering, pp. 107–114 (2006) 16. Eisemann, M., Sellent, A., Magnor, M.: Filtered Blending: A new, minimal Reconstruction Filter for Ghosting-Free Projective Texturing with Multiple Images. In: Vision, Modeling, and Visualization, pp. 119–126 (2007) 17. Gortler, S.J., Grzeszczuk, R., Szeliski, R., Cohen, M.F.: The Lumigraph. In: SIGGRAPH, pp. 43–54 (1996) 18. Debevec, P.E., Taylor, C.J., Malik, J.: Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-Based Approach. In: SIGGRAPH, pp. 11–20 (1996) 19. Starck, J., Hilton, A.: Surface capture for performance based animation. IEEE Computer Graphics and Applications 27, 21–31 (2007) 20. Hornung, A., Kobbelt, L.: Interactive pixel-accurate free viewpoint rendering from images with silhouette aware sampling. Computer Graphics Forum 28, 2090–2103 (2009) 21. Goldluecke, B., Cremers, D.: A superresolution framework for high-accuracy multiview reconstruction. In: Proc. DAGM Pattern Recognition, Jena, Germany (2009) 22. Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R.: Towards Internet-scale multiview stereo. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1434–1441 (2010) 23. Carranza, J., Theobalt, C., Magnor, M., Seidel, H.P.: Free-viewpoint video of human actors. ACM Transaction on Graphics 22, 569–577 (2003) 24. de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. ACM Transactions on Graphics 27(3), 1–10 (2008) 25. Hasler, N., Stoll, C., Sunkel, M., Rosenhahn, B., Seidel, H.P.: A statistical model of human pose and body shape. Computer Graphics Forum 28 (2009) 26. Goldluecke, B., Magnor, M.: Weighted Minimal Hypersurfaces and Their Applications in Computer Vision. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3022, pp. 366–378. Springer, Heidelberg (2004) 27. Vedula, S., Baker, S., Kanade, T.: Image based spatio-temporal modeling and view interpolation of dynamic events. ACM Transactions on Graphics 24, 240–261 (2005) 28. Aganj, E., Monasse, P., Keriven, R.: Multi-view texturing of imprecise mesh. In: Asian Conference on Computer Vision, pp. 468–476 (2009)
29. Takai, T., Hilton, A., Matsuyama, T.: Harmonized Texture Mapping. In: International Symposium on 3D Data Processing, Visualization and Transmission, pp. 1–8 (2010) 30. Chen, S.E., Williams, L.: View Interpolation for Image Synthesis. In: SIGGRAPH, pp. 279–288 (1993) 31. Mark, W., McMillan, L., Bishop, G.: Post-Rendering 3D Warping. In: Symposium on Interactive 3D Graphics, pp. 7–16 (1997) 32. Beier, T., Neely, S.: Feature-based Image Metamorphosis. In: SIGGRAPH, pp. 35–42 (1992) 33. Seitz, S., Dyer, C.: Physically-valid view synthesis by image interplation. In: IEEE Workshop on Representation of Visual Scenes, pp. 18–26 (1995) 34. Seitz, S., Dyer, C.: View Morphing. In: SIGGRAPH, pp. 21–30 (1996) 35. Horn, B., Schunck, B.: Determining Optical Flow. Artificial Intelligence 16, 185– 203 (1981) 36. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 37. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004) 38. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime tv-l1 optical flow. In: Hamprecht, F.A., Schn¨ orr, C., J¨ ahne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007) 39. Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., Bischof, H.: Anisotropic Huber-L1 optical flow. In: British Machine Vision Conference (2009) 40. Einarsson, P., Chabert, C.F., Jones, A., Ma, W.C., Lamond, B., Hawkins, T., Bolas, M., Sylwan, S., Debevec, P.: Relighting Human Locomotion with Flowed Reflectance Fields. In: Eurographics Symposium on Rendering, pp. 183–194 (2006) 41. Fitzgibbon, A., Wexler, Y., Zisserman, A.: Image-based rendering using imagebased priors. International Journal of Computer Vision 63, 141–151 (2005) 42. Mahajan, D., Huang, F.C., Matusik, W., Ramamoorthi, R., Belhumeur, P.: Moving Gradients: A Path-Based Method for Plausible Image Interpolation. ACM Transactions on Graphics 28, 1–10 (2009) 43. Linz, C., Lipski, C., Magnor, M.: Multi-image Interpolation based on Graph-Cuts and Symmetric Optic Flow. In: Vision, Modeling and Visualization, pp. 115–122 (2010) 44. Ballan, L., Brostow, G.J., Puwein, J., Pollefeys, M.: Unstructured video-based rendering: Interactive exploration of casually captured videos. ACM Transactions on Graphics, 1–11 (2010) 45. Segal, M., Korobkin, C., van Widenfelt, R., Foran, J., Haeberli, P.: Fast Shadows and Lighting Effects using Texture Mapping. Computer Graphics 26, 249–252 (1992) 46. Pock, T., Urschler, M., Zach, C., Beichel, R., Bischof, H.: A Duality Based Algorithm for TV-L1-Optical-Flow Image Registration. In: International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 511–518 (2007) 47. Buehler, C., Bosse, M., McMillan, L., Gortler, S., Cohen, M.: Unstructured Lumigraph Rendering. In: Conference on Computer Graphics and Interactive Techniques, SIGGRAPH, pp. 425–432 (2001) 48. Franco, J.S., Boyer, E.: Exact polyhedral visual hulls. In: British Machine Vision Conference, pp. 329–338 (2003)
¨ 49. Wallach, H.: Uber visuell wahrgenommene Bewegungsrichtung. Psychologische Forschung 20, 325–380 (1935) 50. Reichardt, W.: Autocorrelation, A principle for the evaluation of sensory information by the central nervous system. In: Rosenblith, W. (ed.) Sensory Communication, pp. 303–317. MIT Press-Willey, New York (1961) 51. Qian, N., Andersen, R.: A physiological model for motion-stereo integration and a unified explanation of Pulfrich-like phenomena. Vision Research 37, 1683–1698 (1997) 52. Heeger, D., Boynton, G., Demb, J., Seidemann, E., Newsome, W.: Motion opponency in visual cortex. Journal of Neuroscience 19, 7162–7174 (1999) 53. Ruzon, M., Tomasi, C.: Color Edge Detection with the Compass Operator. In: Conference on Computer Vision and Pattern Recognition, pp. 160–166 (1999) 54. Belongie, S., Malik, J., Puzicha, J.: Matching Shapes. In: International Conference on Computer Vision, pp. 454–461 (2001) 55. Wertheimer, M.: Laws of organization in perceptual forms. In: Ellis, W. (ed.) A Source Book of Gestalt Psychology, pp. 71–88. Trubner & Co. Ltd., Kegan Paul (1938) 56. Felzenszwalb, P., Huttenlocher, D.: Efficient Graph-Based Image Segmentation. International Journal of Computer Vision 59, 167–181 (2004) 57. Fischler, M., Bolles, R.: Random Sample Consensus. A Paradigm for Model Fitting With Applications to Image Analysis and Automated Cartography. Communications of the ACM 24, 381–395 (1981) 58. Perona, P., Malik, J.: Scale-Space and Edge Detection using Anisotropic Diffusion. Transactions on Pattern Analysis and Machine Intelligence 12, 629–639 (1990) 59. Yuille, A.L., Poggio, T.A.: Scaling theorems for zero crossings. Transactions on Pattern Analyis and Machine Intelligence 8, 15–25 (1986) 60. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., Szeliski, R.: A Database and Evaluation Methodology for Optical Flow. In: International Conference on Computer Vision, pp. 1–8 (2007) 61. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3d. ACM Transactions on Graphics 25, 835–846 (2006) 62. Meyer, B., Stich, T., Magnor, M., Pollefeys, M.: Subframe Temporal Alignment of Non-Stationary Cameras. In: British Machine Vision Conference (2008) 63. Hasler, N., Rosenhahn, B., Thorm¨ ahlen, T., Wand, M., Gall, J., Seidel, H.P.: Markerless motion capture with unsynchronized moving cameras. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 224–231 (2009) 64. Bay, H., Ess, A., Tuytelaars, T., Gool, L.V.: Surf: Speeded up robust features. Computer Vision and Image Understanding 110, 346–359 (2008) 65. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 519–528 (2006) 66. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Proceedings of the Fourth Eurographics Symposium on Geometry Processing, pp. 61–70 (2006)
Two Algorithms for Motion Estimation from Alternate Exposure Images
Anita Sellent, Martin Eisemann, and Marcus Magnor
Institut für Computergraphik, TU Braunschweig, Germany
Abstract. Most algorithms for dense 2D motion estimation assume pairs of images that are acquired with an idealized, infinitely short exposure time. In this work we compare two approaches that use an additional, motion-blurred image of a scene to estimate highly accurate, dense correspondence fields. We consider video sequences that are acquired with alternating exposure times, so that a short-exposure image is followed by a long-exposure image that exhibits motion blur. For both motion estimation algorithms we employ an image formation model that relates the motion-blurred image to the two enframing short-exposure images. With this model we can not only decipher the motion information encoded in the long-exposure image but also estimate occlusion timings, which are a prerequisite for artifact-free frame interpolation. The first approach solves for the motion in a pointwise least squares formulation, while the second formulates a global, total variation regularized problem. Both approaches are evaluated in detail and compared to each other and to state-of-the-art motion estimation algorithms. Keywords: motion estimation, motion blur, total variation.
1 Introduction
Estimating the dense motion field between two consecutive images has been a heavily investigated field of computer vision research for decades [1, 2]. To approximate the actual 2D motion field, typically the optical flow between consecutive video frames is estimated. If regarded individually, however, short-exposure images capture no motion information at all. Instead, traditional optical flow methods reconstruct motion indirectly by motion-modeling the image difference. Sampling theoretic considerations show that this approach is prone to temporal aliasing if the maximum 2D displacement in the image plane exceeds one pixel [3]. To prevent aliasing, multi-scale optical flow methods pre-filter the image globally in the image domain because the motion is a priori unknown [3]. This, however, is not the correct temporal filter: high spatial frequencies should be suppressed only in those Fourier domain regions where aliasing actually occurs, i.e., only in the direction of local motion. There exists a simple way to achieve correct temporal pre-filtering by exposing the image sensor for an extended period of time [4]. In long-exposure images
Fig. 1. Alternate exposure imaging: (a) exposure timing diagram of (b) a short-exposure image I1 followed by (c) a long-exposure image IB and (d) another short-exposure image I2.
of a moving scene, high image frequencies that can cause aliasing are suppressed only in the direction of motion. Motion estimation from motion-blurred images is often performed as one step of blind deblurring approaches [5]. In poor lighting conditions, long exposure times are necessary to obtain a reasonable signal-to-noise ratio. Motion in the scene suppresses high image frequencies in the direction of motion, which deblurring approaches then try to reconstruct by solving a severely ill-posed problem. Alternate exposure imaging combines short-exposure images that capture high-frequency content with long-exposure images that integrate the motion of scene points (Fig. 1). Apart from capturing motion information directly, long-exposure images bear the advantage that occlusion enters into the image formation process. A scene point and its motion contribute to a motion-blurred image exactly for as long as the point is not occluded. Only recently have optical flow algorithms begun to address occlusion [6, 7, 8, 9], assigning occlusion labels per pixel. The moment of occlusion, however, cannot be easily determined from short-exposure images. Two approaches to motion and occlusion estimation from alternate exposure images have been proposed in the literature [10, 11]. They are based on the same image formation model, which is equally valid for occluded and non-occluded points and incorporates occlusion time estimation. Each of these approaches adds different additional assumptions to make motion estimation computationally manageable. To compare the two approaches, we give a detailed description of the assumptions and evaluate them on synthetic as well as on real test scenes.
2 Related Work
The number of articles on optical flow computation is tremendous, which indicates the significance of the problem as well as its difficulty [12, 1, 2]. Related to our work, scale-space approaches [13] and iterative warping [14, 15] obtain reliable optical flow results in the presence of disparities larger than a few pixels. Alternatively, Lim et al. circumvent the problem by employing high-speed camera recordings [16]. None of these approaches, however, considers occlusion.
In contrast, Alvarez et al. determine occlusion masks by calculating forward and backward optical flow and checking for consistency [8]. Areas with large forward/backward optical flow discrepancies are considered occluded and are excluded from further computations. Xiao et al. propose interpolating motion into occluded areas from nearby regions by bilateral filtering [6]. This approach is refined by Sand and Teller [7] in the context of particle video. Xu et al. consider the in-flow into a target pixel as an occlusion measure [9] (see also [17]). While explicit occlusion handling is incorporated, the moment of occlusion cannot be determined. The advantages of occlusion handling and occlusion timings for image interpolation are demonstrated by Mahajan et al. [18]. Similar to the alternate exposure approach, they use a path-based image formation model. However, paths are calculated between two short-exposed images based on a discrete optimization framework, yielding only full-pixel accuracy. Motion estimation is also possible from a single, motion-blurred image. Assuming spatially invariant, constant-velocity motion, Yitzhaky and Kopeika determine the direction and extent of motion blur via autocorrelation [19]. Their approach was extended to rotational motion by Pao and Kuo [20]. Similarly, Rekleitis obtains locally constant motion by considering the Fourier spectrum of a motion-blurred image [21]. The recent user-assisted approach of Jia [22] and the fully automatic approach of Dai and Wu [23] are both able to estimate constant-velocity motion by formulating a constraint on the alpha channel of the blurred image, shifting the problem from motion estimation to the ill-posed problem of alpha-matte extraction [24]. Motion estimation from a single motion-blurred image is also part of blind image deconvolution approaches [5]. As blind deconvolution is, in general, ill-posed, these approaches are restricted to spatially invariant point spread functions (PSF) [5, 25, 26] or a locally invariant PSF [27, 28]. Other deconvolution approaches use additional images to gain information about the underlying motion as well as about the frequencies suppressed by the PSF: Tico and Vehvilainen use pairs of blurred and noisy images to determine a spatially invariant blur kernel after image registration [29]. Yuan et al. [30] and Lim and Silverstein [31] assume small offsets between the blurred and the noisy image and include them in the spatially invariant blur kernel estimation. Additionally, they use the noisy image to reduce ringing artifacts during deconvolution. The hybrid camera of Ben-Ezra and Nayar acquires a long-exposed image of the scene, while a detector with lower spatial and higher temporal resolution acquires a sequence of short-exposed images to detect the camera motion [32]. A recent extension of the hybrid camera permits the kernel to be a local mixture of predefined basis kernels, which can be handled by modern deblurring methods [33]. The deconvolution approaches of Rav-Acha and Peleg use two motion-blurred images with spatially invariant linear motion blurs in different directions to obtain improved deconvolution results [34, 35]. However, for a dynamic scene and a static camera, different motion-blur directions are hard to obtain. Therefore,
Cho et al. use two cameras for motion blur estimation that are accelerated in orthogonal directions [36]. The motion-from-smear approach of Chen et al. [37, 38] as well as the approaches of Favaro and Soatto [39] and Agrawal et al. [40] therefore employ images with different degrees of motion blur, i.e., different exposure times, and make different simplifying assumptions about the motion. These assumptions range from constant motion [37], through object-wise constant motion [38, 39], to motion computable from neighboring frames with the same exposure time [40]. Pixelwise varying motion and occlusion are not considered. In our approach, we are interested in recovering high-quality, dense motion fields that may vary from pixel to pixel and that are accurate enough to be used for a broad range of applications. In addition, we are interested in adequate motion estimates also for occluded points and in a well-founded estimate of occlusion timings.
3 Image Formation Model
In order to exploit the information provided by the additional long-exposure image, we need an image formation model that relates the acquired images via a dense 2D motion field. As input, we assume two short-exposure images I1, I2 : Ω → R, Ω ⊂ R², which are taken before and after the exposure time of a third, long-exposure input image IB : Ω → R. An image formation model that describes a motion-blurred image B : Ω → R in terms of I1 and I2 and the unknown motion was introduced in [10]. It exploits the fact that during the exposure time of the long-exposure image, a certain set of scene points contributes to the color of the motion-blurred image at any point x ∈ Ω in the image plane. Assuming that a scene point is visible either in I1 or in I2, the model

B(x, p1, p2, s) = ∫_0^{s(x)} I1(p1(x, t)) dt + ∫_0^{1−s(x)} I2(p2(x, t)) dt   (1)
is established. Here, p1(x, ·) : [0, 1] → Ω and p2(x, ·) : [0, 1] → Ω are spatially varying, planar curves on the image plane with p1(x, 0) = x and p2(x, 0) = x, and s(x) ∈ [0, 1] is the occlusion time, which is well defined only at points where occlusion actually takes place. In the case of no occlusion, any choice of s yields the same intensity B(x), and differentiating with respect to s yields

0 = I1(p1(x, s)) − I2(p2(x, 1 − s)),   (2)

a generalization of the brightness constancy assumption used in optical flow estimation. The image formation model also gives rise to the frame interpolation

I_t(x) = I1(p1(x, t))        if t ≤ s(x),
I_t(x) = I2(p2(x, 1 − t))    if t > s(x),   (3)

for intermediate frames I_t for any t ∈ [0, 1]. This interpolation formulation allows occluded and disoccluded points to be interpolated correctly without the need for explicit occlusion detection.
Motion Estimation from Alternate Exposure Images
29
In the image formation model described so far, general motion curves were used. To simplify computations and obtain a parameterization with the minimum number of unknowns, a linear motion model is adopted, so that

p1(x, t) = x − t w1(x)  and  p2(x, t) = x + t w2(x),   (4)

where wj : Ω → R², wj(x) = (wj,1(x), wj,2(x))^T for j ∈ {1, 2}. This turns out to be a suitable approximation also for more general types of motion (Sect. 6). Since for many applications a forward or backward motion field is needed, the motion curves are warped according to the estimated motion and occlusion parameters to obtain a displacement field for I1 and I2, respectively.
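To make the model concrete, the following sketch evaluates Eq. (1) by a simple Riemann sum over the linear paths of Eq. (4) and implements the interpolation rule of Eq. (3). The sampling density n and the bilinear boundary handling are illustrative choices and not taken from the original papers.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def sample(img, x, y):
    """Bilinear lookup of image values at floating-point positions (x, y)."""
    return map_coordinates(img, [y.ravel(), x.ravel()], order=1,
                           mode="nearest").reshape(x.shape)

def blur_model(I1, I2, w1, w2, s, n=20):
    """Riemann-sum approximation of the image formation model, Eq. (1), with
    the linear motion paths of Eq. (4).  I1, I2: gray-value images; w1, w2:
    per-pixel 2D paths with shape (H, W, 2); s: occlusion time in [0, 1]."""
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    B = np.zeros((h, w))
    for t in (np.arange(n) + 0.5) / n:              # midpoints of n sub-intervals
        f = sample(I1, xs - t * w1[..., 0], ys - t * w1[..., 1])
        b = sample(I2, xs + t * w2[..., 0], ys + t * w2[..., 1])
        B += ((t < s) * f + (t < 1.0 - s) * b) / n  # each sample has weight 1/n
    return B

def interpolate_frame(I1, I2, w1, w2, s, t):
    """Intermediate frame I_t according to Eq. (3): before the occlusion time a
    pixel is taken from I1 along p1, afterwards from I2 along p2."""
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    f = sample(I1, xs - t * w1[..., 0], ys - t * w1[..., 1])
    b = sample(I2, xs + (1.0 - t) * w2[..., 0], ys + (1.0 - t) * w2[..., 1])
    return np.where(t <= s, f, b)
```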
4 Least Squares Approach
The image formation model for a motion-blurred image B considered in the previous section yields a pointwise error measure for estimates of the motion paths [10]. Given two short-exposure images I1, I2 and a long-exposure image IB, i.e., the actual measurement, we can compare the blurred image IB to the result B predicted by the model (1):

e(x, w1, w2, s) = B(x, w1, w2, s) − IB(x).   (5)
In this distance measure there are five unknowns for every pixel x in the image domain. The minimization of e with respect to these variables can have several equally valid solutions; e.g., by letting s = 0 for an unoccluded point, the backward motion path w2 can be chosen arbitrarily. In the next section we give a first approach that makes the problem computationally manageable by introducing additional assumptions. Different additional assumptions, which give rise to a second approach, are introduced in Sect. 5.

4.1 Additional Assumptions
In order to reduce the number of unknowns in the energy formulation, we first consider a point that is neither occluded nor disoccluded during the exposure interval. It is reasonable to assume that motion within one object changes only slightly, so that we can approximate the forward and backward paths to be equal, w1 ≈ w ≈ w2. For a non-occluded point, all occlusion times s are equally valid, so we can additionally evaluate the integral for a fixed sequence 0 ≤ s1 < . . . < sN ≤ 1. Fixing the occlusion times not only renders the estimation of s superfluous but also provides us with N equations, each contributing to finding the correct motion path, i.e., the two remaining unknowns per pixel. If a point is occluded, forward and backward motion differ. Thus, optimization under the assumption w1 ≈ w ≈ w2 is expected to lead to a comparatively high residual. Only for points with a high residual do we assume different forward and backward motion paths. To enable computation of the occlusion time, a crucial
variable for occluded points, the assumption of locally constant motion paths is made, so that the motion information can be inferred from neighboring non-occluded pixels. Applying the above assumptions, we now consider the resulting optimization problem and its solution in more detail. An overview of the algorithm is shown in Fig. 2.
[Fig. 2 workflow. Step 1: select a sequence si ∈ [0, 1]; initialize w = 0; for each level of the image pyramid, minimize Eq. (7) and reject outliers. Step 2: mark high-residual neighborhoods; for every marked pixel, determine w1 and w2 by superpixel similarity and optimize for the occlusion time. Outputs: motion fields and frame interpolation.]
Fig. 2. The workflow of the least squares approach assumes forward and backward motion paths to be symmetric in the first step. Only in the second step is the possibility of occlusion considered for points with a high residual. With the motion paths and occlusion timings, images can be interpolated directly, or traditional motion vector fields for each pixel in the short-exposure images can be determined.
4.2 Pointwise Optimization Problem
With the assumption w1 ≈ w ≈ w2, for a fixed sequence 0 ≤ s1 < . . . < sN ≤ 1 and for each i ∈ {1, . . . , N} we consider the deviation of the measured motion-blurred image from the model value for a given motion path w ∈ R², using the differentiable squared distance

F_i(x, w) = ( IB(x) − ∫_0^{s_i} I1(x − t w) dt − ∫_0^{1−s_i} I2(x + t w) dt )².   (6)
If all assumptions hold exactly, F_i = 0 for the true motion path and for all i ∈ {1, . . . , N}. Using different values for s allows us to restrict the solution space for the symmetric motion path w. Increasing the number N of samples for s also increases the amount of computation. We keep N small, e.g. N = 5, and additionally include the differentiated version, Eq. (2), for s = 1/2 as

F_{N+1}(x, w) = ( I1(x − ½ w) − I2(x + ½ w) )²,

with F_{N+1} = 0 for the true motion path. We now try to find a w ∈ R² that minimizes the pointwise energy

E_LS(x, w) = Σ_{i=1}^{N+1} F_i(x, w).   (7)
Dennis and Schnabel [41] describe several numerical methods to solve this non-linear least squares problem. We use a model-trust region implementation of the well-known Levenberg-Marquardt algorithm because of its robustness and reasonable speed. The path integral over the images is calculated using linear interpolation for the image functions I1 and I2. The derivatives of the function F = (F1, . . . , F_{N+1})^T are determined numerically. We solve this non-linear optimization problem on a multi-scale image pyramid. In order to attenuate the impact of local noise, we iterate the optimization and smooth intermediate results by replacing motion paths that differ by more than 0.25 pixels from the motion paths of the majority of their 8 neighbors with the average motion path of that majority.
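The following sketch shows how the pointwise energy of Eq. (7) can be minimized per pixel with a Levenberg-Marquardt solver. The unsquared residuals F_1, …, F_{N+1} are handed to the solver, which squares and sums them internally; the path sampling and pixel access are illustrative simplifications, not the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.ndimage import map_coordinates

def point_sample(img, x, y):
    return map_coordinates(img, [[y], [x]], order=1, mode="nearest")[0]

def path_integral(img, x, y, wx, wy, upper, n=10):
    """Approximate int_0^upper img(x + t*w) dt along a linear path."""
    if upper <= 0.0:
        return 0.0
    ts = (np.arange(n) + 0.5) / n * upper
    vals = [point_sample(img, x + t * wx, y + t * wy) for t in ts]
    return float(np.mean(vals)) * upper

def residuals(w, x, y, I1, I2, IB, s_list):
    wx, wy = w
    res = []
    for s in s_list:                              # F_1 ... F_N (unsquared)
        model = (path_integral(I1, x, y, -wx, -wy, s)
                 + path_integral(I2, x, y, wx, wy, 1.0 - s))
        res.append(IB[int(y), int(x)] - model)
    # F_{N+1}: differentiated constraint, Eq. (2), evaluated at s = 1/2
    res.append(point_sample(I1, x - 0.5 * wx, y - 0.5 * wy)
               - point_sample(I2, x + 0.5 * wx, y + 0.5 * wy))
    return np.array(res)

def estimate_path(x, y, I1, I2, IB, N=5, w0=(0.0, 0.0)):
    s_list = np.linspace(0.0, 1.0, N)             # s_i = (i-1)/(N-1)
    fit = least_squares(residuals, w0, args=(x, y, I1, I2, IB, s_list),
                        method="lm")              # Levenberg-Marquardt
    return fit.x                                  # symmetric motion path w
```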
4.3 Occlusion Detection
In occluded regions we expect the pointwise energy E_LS in Eq. (7) to remain high, as the symmetry of the paths w1 ≈ w ≈ w2 is invalid. We therefore mark a pixel and its immediate eight neighbors as possibly occluded if E_LS exceeds a threshold T_E. Instead of setting the threshold T_E absolutely, and thus also in dependency of N, we choose a percentage of occluded points, e.g. 10%, and set T_E to the corresponding quantile T_E = Q.90 of all optimization residuals in the image. For an occluded or disoccluded pixel, two motion paths and the occlusion time are necessary to describe the gray value in the blurred image. We extrapolate forward and backward motion paths in the occluded regions from neighboring non-occluded regions. Given estimates for the motion paths, we determine the occlusion time on the basis of these estimates. Considering a possibly occluded point, we build two clusters C_a and C_b from the motion paths of probably unoccluded points in a neighborhood with a radius of r = 20 pixels. With the centers of these clusters, we obtain two motion paths. We use superpixel segmentation [42] to determine which motion path is appropriate for which image. Let S_i^x be the superpixel of I_i(x), S_i^a and S_i^b the unions of superpixels in I_i containing the pixels that contribute to C_a and C_b, respectively, and d(·, ·) the superpixel distance also defined in [42]. The superpixel of an occluded point and the superpixels containing the background motion should belong to the same object in the first short-exposure image, and thus the superpixel distance between them is expected to be small or zero. In the second image, the superpixel of the occluded point belongs to the foreground and is therefore expected to be similar or equal to the superpixels of the foreground motion in this image. More generally, if the inequality

d(S_i^x, S_i^a) + d(S_j^x, S_j^b) < d(S_i^x, S_i^b) + d(S_j^x, S_j^a)   (8)

holds for i = 1 and j = 2 or for i = 2 and j = 1, we assign the motion of C_a to w_i and that of C_b to w_j. Otherwise we deduce that the point is not occluded after all and assign the motion path with the smallest residual in Eq. (7).
Given the motion paths w1 and w2, only the occlusion time s remains to be estimated. We minimize

E_s(x, s) = ( IB(x) − ∫_0^s I1(x − t w1) dt − ∫_0^{1−s} I2(x + t w2) dt )²   (9)

by a straightforward line search algorithm as described in [43].
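For illustration, the one-dimensional minimization of Eq. (9) can be carried out with a bounded scalar search over s ∈ [0, 1]; the bounded Brent method used below is a stand-in for the line search of [43] and reuses the path_integral helper from the sketch above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def occlusion_time(x, y, I1, I2, IB, w1, w2):
    """Estimate the occlusion time s in [0, 1] for one pixel by minimizing the
    squared model error E_s of Eq. (9) with fixed motion paths w1 and w2."""
    def energy(s):
        model = (path_integral(I1, x, y, -w1[0], -w1[1], s)
                 + path_integral(I2, x, y, w2[0], w2[1], 1.0 - s))
        return (IB[int(y), int(x)] - model) ** 2

    res = minimize_scalar(energy, bounds=(0.0, 1.0), method="bounded")
    return res.x
```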
4.4 Parameter Sensitivity
The above algorithm depends on the choice of the intermediate timings s_i and of the occlusion threshold T_E. We test the parameter sensitivity on the basic test scene square (Fig. 6, first row), where the foreground translates 10 pixels horizontally and the background translates 15 pixels vertically. We evaluate the average angular error (AAE) and the average endpoint error (AEE) between the known ground-truth motion and the displacement fields obtained from the estimated motion paths [2] to measure the impact of the parameters. In the first experiment, we vary the number N of intermediate values for s while keeping all other parameters fixed, i.e., using a 6-level image pyramid, 3 iterations on each scale and an outlier threshold of 0.25 pixels. To obtain optimal cover for any length of the motion paths, we distribute the s_i equally in the interval [0, 1], i.e., s_i = (i − 1)/(N − 1) for i ∈ {1, . . . , N}.

Table 1. Increasing the number N of equidistant intermediate values for the occlusion times s also increases the computation time (3.06 GHz processor, non-optimized, pointwise MATLAB code). Fixing the threshold for occlusion detection T_E = Q.90, the smallest average angular error (AAE) and the smallest average endpoint error (AEE) are obtained for N = 5.

N           2     3     4     5     6     7     8     9     10
time [sec]  7529  7612  7621  7797  7846  7912  8065  8139  8180
AAE [°]     7.81  6.78  6.82  6.24  6.68  6.62  6.50  6.49  6.51
AEE [px]    2.28  1.90  1.85  1.73  1.82  1.81  1.77  1.74  1.76
If the number N of equidistant intermediate values for s is chosen to be 3 or larger, it has only a small influence on the resulting error (Tab. 1). Also, as the pointwise optimization implementation works with a minimum number of function evaluations, the impact of N on the total computation time is small. Apart from determining the number and the spacing of the s_i, the number N also influences the weight of the color constancy assumption in F_{N+1}. As a trade-off between the equations F_i, i ∈ {1, . . . , N}, based on the motion-blurred image and the equation F_{N+1} based on the short-exposure images, N = 5 results in the smallest angular error and the smallest endpoint error. In the next experiment, we fix N = 5 and change the number of points that are considered as occluded by setting T_E to the corresponding quantile. Considering
Table 2. Fixing the number N = 5 of intermediate values for s, the smallest average angular error (AAE) and the smallest average endpoint error (AEE) are obtained for T_E = Q.90, i.e., when considering 90% of the pixels as non-occluded.

T_E       Q.95  Q.90  Q.85  Q.80  Q.75  Q.70  Q.65  Q.60  Q.55  Q.50
AAE [°]   6.96  6.24  6.90  6.88  6.84  6.83  7.32  7.40  7.75  7.76
AEE [px]  1.96  1.73  1.82  1.80  1.78  1.78  1.91  1.92  2.02  2.04
up to 30% of the pixels as occluded has only a small impact on the AAE and AEE, Tab. 2. Figs. 3c and 3d show occlusion maps for TE = Q.90 and TE = Q.75 , respectively, using the color code in Fig. 3b. While in the first case mainly truly occluded points are assigned an occlusion time, many unoccluded points obtain an occlusion label in the second case. Their motion estimate is disabled in the superpixel comparison. Nevertheless, some occluded points are still not detected in the case TE = Q.75 . Their arbitrary motion estimate is considered in the superpixel comparison. Changing the balance from correct motion estimates to arbitrary motion estimates, an occlusion threshold that is too conservative deteriorates the quality of the motion estimation. Still the interval where the motion estimation is robust is quite large.
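For reference, the AAE and AEE used throughout the evaluation can be computed as follows. These are the standard definitions (AAE as the mean angle between the space-time vectors (u, v, 1), AEE as the mean Euclidean distance), sketched here independently of the authors' code.

```python
import numpy as np

def average_angular_error(flow, gt):
    """AAE in degrees between estimated and ground-truth flow fields of
    shape (H, W, 2), based on the space-time vectors (u, v, 1)."""
    u, v = flow[..., 0], flow[..., 1]
    gu, gv = gt[..., 0], gt[..., 1]
    num = u * gu + v * gv + 1.0
    den = np.sqrt(u**2 + v**2 + 1.0) * np.sqrt(gu**2 + gv**2 + 1.0)
    return np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0))).mean()

def average_endpoint_error(flow, gt):
    """AEE in pixels: mean Euclidean distance between flow vectors."""
    return np.sqrt(((flow - gt) ** 2).sum(axis=-1)).mean()
```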
5 Total Variation Approach
Considering the difference between a recorded motion-blurred image and the blurred image predicted by the image formation model gives a pointwise error measure for the path vectors and the occlusion time. As the solution to this problem is not unique for all image points, additional assumptions were introduced in the last section. Yet, these assumptions impose new restrictions on the motion. This section considers different, less restrictive assumptions on the motion paths by considering the similarity of path vectors and occlusion times for neighboring pixels [11].

5.1 Additional Assumptions
In natural images, spatially neighboring pixels often belong to the same real-world object and therefore exhibit similar properties such as color, texture or motion. For the underdetermined pointwise error function, Eq. (5), we can therefore look for the solution of the pointwise problem that is most similar to the solution of neighboring pixels. We can achieve this by adding a regularization term to the pointwise error functional. Regularization is a typical way to estimate solutions of under-determined problems [44] and is often applied in optical flow estimation to overcome the aperture problem [1, 2]. For image points belonging to the same object, the spatial gradient of the motion field is assumed to be small. Yet, at object boundaries, motion changes abruptly and the spatial gradient of the motion field is large. As demonstrated in previous work [45], using the total variation as a regularizer for flow fields yields promising results. While
Fig. 3. (a) The color map used to display flow fields in this work. (b) Where defined, occlusion timings are encoded with a continuous scale between green for s = 0 and red for s = 1; otherwise they are set to blue. Evaluating the scene square (Fig. 6), thresholding the optimization residual for occlusion detection considers mainly truly occluded points (c) for T_E = Q.90 but does not detect all occluded points. (d) Setting T_E = Q.75 considers also many non-occluded points as occluded but still does not detect all occluded points.
the total variation of a steep monotonic function and of a smoothly increasing monotonic function with the same endpoints is the same, the customary squared norm of the gradient punishes large deviations from a constant function much more severely than a gradual change (Fig. 4). Total variation regularization of the motion field allows piecewise constant vector fields, which is in accordance with our understanding of only slightly deforming scene objects moving with individual velocities.

5.2 Global Optimization Problem
The central part of the optimization problem is, as before, the pointwise comparison of the recorded motion-blurred image IB and the result B predicted by the image formation model. We consider the data-term with a robust penalizer φ_ε(x) = √(x² + ε), where ε = 10⁻³, i.e., we consider

G1(x, s, w1, w2) = φ_ε(B(x) − IB(x)).   (10)

Introduced to motion estimation by Black and Anandan [46], robust penalizers like φ_ε are a differentiable version of the absolute value and allow for accurate motion estimation also in the presence of outliers and deviations from the assumptions. As in Sect. 4.2, we also include the differentiated version and consider it as an additional data-term

G2(x, s, w1, w2) = φ_ε(I1(x − ½ w1) − I2(x + ½ w2)).   (11)

Integrating the weighted sum of the pointwise errors over the image domain, we obtain the data-term

E_data(s, w1, w2) = ∫_Ω G1(x, s, w1, w2) + γ G2(x, s, w1, w2) dx   (12)

with γ ≥ 0. Regularizing both path vectors as well as the occlusion time with their total variation results in the final energy functional
[Fig. 4 plots the steep function f(x) = arctan(x) and the gradually increasing linear function g(x) = 0.1 · arctan(10) · x for x ∈ [−10, 10].]

Fig. 4. The total variations of the steep function f and the continuously increasing function g are equal. The squared value of the gradient of g is much smaller than the squared value of the gradient of f. Thus total variation regularization models the assumption of object-wise smooth motion fields better than regularization with the squared value of the gradient.
E_TV(s, w1, w2) = ∫_Ω G1 + γ G2 + α Σ_{i=1}^{2} (|∇w1,i| + |∇w2,i|) + β |∇s| dx   (13)
where α, β > 0 are two free parameters of the approach. This energy functional interconnects the pointwise error measures given by G1 and G2 via the regularization terms, so that a global minimization has to be performed. The absolute value in the total variation is not differentiable, and we therefore adopt the minimization scheme presented in the next section.
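As a small sketch, the discretized energy of Eq. (13) can be evaluated for given fields as follows. Forward differences for the gradients and the reuse of blur_model and sample from the earlier image formation sketch are illustrative choices, not the authors' discretization.

```python
import numpy as np

def phi_eps(x, eps=1e-3):
    """Robust penalizer: differentiable approximation of the absolute value."""
    return np.sqrt(x * x + eps)

def tv(u):
    """Total variation of a scalar field with forward differences."""
    gx = np.diff(u, axis=1, append=u[:, -1:])
    gy = np.diff(u, axis=0, append=u[-1:, :])
    return np.sqrt(gx**2 + gy**2).sum()

def energy_tv(I1, I2, IB, w1, w2, s, alpha, beta, gamma):
    """Discrete version of Eq. (13): robust data terms G1, G2 plus TV
    regularization of both path fields and the occlusion time."""
    B = blur_model(I1, I2, w1, w2, s)               # Eq. (1), sketched earlier
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    i1 = sample(I1, xs - 0.5 * w1[..., 0], ys - 0.5 * w1[..., 1])
    i2 = sample(I2, xs + 0.5 * w2[..., 0], ys + 0.5 * w2[..., 1])
    data = phi_eps(B - IB).sum() + gamma * phi_eps(i1 - i2).sum()
    reg = alpha * sum(tv(f[..., i]) for f in (w1, w2) for i in (0, 1)) + beta * tv(s)
    return data + reg
```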
5.3 TV-L1 Minimization
Our minimization scheme is based on the primal-dual algorithm used for TV-L1 optical flow [45]. We briefly review the method here and show how we adapt the framework to minimize our more complex energy functional in the next section. For the very general case of minimizing a total variation energy of the form

E(u) = ∫_Ω λ ψ(ρ(u)) + Σ_{i=1}^{k} |∇u_i| dx,   (14)

where, for a constant λ > 0, ψ : R → R+ is a scalar function, u = u(x) = (u1, . . . , uk) a k-dimensional function on the domain Ω and ρ(u) = ρ(u, x) a pointwise error term, an auxiliary vector field v = v(x) = (v1, . . . , vk) on Ω is introduced and the approximation

E_θ(u, v) = ∫_Ω λ ψ(ρ(v)) + (1/(2θ)) ‖u − v‖² + Σ_{i=1}^{k} |∇u_i| dx   (15)
is considered instead. If θ is small, v will be close to u near the minimum, and thus E will be close to E_θ. The key result of [45] is that Eq. (15) can be minimized very efficiently using an alternating scheme that iterates between solving a global minimization problem for each u_i, keeping v fixed,

argmin_{u_i} ∫_Ω (1/(2θ)) (u_i − v_i)² + |∇u_i| dx,   (16)

and a minimization problem for v with fixed u,

argmin_v ∫_Ω λ ψ(ρ(v)) + (1/(2θ)) ‖u − v‖² dx,   (17)

which can be solved pointwise. Details and a proof of convergence can be found in [45, 47]. Eq. (16) searches for a differentiable scalar field u_i that is, on the one hand, close to the fixed field v_i but, on the other hand, has small total variation. Chambolle has introduced an elegant, quickly computable and globally convergent solution to this problem, which we also employ in our minimization framework [47]. In Eq. (17) we use the alternate exposure image formation model and its differentiated version as the data-term ρ(v). In the next section we show in more detail how we employ the minimization scheme in our framework.

5.4 Implementation

In our case, we employ some small modifications adapted to our problem of minimizing the energy in terms of w1, w2 and s. First, we employ the above
[Fig. 5 workflow: initialize w1 = 0, w2 = 0, s = 0.5; for each level of the image pyramid and for a number of warps, compute the error from the current estimates; then, for each unknown w1, w2 and s, run a number of iterations that solve the pointwise problem Eq. (17) and the denoising problem Eq. (16). Outputs: motion fields and frame interpolation.]

Fig. 5. The workflow of the total variation approach determines forward and backward motion paths and occlusion times iteratively.
scheme, i.e., iterating between Eq. (16) and Eq. (17), by considering u = w1, u = w2 or u = s, respectively, to solve for each of the unknowns given a fixed approximation of the others. As the thresholding scheme of [45] is not directly applicable to our non-linear data-term, we apply a descent scheme for Eq. (17), profiting from the use of the differentiable function φ_ε. In order to speed up convergence, we implemented the algorithm on a scale pyramid with factor 0.5, initializing with s = 0.5 for the occlusion timing and w1 = w2 = 0 on the coarsest level. On each level of the pyramid we compute the remaining error with the current estimates and use this error to solve for s, w1 and w2. For each variable an instance of Eq. (16) and Eq. (17) has to be solved (Fig. 5). For Eq. (16), we employ the dual formulation detailed in [45], Proposition 1, using 5 iterations and a time step of τ = 1/8. For all experimental results with the total variation algorithm we use a 5-level image pyramid, 10 error-update iterations and 10 iterations to solve Eq. (17) and Eq. (16). Suitable values for the parameters α, β, γ and θ were found experimentally. For normalized intensity values we found θ ∈ (0, 1], α, β ∈ (0, 0.1] and γ ∈ [0, 0.5] to be suitable ranges. An evaluation of the sensitivity of the algorithm to the parameter choice was performed in [4] and showed that the algorithm yields high-quality results quite independently of the actual parameter values. Working on the 320 × 225 pixel test scene square (Fig. 6), the computation time of 191 seconds on a 3.06 GHz processor is independent of the parameters.
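For illustration, the TV denoising subproblem of Eq. (16) can be solved with Chambolle's dual projection algorithm [47]. The discretization below is a standard textbook version with the time step τ = 1/8 mentioned above; it is not necessarily identical to the authors' implementation. In the full algorithm this denoising step alternates, for each of the five unknown fields, with a pointwise descent on the coupled data-term of Eq. (17).

```python
import numpy as np

def grad(u):
    """Forward-difference gradient with Neumann boundary handling."""
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    """Backward-difference divergence, the adjoint of grad."""
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0];  dx[:, 1:] = px[:, 1:] - px[:, :-1]
    dy[0, :] = py[0, :];  dy[1:, :] = py[1:, :] - py[:-1, :]
    return dx + dy

def tv_denoise(v, theta, iters=5, tau=0.125):
    """Solve Eq. (16), min_u 1/(2*theta)*(u - v)^2 + |grad u|, with Chambolle's
    projection algorithm; the solution is u = v - theta * div(p)."""
    px = np.zeros_like(v); py = np.zeros_like(v)
    for _ in range(iters):
        gx, gy = grad(div(px, py) - v / theta)
        norm = 1.0 + tau * np.sqrt(gx**2 + gy**2)
        px = (px + tau * gx) / norm
        py = (py + tau * gy) / norm
    return v - theta * div(px, py)
```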
6 Comparison of Different Motion Estimation Algorithms
To evaluate motion field estimation with alternate exposure imaging, we consider synthetic test data as well as real-world recordings. For synthetic scenes with known ground-truth motion fields, we estimate motion fields with our algorithms as well as with related approaches [16, 7, 45]. We interpolate intermediate frames using the estimated motion paths and occlusion timings and compare them to ground-truth images and to images interpolated with ground-truth motion. We also show results for real-world recordings. The recordings were made with a PointGrey Flea2 camera that is able to acquire short- and long-exposure images in alternation.

6.1 Motion Fields for Synthetic Test Scenes
We consider synthetic test scenes containing different kinds of motion. The scene square (Fig. 6, first row) combines 10 pixels per time unit horizontal translational motion of the square with 15 pixels per time unit vertical motion of the background on a 225 × 320 pixels image. The 300 × 380 pixels scene Ben (Fig. 6, second row) contains 14 pixels per time unit translational motion in front of a static background. The scene windmill (Fig. 6, third row) contains 7◦ per time unit rotational motion approximately parallel to the image plane in front of a static background on 800 × 600 pixels images. In the 512 × 512 pixels images of the wheel scene,
(Fig. 6, fourth row), the wheel in the background rotates 7° per time unit while the foreground remains static. The challenge of the 800 × 600 pixel images in the scene corner (Fig. 6, fifth row) is an out-of-plane rotation of 10° around an axis parallel to the vertical image dimension, while the 320 × 240 pixel images of the scene fence (Fig. 6, sixth row) contain translational motion of the same extent as the moving object's width. To obtain the motion-blurred image IB we render and average 220–500 images. The first and the last rendered image represent the short-exposure images I1 and I2. Ground-truth 2D motion is determined from the known 3D scene motion. First of all, we test our pointwise least squares approach from Sect. 4 and the total variation approach from Sect. 5 on the synthetic datasets. We compare the results to state-of-the-art optical flow algorithms [16, 7, 45]. For a fair comparison, we provide the competing optical flow algorithms also with the short-exposure image I1.5, depicting the scene halfway between I1 and I2. We estimate the motion fields between I1 and I1.5 as well as between I1.5 and I2. The two results are then concatenated before comparing them to the ground-truth displacement field. As optical flow works best for small displacements [16], the error of the concatenation is considerably smaller than estimating the motion field between I1 and I2 directly. We choose the algorithm of Zach et al. [45] because it relies on the same mathematical framework as our total variation approach; however, our method uses a long-exposure image instead of a higher frame rate of short-exposure images. We also compare to the algorithm of Sand and Teller [7] on three images, because both our methods and their approach consider occlusion effects while estimating motion. As our algorithms are based on signal-theoretical ideas to prevent temporal aliasing, we incorporate a comparison to the algorithm of Lim et al. [16], which requires high-speed recordings as input. We simulate the high-speed camera with intermediate images such that the motion between two frames is smaller than 1 pixel. Tab. 3 shows that the total variation algorithm has the smallest average angular error (AAE) for all test scenes. Also, in all test scenes, except for the rotational motion parallel to the image plane of the scenes windmill and wheel,

Table 3. Comparison of different motion estimation methods for six synthetic test scenes: the motion fields computed using the total variation approach to alternate exposure imaging (AEI) consistently yield a smaller average angular error (AAE) than the least squares approach and competitive optical flow algorithms given three images [45, 7] or sequences of temporally oversampled images [16].
AAE [°]               Ben    square  windmill  wheel  corner  fence
Sand, Teller [7]      8.42   6.48    6.78      13.39  6.40    19.12
Zach et al. [45]      5.81   2.25    4.87      2.59   5.05    19.44
Lim et al. [16]       9.01   12.19   49.63     27.29  38.40   34.17
AEI, least squares    6.31   6.24    8.64      4.19   12.87   34.41
AEI, total variation  4.27   1.70    4.56      2.21   4.57    12.97
Fig. 6. (a) Short-exposure images I1 and (b) motion-blurred images IB were rendered so that (c) the ground-truth motion field is known in each of the scenes (from top to bottom) square, Ben, windmill, wheel, corner and fence. For comparison, motion fields are calculated with several different algorithms. (d) The algorithm of Lim et al. [16] needs a high number of input images and returns noisy motion fields. (e) While the approach of Sand and Teller [7] is prone to over-smoothing, (f) the approach of Zach et al. [45] assigns unpredictable motion fields to occluded points. (g) Spurious assignments at occlusion boundaries and insufficient regularization in textureless regions deteriorate the quality of our least-squares approach. (h) The total variation approach to alternate exposure images shows the most promising motion fields of all approaches.
Fig. 6. Continued.
Table 4. For the six synthetic test scenes, the average endpoint error (AEE) of the total variation approach to alternate exposure imaging is among the smallest in comparison to competitive optical flow estimation algorithms given three images [45, 7] or sequences of temporally oversampled images [16].

AEE [px]              Ben    square  windmill  wheel  corner  fence
Sand, Teller [7]      0.91   5.72    2.95      1.27   2.85    3.36
Zach et al. [45]      0.59   0.62    1.69      0.60   1.27    14.75
Lim et al. [16]       1.46   4.88    7.69      1.82   7.73    5.23
AEI, least squares    0.99   1.73    5.47      1.02   6.30    12.64
AEI, total variation  0.57   0.52    2.16      0.61   0.92    2.62
the total variation algorithm has the smallest average endpoint error (AEE), Tab. 4. The rotation within the image plane directly violates the assumption of linear motion paths in the image formation model, so here the alternate exposure algorithm is outperformed by the TV-L1 optical flow, which does not model the motion paths in the intermediate time between the frames. However, in the corner scene with out-of-plane rotation and severe self-occlusion, the total variation algorithm is able to produce the most accurate motion fields in average angular error as well as in average endpoint error. The least squares approach shows a higher numerical error than the total variation approach in all test cases. Though not competitive with the highly accurate approach of Zach et al. [45], the least squares approach outperforms the anti-aliased approach of Lim et al. [16] in all but the fence scene. In the fence scene the least squares approach fails to assign correct motion to the large occluded areas, as nearly all moving points in the image are occluded or disoccluded between I1 and I2. For the test scenes with planar motion, the least squares algorithm achieves results competitive to the occlusion-aware optical flow algorithm of Sand and Teller [7], while the motion field for the out-of-plane rotation of the corner scene, with its changing motion at the occluded points, is less accurate. Visual comparison of the motion fields (Fig. 6) shows that the small numerical error of the total variation approach is due to several reasons: while the algorithm of Lim et al. [16] returns noisy motion fields (Fig. 6d), the algorithm of Sand and Teller [7] tends to over-smooth motion discontinuities (Fig. 6e). The TV-L1 optical flow algorithm [45] assigns outlier motion vectors to occluded points (Fig. 6f). The quality of the least squares alternate exposure algorithm suffers considerably from noisy motion path detection and spurious motion assignments at non-detected occluded points (Fig. 6g). In contrast, the total variation approach to alternate exposure imaging stands out due to sharp motion boundaries and appropriate motion assignment at occlusion borders (Fig. 6h). As the explicit occlusion detection of the pointwise approach and the implicit occlusion detection of the global optimization approach are hard to compare visually (Fig. 7), we compare the results via frame interpolation under consideration of occlusion, Eq. (3).
Fig. 7. Shown for the scenes Ben and Ball : Occlusion timings of the least squares approach are determined only where the optimization residual exceeds a threshold (a) and (b). With the total variation approach occlusion timings are determined for every pixel, but are only well-defined at occlusion boundaries (c) and (d). Easier comparison of occlusion timings can be obtained by considering frame interpolation (see Fig. 9).
6.2 Frame Interpolation for Synthetic Test Scenes
We evaluate the estimated motion fields and occlusion timings of alternate exposure imaging in frame interpolation. For comparison, we also interpolate intermediate frames between I1 and I1.5 using the method introduced by Baker et al. [2] and using blending of forward- and backward-warped images. Neither of the two methods considers occlusion. We compare the interpolated frames to the ground-truth intermediate images. Fig. 8 gives an overview of the sum of squared differences (SSD) for all test scenes. Note that although the least squares algorithm has a higher AAE/AEE than the optical flow algorithm of Zach et al. [45], the interpolation error for some of the images, e.g. in the scene Ben, is considerably smaller than using the optical flow algorithm with either of the two interpolation methods. The interpolation with the motion paths from the total variation approach consistently shows better results than the optical flow based interpolation. Especially for translational motion, both the least squares and the total variation algorithm occasionally obtain a smaller SSD than interpolation with ground-truth motion. This is due to the fact that inaccuracies in the motion fields can be balanced by the successful handling of occlusion boundaries (Fig. 9).
6.3 Real-World Recordings
We also test our methods on real-world recordings. We use the built-in HDR mode of a PointGrey Flea2 camera to alter exposure time and gain between successive frames. By adjusting the gain, we ensure that corresponding pixels of static regions in the short-exposure and long-exposure images are approximately of the same intensity. With the HDR mode we are able to acquire I1, IB and I2 with a minimal time gap between the images. The remaining gap is due to the fixed 30 fps camera frame rate and the readout time of the sensor. As for the synthetic test scenes, we record a number of real test scenes with different challenges. All images are recorded with the same PointGrey Flea2 camera at a resolution of 640 × 480 pixels.
(Fig. 8 plots the SSD over the interpolation time t, from 0 to 0.5, for the scenes Ben, square, wheel, windmill, fence and corner.)
Fig. 8. The sum of squared differences (SSD) between interpolated images and ground-truth images. The dashed green (circled) line shows the SSD for forward interpolation with optical flow [45], while the continuous green (circled) line shows the SSD for forward-backward interpolation using the same optical flow. Red (crossed) dashed and continuous lines indicate the SSD for forward interpolation [2] or forward-backward interpolation, respectively, using ground-truth motion fields. The SSD obtained using least squares optimization for motion paths from alternate exposure imaging is indicated by the blue dashed line (diamonds) and the SSD obtained using total variation regularization for the motion paths is indicated by the blue continuous line (squares).
Fig. 9. (a) Interpolation at t = 0.25 with the method proposed in [2] and (b) blending of forward- and backward-warped images show artifacts at occlusion boundaries even when ground-truth motion fields are used, because occlusion information is not available. (c) Thresholded occlusion detection in the least squares approach to alternate exposure imaging fails to detect occlusion at some boundaries and exhibits remaining artifacts. (d) Interpolation with total variation regularized motion paths and occlusion timings reduces artifacts at occlusion boundaries.
The scene juggling (Fig. 10, first row) contains the large motion of a small ball, which additionally vanishes from the field of view of the camera. To ensure that the short-exposure images contain no or only little motion blur, their exposure time is set to 6.02 ms. However, the camera can only process an image every 33.33 ms. Using only short-exposure images, this would lead to 27.31 ms of unrecorded motion between sharp images. For our method, we record a long-exposure image with an exposure time of 39.65 ms. With our camera setup we measured a remaining gap between IB and the succeeding short-exposure image of 0.48 ms, which is due to the readout time of the sensor and other hardware constraints. IB reduces the gap and provides us with temporally anti-aliased information. The same camera setting was used for the walking scene (Fig. 10, second row), where a person walks by on a street and the leg moves by a distance on the order of its width. The scenes model train 1 and 2 (Fig. 10, third and fourth row) are also recorded with the same camera setting. Challenges in these scenes are the moving shadows and the highlight on the wagons, which violate the assumption that motion is the only reason for brightness changes in the scene. To test the flexibility of the approach to different foreground and background motions, the scene tracking (Fig. 10, fifth row) was recorded with a camera following the motion of the person in the foreground, i.e., objects in the background have a relative motion to the camera according to their depth. For the waving scene (Fig. 10, sixth row) we use exposure times of 20.71 ms and 124.27 ms, resulting in measured gaps of 12.45 ms and 0.48 ms, respectively. This scene combines different motions: the hands moving in opposite directions, the static background, and the occluded texture of the eye.
The motion fields estimated with the least squares and the total variation approach are also shown in Fig. 10. While the motion fields estimated by the least squares approach are mainly dominated by noise, closer inspection shows that in places where motion actually occurs it is often detected correctly, for example the ball flying out of the image in the juggling scene. Only the large, sparsely textured regions in the background do not provide enough information for the pointwise approach, so that any noise in the image can produce pronounced incorrect motion estimates. The results of the total variation approach look more promising. Although the background often provides only little texture, motion is generally estimated correctly. In the walking scene, the total variation approach is able to detect not only the motion of the leg, which moves approximately as far as its width, but also the motion of the hand faithfully. In the scenes model train 1 and 2 the total variation approach shows robustness to the moving shadows and the highlights on the last wagon. In the tracking scene both algorithms detect the motion of the dark backpack in front of the dark background correctly, and the total variation algorithm is additionally able to faithfully detect the motion of both hands. In the waving scene, the total variation algorithm is able to cope with the motion and the occluded texture.
Fig. 10. The built-in HDR mode of PointGrey cameras is able to alter exposure time and gain between succeeding frames so that (a) short, (b) long and (c) short exposures can be acquired successively at comparable brightness and with a minimal temporal gap between frames. Motion fields for the real-world scenes (from top to bottom) juggling, walking, model train 1, model train 2, tracking and waving are estimated with (d) the least squares approach (Sect. 4) and (e) the total variation approach (Sect. 5).
7 Discussion
In this section we first discuss the advantages and disadvantages of the two approaches to alternate exposure motion estimation, i.e., the least squares approach (Sect. 4) and the total variation approach (Sect. 5). Then we compare both to optical flow approaches that consider only short-exposure images.
7.1 Comparison of the Two Alternate Exposure Approaches
The least squares approach to alternate exposure imaging is able to estimate motion paths, forward/backward motion fields and occlusion timings from a set of three alternate exposure images. It makes some additional assumptions on the motion paths, e.g. symmetry, but requires no further regularization such as a smoothness constraint. The resulting error functional can be evaluated pointwise. Occlusion is detected by thresholding the optimization residual. For occluded pixels, motion paths are inferred based on superpixel comparison, but neighboring pixels are not considered in the motion path assignment. Although the assumption of symmetric forward and backward motion paths is only satisfied if an object moves parallel to the image plane, we also evaluate the algorithm on test scenes with more complex motion. In some of the synthetic scenes it outperforms some modern optical flow algorithms [16, 7] that are designed to handle occlusion or deal with temporal aliasing. As no regularization is necessary and the approach resolves ambiguities by additional assumptions, the resulting motion fields look visually quite noisy but are of reasonable accuracy. In the real-world test scenes, the least squares algorithm turns out to be very susceptible to noise and to inaccuracies in the gain correction of long- and short-exposure images. This is partially due to the pointwise estimation, which assigns large motion to noisy pixels, especially in regions with little texture. Additionally, the squared error term weights every outlier among the N + 1 equations very heavily, occasionally pushing the solution far from the desired one in order to satisfy the contribution of one noisy pixel.
The total variation approach requires regularization to resolve the ambiguities of the image formation model for unoccluded points, but makes no further assumptions. Considering spatial gradients in the regularization requires solving for the motion paths of all pixels simultaneously, so that a more sophisticated solution framework has to be applied. Occlusion time estimation is incorporated into the optimization process, so that a separate occlusion detection step is no longer necessary. Due to the regularization, the estimated motion fields look visually more pleasing and a desirable fill-in effect of motion into textureless regions occurs, while over-smoothing is prevented by the choice of the total variation as regularizer. Numerical evaluation on synthetic scenes shows that the estimated motion fields are indeed more accurate than comparable state-of-the-art optical flow algorithms [45, 7, 16]. Due to the implicit occlusion handling, the total variation approach can also deal with objects where every moving pixel is an occluding pixel, a situation like in the fence scene where the least squares approach fails. The images interpolated using the motion paths and occlusion timings of the total variation approach also have more exact occlusion borders than those obtained with the least squares approach, where undetected occlusion borders occasionally corrupt the interpolation. Finally, the total variation approach estimates convincing motion fields also for real-world recordings.
7.2 Limitations and Advantages of Alternate Exposure Imaging in Image-Based Motion Estimation
Motion field estimation from alternate exposure imaging shares some of the limitations inherent to all optical flow methods. As in all purely image-based methods, motion in poorly textured regions cannot be detected robustly. This can be seen in the black background of the waving scene (Fig. 10). Also common to all optical flow methods, we assume that motion is the only source of change in brightness, excluding highly reflective and transparent surfaces from the calculations. Furthermore, we made the assumption that the short-exposure images are free of motion blur. In practice this is true if the motion during the short exposure time is smaller than half a pixel.
Image noise is also a common problem in motion estimation. While the least squares approach is indeed susceptible to noise, the use of a suitable penalizer for the data term together with the total variation regularization deals with noise successfully. Additionally, for non-occluded points the total variation algorithm can choose the occlusion timing s so that zero-mean noise in the path integral cancels out much better than in the customary comparison of two single pixels.
In contrast to most optical flow methods, we are able to include occlusion explicitly in our image formation model. With the total variation approach, arbitrarily large occlusions as well as disocclusions can be handled under the assumption that a scene point changes its state of visibility only once. This assumption on the visibility state implies that, e.g., for a static background point an occluding object can move at most as far as its width before the background point reappears.
Our image formation model works with motion paths instead of displacement fields. While motion paths can theoretically have arbitrary forms, the assumption that they are linear allows for a simple parametrization. Linear motion paths imply that the displacement of all pixels on the path is uniform and of constant speed. But as motion paths are allowed to vary for neighboring pixels, the approach can also successfully handle much more complex motions.
Finally, while recording the alternate exposure sequence, we replace one short-exposure image with a long-exposure image. To show the sequence to a viewer uninterested in motion detection, the long-exposure frame may simply be skipped, or, to ensure a sufficient frame rate, intermediate images can be easily and quite faithfully interpolated with the proposed method.
8 Conclusion
Alternate exposure imaging has been introduced to record anti-aliased motion information as well as high-frequency content of a scene. From an image formation model connecting a long-exposure image with a preceding and a succeeding short-exposure image via motion paths and occlusion information, two algorithms can be derived that estimate motion fields as well as occlusion timings. The first algorithm is able to perform the estimation without the regularization that is usually necessary to solve the aperture problem in optical flow estimation. Although competitive on synthetic data, the lack of regularization makes the pointwise least squares approach susceptible to image noise and gain maladjustment in real-world recordings. In contrast, the total variation approach is not only more accurate than state-of-the-art optical flow on synthetic scenes, but it also shows convincing performance on real-world scenes. Notably, it is able to handle occlusion situations where state-of-the-art optical flow, which is based on two successive images, is destined to fail.
In our experiments, we also observed that the accuracy of the motion field is not the most important issue for frame interpolation. With our estimated motion fields, which contain some residual error, together with the occlusion timings, we are able to obtain interpolated frames that have a smaller numerical error than interpolation with ground-truth motion. In addition, the interpolated frames also look perceptually convincing, as, in contrast to traditional interpolation, our algorithms are able to reproduce occlusion borders correctly by making use of the estimated occlusion timings.
Acknowledgements. The authors gratefully acknowledge funding by the German Science Foundation under project DFG MA2555/4-1.
References
1. Barron, J., Fleet, D., Beauchemin, S.: Performance of optical flow techniques. IJCV 12(1), 43–77 (1994)
2. Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., Szeliski, R.: A database and evaluation methodology for optical flow. In: Proc. ICCV, pp. 1–8. IEEE, Los Alamitos (2007)
3. Christmas, W.: Filtering requirements for gradient-based optical flow measurement. T-IP 9, 1817–1820 (2000)
4. Sellent, A., Eisemann, M., Goldlücke, B., Cremers, D., Magnor, M.: Motion field estimation from alternate exposure images. T-PAMI (to appear)
5. Kundur, D., Hatzinakos, D.: Blind image deconvolution. IEEE Signal Processing Magazine 13, 43–64 (1996)
6. Xiao, J., Cheng, H., Sawhney, H., Rao, C., Isnardi, M.: Bilateral filtering-based optical flow estimation with occlusion detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 211–224. Springer, Heidelberg (2006)
7. Sand, P., Teller, S.: Particle video: Long-range motion estimation using point trajectories. IJCV 80, 72–91 (2008)
8. Alvarez, L., Deriche, R., Papadopoulo, T., Sanchez, J.: Symmetrical dense optical flow estimation with occlusions detection. IJCV 75, 371–385 (2007)
9. Xu, L., Jia, J., Matsushita, Y.: Motion detail preserving optical flow estimation. In: Proc. CVPR, pp. 1293–1300. IEEE, San Francisco (2010)
10. Sellent, A., Eisemann, M., Magnor, M.: Motion Field and Occlusion Time Estimation via Alternate Exposure Flow. In: Proc. ICCP. IEEE, Los Alamitos (2009)
11. Sellent, A., Eisemann, M., Goldlücke, B., Pock, T., Cremers, D., Magnor, M.: Variational optical flow from alternate exposure images. In: Proc. VMV, pp. 135–143 (2009)
12. Aggarwal, J., Nandhakumar, N.: On the computation of motion from sequences of images - a review. Proc. of the IEEE 76, 917–935 (1988)
13. Anandan, P.: A computational framework and an algorithm for the measurement of visual motion. IJCV 2, 283–310 (1989)
14. Alvarez, L., Weickert, J., Sánchez, J.: Reliable estimation of dense optical flow fields with large displacements. IJCV 39, 41–56 (2000)
15. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Heidelberg (2004)
16. Lim, S., Apostolopoulos, J., Gamal, A.: Optical flow estimation using temporally oversampled video. T-IP 14, 1074–1087 (2005)
17. Black, M.J., Anandan, P.: Robust dynamic motion estimation over time. In: Proc. CVPR, pp. 296–302 (1991)
18. Mahajan, D., Huang, F., Matusik, W., Ramamoorthi, R., Belhumeur, P.: Moving gradients. In: Proc. SIGGRAPH. ToG, vol. 28, pp. 1–11. ACM, New York (2009)
19. Yitzhaky, Y., Kopeika, N.: Identification of blur parameters from motion blurred images. Graphical Models and Image Processing 59, 310–320 (1997)
20. Pao, T., Kuo, M.: Estimation of the point spread function of a motion-blurred object from autocorrelation. In: Proc. of SPIE, vol. 2501 (2003)
21. Rekleitis, I.M.: Optical flow recognition from the power spectrum of a single blurred image. In: Proc. ICIP, pp. 791–794. IEEE, Los Alamitos (1996)
22. Jia, J.: Single image motion deblurring using transparency. In: Proc. CVPR, pp. 1–8. IEEE Computer Society, Los Alamitos (2007)
23. Dai, S., Wu, Y.: Motion from blur. In: Proc. CVPR, pp. 1–8. IEEE, Los Alamitos (2008)
24. Wang, J., Cohen, M.F.: Image and video matting: A survey. Foundations and Trends in Computer Graphics and Vision 3, 97–175 (2007)
25. Fergus, R., Singh, B., Hertzmann, A., Roweis, S., Freeman, W.: Removing camera shake from a single photograph. ToG (2006)
26. Xu, L., Jia, J.: Two-phase kernel estimation for robust motion deblurring. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 157–170. Springer, Heidelberg (2010)
27. Bardsley, J., Jefferies, S., Nagy, J., Plemmons, R.: Blind iterative restoration of images with spatially-varying blur. Optics Express 14, 1767–1782 (2006)
28. Levin, A.: Blind motion deblurring using image statistics. Advances in Neural Information Processing Systems 19, 841–848 (2007)
29. Tico, M., Vehvilainen, M.: Estimation of motion blur point spread function from differently exposed image frames. In: Proc. of Eusipco, Florence, Italy (2006)
30. Yuan, L., Sun, J., Quan, L., Shum, H.Y.: Image deblurring with blurred/noisy image pairs. In: Proc. SIGGRAPH. ToG, vol. 26, pp. 1–8. ACM, New York (2007)
31. Lim, S., Silverstein, A.: Estimation and removal of motion blur by capturing two images with different exposures (2008)
32. Ben-Ezra, M., Nayar, S.: Motion-based motion deblurring. T-PAMI 26, 689 (2004)
33. Tai, Y., Du, H., Brown, M., Lin, S.: Image/video deblurring using a hybrid camera. In: Proc. CVPR, pp. 1–8. IEEE Computer Society, Los Alamitos (2008)
34. Rav-Acha, A., Peleg, S.: Restoration of multiple images with motion blur in different directions. In: Workshop on Appl. of Comp. V., pp. 22–28. IEEE, Los Alamitos (2000)
35. Rav-Acha, A., Peleg, S.: Two motion-blurred images are better than one. Pattern Recognition Letters 26, 311–317 (2005)
36. Cho, T., Levin, A., Durand, F., Freeman, W.: Motion blur removal with orthogonal parabolic exposures. In: Proc. ICCP, pp. 1–8 (2010)
37. Chen, W.G., Nandhakumar, N., Martin, W.N.: Image motion estimation from motion smear - a new computational model. T-PAMI 18 (1996)
38. Chen, W.G., Nandhakumar, N., Martin, W.N.: Estimating image motion from smear: a sensor system and extensions. In: Proc. ICIP, pp. 199–202. IEEE, Los Alamitos (1995)
39. Favaro, P., Soatto, S.: A variational approach to scene reconstruction and image segmentation from motion-blur cues. In: Proc. CVPR. IEEE, Los Alamitos (2004)
40. Agrawal, A., Xu, Y., Raskar, R.: Invertible motion blur in video. In: Proc. SIGGRAPH. ToG, vol. 28, pp. 1–8. ACM, New York (2009)
41. Dennis, J., Schnabel, R.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs (1983)
42. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. IJCV 59 (2004)
43. Forsythe, G.E., Malcolm, M.A., Moler, C.B.: Computer Methods for Mathematical Computations. Prentice-Hall, Englewood Cliffs (1976)
44. Tikhonov, A., Arsenin, V.: Solutions of Ill-Posed Problems. Winston, NY (1977)
45. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
46. Black, M.J., Anandan, P.: The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Comp. V. and Img. Underst. 63, 75–104 (1996)
47. Chambolle, A.: An algorithm for total variation minimization and applications. Journal of Mathematical Image Visualization 20, 89–97 (2004)
Understanding What We Cannot See: Automatic Analysis of 4D Digital In-Line Holographic Microscopy Data
Laura Leal-Taixé¹, Matthias Heydt², Axel Rosenhahn²,³, and Bodo Rosenhahn¹
¹ Leibniz Universität Hannover, Appelstr. 9A, Hannover, Germany
[email protected]
² Applied Physical Chemistry, University of Heidelberg, INF 253, Heidelberg, Germany
³ Institute of Functional Interfaces, Karlsruhe Institute of Technology, Campus Nord, 76344 Eggenstein-Leopoldshafen, Germany
Abstract. Digital in-line holography is a microscopy technique which has received increasing attention over the last few years in the fields of microbiology, medicine and physics, as it provides an efficient way of measuring 3D microscopic data over time. In this paper, we present a complete system for the automatic analysis of digital in-line holographic data; we detect the 3D positions of the microorganisms, compute their trajectories over time and finally classify these trajectories according to their motion patterns. Tracking is performed using a robust method which evolves from the Hungarian bipartite weighted graph matching algorithm and allows us to deal with newly entering and leaving particles and to compensate for missing data and outliers. In order to fully understand the behavior of the microorganisms, we make use of Hidden Markov Models (HMMs) to classify four different motion patterns of a microorganism and to separate multiple patterns occurring within a trajectory. We present a complete set of experiments which show that our tracking method has an accuracy between 76% and 91% compared to ground truth data. The obtained classification rates on four full sequences (2500 frames) range between 83.5% and 100%.
Keywords: digital in-line holographic microscopy, particle tracking, graph matching, multi-level Hungarian, Hidden Markov Models, motion pattern classification.
1 Introduction
Many fields of interest in biology and other scientific research areas deal with intrinsically three-dimensional problems. The motility of swimming microorganisms such as bacteria or algae is of fundamental importance for topics like pathogen-host interactions [1], predator-prey interactions [1], biofilm formation [2], or biofouling by marine microorganisms [3, 4]. Understanding the motility and behavioral patterns of microorganisms allows us to understand their interaction with the environment and thus to control environmental
Fig. 1. (a) The input data, the projections obtained with digital in-line holography (inverted colors for better visualization). Sample trajectory in red. (b) The output data we want to obtain from each volume, the classification into four motion patterns, colored according to speed: orientation (1), wobbling (2), gyration (3) and intensive surface probing (4).
parameters to avoid unwanted consequences such as infections or biofouling. To study these effects in 3D, several attempts have been made: tracking light microscopy, capable of tracking one bacterium at a time [5], stereoscopy [6] or confocal microscopy [7]. Berg built a pioneering tracking light microscope, capable of tracking one bacterium at a time in 3D. This has been used to investigate bacteria like Escherichia coli [5]. Another way of measuring 3D trajectories is stereoscopy, which requires two synchronized cameras [6]. Confocal microscopy has also been used to study the motion of particles in colloidal systems over time; however, the nature of this scanning technique limits the obtainable frame rate [7]. For any of these techniques, in order to draw statistically relevant conclusions, thousands of images have to be analyzed. Nowadays, this analysis is still heavily dependent on manual intervention. Recent work [8] presents a complete vision system for 2D cell tracking, which demonstrates the increasing demand for efficient computer vision approaches in the field of microscopy as an emerging discipline.
The nearly automatic analysis of biological images has been studied extensively [9], but most of the work focuses on the position as well as on the shape of the particle [10]. Several methods exist for multiple object detection based on approaches such as Markov Chain Monte Carlo (MCMC) [11], inference in Bayesian networks [12] or the Nash equilibrium of game theory [13]. These have proven useful to track a fairly small number of targets, but are less appropriate when the number of targets is very large, as in our case. Statistical methods like Kalman filters [8], particle filters or recursive Bayesian filters [14] are widely used for tracking, but they need a dynamical model of the target, which can be challenging to obtain depending on the microorganism under study and to which we dedicate the second part of this paper. In contrast to [14, 8], we do not use the output predictions of the filters to deal with occlusions, but rather use past and future information to complete broken trajectories and detect false alarms. Therefore, we do not need an extra track linking step as in [8]. Furthermore, we deal with 3D trajectories and random and fast motions which are unsuited for a prediction-type approach. In this work we propose a globally optimal matching solution and not a local one as suggested in [15].
Besides generating motion trajectories from microscopic data, a subsequent classification allows biologists to obtain the desired information from the large image sets in a compact and compressed fashion. Indeed, the classification of motion patterns in biology is a well-studied topic [16], but identifying these patterns manually is a complicated and time-consuming task. Recently, machine learning and pattern recognition techniques have been introduced to analyze such complex movements in detail. These techniques include: Principal Component Analysis (PCA) [17], a linear transformation used to analyze high dimensional data; Bayesian models [18], which use a graph model and the rules of probability theory to select among different hypotheses; or Support Vector Machines (SVM) [19], which use training data to find the optimum parameters of the model representing each class. A comparison of machine learning approaches applied to biology can be found in [20]. In order to classify biological patterns, we need an approach able to handle time-varying signals. Hidden Markov Models [21] are statistical models especially known for their application in temporal pattern recognition. They were first used in speech recognition and since then HMMs have been extensively applied to vision. Applications vary from handwritten word recognition [22] and face recognition [23] to human action recognition [24, 25].
In this paper, we present a complete system for the automatic analysis of digital in-line holographic data. This microscopy technique provides videos of a 3D volume and is used to study complex movements of microorganisms. The huge amount of information that we can extract from holographic images makes it necessary to have an automatic method to analyze this complex 4D data. Our system performs the detection of the 3D positions, tracking of the complete trajectories and classification of motion patterns. For multiple microorganism tracking, we propose a geometrically motivated and globally optimal multi-level Hungarian to compensate for leaving and entering particles, recover from missing data and erase outliers to reconstruct the whole trajectory of the microorganisms [26]. Afterwards, we focus on the classification of four motion patterns of the green alga Ulva linza with the use of Hidden Markov Models [27]. Furthermore, our system is able to find and separate different patterns within a single sequence. Besides the classification of motion patterns, a key issue is the choice of features used to classify and distinguish the involved patterns. For this reason we perform an extensive analysis of the importance of typical motion parameters, such as velocity, curvature, orientation, etc. Our developed system is highly flexible and can easily be extended. Especially for forthcoming work on cells, microorganisms or human behavior, such automated algorithms are of pivotal importance as they allow high-throughput analysis of individual segments in motion data.
2 Detection of 3D Positions
In this section we present the details of digital in-line holography, explain how this microscopy technique allows us to obtain the 3D positions of the microorganisms, and describe the image processing methods used to robustly extract these positions from the images.
2.1 Digital In-Line Holographic Microscopy (DIHM)
Digital in-line holographic microscopy provides an alternative, lensless microscopy technique which intrinsically contains three-dimensional information about the investigated volume. It does not require a feedback control which responds to the motion, and it uses only one CCD chip. This makes the method very straightforward; it can be implemented with a very simple setup, as shown in Figure 2.
Fig. 2. Schematic setup for a digital in-line holographic experiment consisting of the laser, a spatial filter to create the divergent light cone, the objects of interest (e.g. microorganisms) and a detector which records the hologram
The holographic microscope requires only a divergent wavefront which is produced by diffraction of laser light from a pinhole. A CCD chip finally captures the hologram. The holographic microscope setup follows directly Gabor's initial idea [28] and has been implemented for laser radiation by Xu et al. [29]. A hologram recorded without the presence of particles, called the source image, is subtracted from each hologram. This is used to reduce the constant illumination background and other artifacts; there are filtering methods [30, 31] to achieve this in case a source image is not readily available. The resulting holograms can then be reconstructed back into real space by a Kirchhoff-Helmholtz transformation [29], shown in Equation (1):

K(\mathbf{r}) = \int_S d^2\xi \, I(\boldsymbol{\xi}) \, e^{\,ik\,\mathbf{r}\cdot\boldsymbol{\xi}/|\boldsymbol{\xi}|}    (1)
The integration extends over the 2D surface of the screen with coordinates ξ = (X, Y, L), where L is the distance from the source (pinhole) to the center of the detector (CCD chip), I(ξ) is the contrast image (hologram) on the screen obtained by subtracting the images with and without the object present, and k is the wave number, k = 2π/λ. As we can see in Figure 3, the idea behind the reconstruction is to obtain a series of stacked XY projections from the hologram image. These projections contain the information at different depth values. From these images, we can obtain the three final projections XY, XZ and YZ, as described in [32]. These projections contain the image
Fig. 3. Illustration of the reconstruction process. From the hologram a stack of XY projections is obtained at several depths and from those, the final three projections (XY, XZ and YZ) are obtained.
information of the complete observation volume, i.e. from every object located in the light cone between pinhole and detector. The resolution in X and Y is δ_{x,y} = λ/NA, where NA stands for the numerical aperture, given by NA = D/(2L), and D is the detector's side length. The resolution in the Z direction, which is the direction of the laser, is worse, δ_z = λ/(2NA²). This is because the third dimension, Z, is obtained with a mathematical reconstruction, unlike confocal microscopy, where the value of every voxel is returned. On the other hand, a confocal microscope takes a long time to return the values of all the voxels in a volume, and is therefore unsuited for tracking at a high frame rate. Using video sequences of holograms, it is possible to track multiple objects in 3D over time at a high frame rate, and multiple spores present in a single frame can be tracked simultaneously [3, 15, 33]. Using this advantage of digital in-line holographic microscopy, a number of 3D phenomena in microbiology have been investigated: Lewis et al. [34] examined the swimming speed of Alexandrium (Dinophyceae), Sheng et al. [35, 36] studied the swimming behavior of predatory dinoflagellates in the presence of prey, and Sun et al. [37] used a submersible device to investigate in situ plankton in the ocean.
2.2 Detection of the Microorganisms
In our sequences we observe the green alga Ulva linza, which has a spherical spore body and four flagella. Since the body scatters most of the light, the particles have a circular shape in the projected images. In order to preserve and enhance the particle shape (see Figure 4(a)) but reduce noise and illumination irregularities of the image (see Figure 4(b)), we apply the Laplacian of Gaussian (LoG) filter which, due to its shape, is a blob detector [38]:

LoG(x, y) = \frac{-1}{\pi\sigma^4}\left(1 - \frac{x^2 + y^2}{2\sigma^2}\right) e^{-\frac{x^2 + y^2}{2\sigma^2}}    (2)
Due to the divergent nature of the light cone, the particles can appear smaller or larger in the projections depending on the z-plane. Therefore, the LoG filter is applied at several scales [38] according to the magnification. Note that the whole algorithm is extremely
Fig. 4. (a) Enhancement of the shape of the microorganisms. (b) Reduction of the noise.
Fig. 5. From the 3D positions obtained at each time frame, we use the method in Section 3 to obtain the full trajectory of each microorganism
adaptable, since we can detect particles with any shape by just changing the filter. After this, we apply thresholding to each projection to obtain the positions of the candidate particles in the image. The final 3D positions (Figure 6, green box labeled "Candidate particles") are determined by thresholding each projection XY, XZ and YZ to find the particles in each image and cross-checking the information of the three projections. Once we have computed the 3D positions of all the microorganisms in all frames, we are interested in linking these 3D positions in order to find their complete 3D trajectories over time, a problem that is generally called multiple object tracking (see Figure 5).
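As an illustration of this detection step, the following Python sketch applies a scale-normalized LoG filter (SciPy's gaussian_laplace as a stand-in for Eq. (2)) at a few scales, thresholds the response, and cross-checks the three thresholded projections to form 3D candidates. The scale list, the threshold, the sign convention (bright particles on a darker background) and the projection indexing are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def detect_candidates(projection, sigmas=(2.0, 3.0, 4.0), rel_threshold=0.3):
    """Multi-scale LoG filtering of one projection followed by thresholding.

    Returns a boolean map of candidate particle pixels (assumes bright
    particles on a darker background; flip the sign otherwise)."""
    img = projection.astype(float)
    # scale-normalized, negated LoG: bright blobs give positive peaks
    responses = [-(s ** 2) * gaussian_laplace(img, s) for s in sigmas]
    best = np.max(responses, axis=0)
    return best > rel_threshold * best.max()

def candidate_positions_3d(xy_map, xz_map, yz_map):
    """Cross-check thresholded projections: xy_map[y, x], xz_map[z, x], yz_map[z, y]."""
    candidates = []
    ys, xs = np.nonzero(xy_map)
    for y, x in zip(ys, xs):
        # keep only depths supported by both side views
        zs = np.nonzero(xz_map[:, x] & yz_map[:, y])[0]
        candidates.extend((x, y, z) for z in zs)
    return candidates
```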
3 Automatic Extraction of 3D Trajectories
In this section we present the complete method to estimate the 3D trajectories of the microorganisms over time. Our algorithm, the Multi-level Hungarian, is a robust method
Fig. 6. Diagram of the algorithm described in Section 3.2
evolved from the Hungarian (Munkres) assignment method, and is capable of dealing with entering and leaving particles, missing data and outliers. The diagram of the method is presented in Figure 6.
3.1 Cost Function and Bipartite Graph Matching
Graph matching is one of the fundamental problems in graph theory and can be defined as follows: given a graph G = (V, E), where E represents its set of edges and V its set of nodes, a matching M in G is a set of pairwise non-adjacent edges, which means that no edges share a common vertex. For our application, we are especially interested in the assignment problem, which consists in finding a maximum weight matching in a weighted bipartite graph. In a general form, the problem can be expressed as: "There are N jobs and N workers. Any worker can be assigned to any job, incurring some cost that varies depending on the job-worker assignment. All jobs must be performed by assigning exactly one worker to each job in such a way that the total cost is minimized (or maximized)." For the subsets of vertices X and Y, we build a cost matrix in which the element C(i, j) represents the weight or cost related to vertex i in X and vertex j in Y. For numerical optimization we use the Hungarian or Munkres' assignment algorithm, a combinatorial optimization algorithm [39, 40] that solves the bipartite graph matching problem in polynomial time. For implementation details on the Hungarian we recommend [41]. Our initial problem configuration is: there are M particles in frame
Table 1. Summary of the advantages and disadvantages of the Hungarian algorithm
Advantages:
– Finds a global solution for all vertices
– Cost matrix is versatile
– Easy to solve, bipartite matching is the simplest of all graph problems
Disadvantages:
– Cannot handle missing vertices (a)
– Cannot handle entering or leaving particles (b)
– No discrimination of matches even if the cost is very high (c)
t1 and N particles in frame t2. The Hungarian will help us to find which particle in t1 corresponds to which particle in t2, allowing us to reconstruct their full trajectories in 3D space. Nonetheless, the Hungarian algorithm has some disadvantages which we should be aware of. In the context of our project, we summarize in Table 1 some of the advantages and disadvantages of the Hungarian algorithm. In the following sections, we present how to solve the three disadvantages: (a) is solved with the multi-level Hungarian method explained in Section 3.2, (b) is solved with the IN/OUT states of Section 3.1, and finally a solution for (c) is presented in Section 3.1 as a maximum cost restriction.
The cost function C as key input for the Hungarian algorithm is created using the Euclidean distances between particles, that is, element C(i, j) of the matrix represents the distance between particle i of frame t1 and particle j of frame t2. With this matrix, we need to solve a minimum assignment problem, since we are interested in matching those particles which are close to each other. Note that it is also possible to include other characteristics of the particle, like speed, size or gray level distribution, in the cost function. Such parameters can act as additional regularizers during trajectory estimation.
IN and OUT States. In order to include more knowledge about the environment in the Hungarian algorithm and avoid matches with very high costs, we have created a variation of the cost matrix. In our experiments, particles can only enter and leave the scene by crossing the borders of the Field Of View (FOV) of the holographic microscope; therefore, the creation and deletion of particles depends on their distance to the borders of the FOV. Nonetheless, the method can be easily extended to situations where trajectories are created (for example by cell division) or terminated (when the predator eats the prey) away from the FOV borders. As shown in Figure 7, we introduce the IN/OUT states in the cost matrix by adding extra rows and columns. If we are matching the particles in frame f to particles in frame f + 1, we will add as many columns as particles in frame f and as many rows as particles in frame f + 1. This way, all the particles have the possibility to enter/leave the scene. Additionally, this allows us to obtain a square matrix, needed for the matching algorithm, even if the number of particles is not the same in consecutive frames.
Fig. 7. Change in the cost matrix to include the IN/OUT states. Each particle is represented by a different color. The value of each extra element added is the distance between the particle position and the closest volume boundary.

C_{BB}(i, k) =
\begin{cases}
\min\left(|P_i - \{M_x, m_x, M_y, m_y, M_z\}|\right), & 1 \le i \le M \text{ and } k > N \\
\min\left(|P_k - \{M_x, m_x, M_y, m_y, M_z\}|\right), & 1 \le k \le N \text{ and } i > M \\
\min\left(C_{BB}(i, 1{:}k-1),\; C_{BB}(1{:}i-1, k)\right), & i > M \text{ and } k > N
\end{cases}    (3)
The cost of the added elements includes the information of the environment by calculating the distance of each particle to the nearest edge of the FOV, as in Equation (3), where M is the number of particles in frame t1 and N is the number of particles in frame t2, m_x, m_y, m_z are the low borders and M_x, M_y, M_z are the high borders for each of the axes. Note that the low border in the z axis is not included, as it represents the surface where the microorganisms might settle and, therefore, no particles can enter or leave from there. If the distance is small enough, the Hungarian algorithm matches the particle with an IN/OUT state. In Figure 8 we consider the simple scenario in which we have 4 particles in one frame and 4 in the next frame. As we can see, there is a particle which leaves the scene across the lower edge and a particle which enters the scene in the next frame from the upper right corner. As shown in Figure 8(a), the Hungarian algorithm finds a wrong matching, since the result is completely altered by the entering/leaving particles. With the introduction of the IN/OUT state feature, the particles are now correctly matched (see Figure 8(b)) and the ones which enter/leave the scene are identified as independent particles.
Fig. 8. Representation of the particles in frame t1 (left) and t2 (right). The lines represent the matchings. (a) Wrongly matched. (b) Correctly matched as a result of the IN/OUT state feature.
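The construction of the augmented cost matrix and its solution can be sketched as follows, using scipy.optimize.linear_sum_assignment as the Hungarian solver. The border-distance helper, the data layout and the use of a single global minimum for the added-row/added-column corner (a simplification of the recursive minimum in Eq. (3)) are our own assumptions; the maximum-cost restriction of Eq. (4), introduced next, can be applied to the same added entries with np.minimum.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def border_distance(p, lo, hi):
    """Distance of p = (x, y, z) to the closest FOV border,
    ignoring the low z border (the settlement surface)."""
    return min(p[0] - lo[0], hi[0] - p[0],
               p[1] - lo[1], hi[1] - p[1],
               hi[2] - p[2])

def match_frames(P1, P2, lo, hi):
    """Match M particles of frame t1 (P1) to N particles of frame t2 (P2)
    with IN/OUT states.  Returns the row and column indices of the matching."""
    P1, P2 = np.asarray(P1, float), np.asarray(P2, float)
    M, N = len(P1), len(P2)
    C = np.zeros((M + N, N + M))
    # real-to-real entries: Euclidean distances
    C[:M, :N] = np.linalg.norm(P1[:, None, :] - P2[None, :, :], axis=2)
    # real particle of t1 matched to an OUT column / of t2 to an IN row
    for i in range(M):
        C[i, N:] = border_distance(P1[i], lo, hi)
    for k in range(N):
        C[M:, k] = border_distance(P2[k], lo, hi)
    # added-row/added-column corner: smallest of the added costs
    # (a simplification of the recursive minimum of Eq. (3))
    added = np.concatenate([C[:M, N:].ravel(), C[M:, :N].ravel()])
    C[M:, N:] = added.min() if added.size else 0.0
    return linear_sum_assignment(C)
```

In the returned assignment, a pair (i, k) with i < M and k < N is a real particle-to-particle match, while any other pair corresponds to a particle entering or leaving the volume.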
Maximum Cost Restriction. Due to noise and illumination irregularities of the holograms, it is common that a particle is not detected in several frames, which means a particle can virtually disappear in the middle of the scene. If a particle is no longer detected, all the matches can be greatly affected. That is why we introduce a maximum cost restriction for the cost matrix, which does not allow matches with costs higher than a given threshold. This threshold is derived from the observed maximum speed V of the algae spores under study [32]. The restriction is guaranteed by using the same added elements as the ones used for the IN/OUT states; therefore, the final value of the added elements of the cost matrix will be

C(i, k) = min(C_{BB}(i, k), V Δt)    (4)
Overall, if a particle is near a volume border or cannot be matched to another particle within a reachable distance, it will be matched to an IN/OUT state. This ensures that the resulting matches are all physically possible. Still, if we have missing data and a certain particle is matched to an IN/OUT state, we will recover two trajectories instead of the complete one. In the next section we present a hierarchical solution to recover missing data by extending the matching to the temporal dimension.
3.2 Multi-level Hungarian for Missing Data
If we consider just the particles detected using the thresholding, we see that there are many gaps within a trajectory (see Figure 12(a)). These gaps can be a result of morphing (different object orientations yield different contrast), changes in the illumination, etc. The standard Hungarian is not capable of filling in the missing data and creating full trajectories; therefore, we now introduce a method based on the standard Hungarian that allows us to handle missing data and outliers and to create complete trajectories. The general routine of the algorithm, the multi-level Hungarian, is:
– Find the matchings between particles in frames [i − 2 . . . i + 2], so we know the position of each particle in each of these frames (if present) (Section 3.2).
– Build a table with all these positions and fill the gaps given some strict conditions. Let the algorithm converge until no particles are added (Section 3.2).
– On the same table and given some conditions, erase the outliers. Let the algorithm converge until no particles are deleted (Section 3.2).
The Levels of the Multi-level Hungarian. The multi-level Hungarian takes advantage of the temporal information in 5 consecutive frames and is able to recover from occlusions and gaps in up to two consecutive frames. The standard Hungarian gives us the matching between the particles in frame t1 and frame t2, and we use this to find matchings of the same particle in 5 consecutive frames, [i − 2, . . . , i + 2]. In order to find these matchings, the Hungarian is applied at different levels. The first two levels, represented in Figure 9 by red arrows, are created to find the matching of the particles in the frame of study, frame i. But it can also be the case that a particle is not present in frame i but is present in the other frames. To solve all the possible combinations given this fact, we use Levels 3, 4 and 5, represented in Figure 9 by green arrows.
Fig. 9. Represented frames: [i-2,i-1,i,i+1,i+2]. Levels of the multi-level Hungarian.
Below we show a detailed description and purpose of each level of the multi-level Hungarian:
– Level 1: Matches particles in frame i with frames i ± 1.
– Level 2: Matches particles in frame i with frames i ± 2. With the first two levels, we know, for all the particles in frame i, their position in the neighboring frames (if they appear).
– Level 3: Matches particles in frame i − 1 with frame i + 1.
– Level 4: Matches particles in frame i ± 1 with frame i ∓ 2. Levels 3 and 4 solve the detection of matchings when a particle appears in frames i ± 1 and might appear in i ± 2, but is not present in frame i.
– Level 5: Matches particles in frame i ± 1 with frame i ± 2.
Conditions to Add/Delete Particles. Once all the levels are applied hierarchically, a table with the matching information is created. On one axis we have the number of particles and on the other the 5 frames from [i − 2 . . . i + 2], as shown in Figure 10. To change the table information, we use two iterations: the adding iteration and the deleting iteration, which appear in Figure 6 as blue boxes. During the adding iteration,
we look for empty cells in the table where there is likely to be a particle. A new particle position is added if, and only if, two conditions are met:
1. There are at least 3 particles present in the row. Particles have continuity while noise points do not.
2. It is not the first or last particle of the row. We use this strict condition to avoid the creation of false particle positions or the incorrect elongation of trajectories.
Consider particle 6 of the table in Figure 10: we do not want to add any particle in frames i − 2 and i − 1, since the trajectory could be starting at frame i. In the case of particle 4, we do not want to add a particle in frame i + 2, because the trajectory could be ending at i + 1. Each iteration repeats this process for all frames, and we iterate until the number of particles added converges.
After convergence, the deleting iteration starts and we erase the outliers considered as noise. A new particle position is deleted if, and only if, two conditions are met:
1. The particle is present in the frame of study i.
2. There are fewer than 3 particles in the same row.
We only erase particles from the frame of study i because it can be the case that a particle appears blurry in the first frames but is later correctly detected and has more continuity. Therefore, we only delete particles for which we know the complete neighborhood. Each iteration repeats this process for all frames, and we iterate until the number of particles deleted converges. The resulting particles are shown in Figure 10; a sketch of both passes is given below.
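A compact sketch of the two iterations on such a table follows; the table layout (one row per particle with a position or None for each of the five frames) and the function names are our own, but the two conditions of each pass follow the description above.

```python
def adding_pass(table):
    """One adding iteration.  Each row of `table` holds a particle's position
    in frames [i-2 .. i+2], or None where it was not detected.
    Returns (row, frame) pairs where a position should be interpolated."""
    to_add = []
    for row in table:
        present = [f for f, pos in enumerate(row) if pos is not None]
        if len(present) < 3:              # condition 1: at least 3 detections in the row
            continue
        first, last = present[0], present[-1]
        for f, pos in enumerate(row):
            if pos is None and first < f < last:
                to_add.append((row, f))   # condition 2: not the first/last of the row
    return to_add

def deleting_pass(table, study_frame=2):
    """One deleting iteration: a position is an outlier if it is present in the
    frame of study i but the row contains fewer than 3 detections."""
    return [row for row in table
            if row[study_frame] is not None
            and sum(pos is not None for pos in row) < 3]
```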
Fig. 10. Table with: the initial particles detected by the multi-level Hungarian (green ellipses), the ones added in the adding iteration (yellow squares) and the ones deleted in the deleting iteration (red crosses). In the blank spaces no position has been added or deleted.
Missing Data Interpolation. During the adding iteration, we use the information of the filtered projections in order to find the correct position of the new particle (Figure 6). For example, if we want to add a particle in frame i − 1, we go to the filtered projections XY, XZ, YZ at t = i − 1, take the position of the corresponding particle at t = i or t = i − 2 and search for the maximum value within a window w. If the position found is already present in the candidate particles' list of that frame, we go back to the projection and determine the position of the second maximum value. This allows us to distinguish two particles which are close to each other.
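A sketch of this window search on a filtered XY projection might look as follows; the image indexing, the occupied-position bookkeeping and the square window shape are illustrative assumptions.

```python
import numpy as np

def locate_in_projection(filtered_xy, ref_pos, w, occupied):
    """Search the filtered XY projection (indexed [y, x]) around ref_pos = (x, y)
    within a window of half-size w for the strongest response whose position is
    not already in the candidate list `occupied` of that frame."""
    x0, y0 = int(round(ref_pos[0])), int(round(ref_pos[1]))
    h, width = filtered_xy.shape
    ys = slice(max(y0 - w, 0), min(y0 + w + 1, h))
    xs = slice(max(x0 - w, 0), min(x0 + w + 1, width))
    patch = filtered_xy[ys, xs].astype(float).copy()
    while np.isfinite(patch).any():
        dy, dx = np.unravel_index(np.argmax(patch), patch.shape)
        cand = (xs.start + dx, ys.start + dy)
        if cand not in occupied:
            return cand
        patch[dy, dx] = -np.inf   # position already taken: try the next maximum
    return None
```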
There are many studies on how to improve the particle depth-position resolution (z-position). As in [42], we use the traditional method of considering the maximum value of the particle as its center. Other, more complex methods [31] have been developed which also deal with different particle sizes, but the flexibility of using morphological filtering already allows us to easily adapt our algorithm.
3.3 The Final Hungarian
Once the final particle positions are obtained (Figure 6, orange box labeled "Final particles"), we perform one last step to determine the trajectories. We use the standard Hungarian to match particles from frame i to frame i + 1.
4 Motion Pattern Classification
In this section we describe the different types of motion patterns, as well as the design of the complete HMM and the features used for their classification.
4.1 Hidden Markov Models
Hidden Markov Models [21] are statistical models of sequential data widely used in many applications in artificial intelligence, speech and pattern recognition and the modeling of biological processes. In an HMM it is assumed that the system being modeled is a Markov process with unobserved states. This hidden stochastic process can only be observed through another set of stochastic processes that produce the sequence of symbols O = o1, o2, ..., oM. An HMM consists of a number N of states S1, S2, ..., SN, and the system is in one of these states at any given time. Every HMM can be defined by the triple λ = (Π, A, B). Π = {πi} is the vector of initial state probabilities. Each transition from Si to Sj occurs with a probability aij, where the aij sum to one over j; A = {aij} is the state transition matrix. In addition, each state Si generates an output ok with a probability distribution bik = P(ok|Si); B = {bik} is the emission matrix. There are three main problems related to HMMs:
1. The evaluation problem: for a sequence of observations O, compute the probability P(O|λ) that an HMM λ generated O. This is solved using the Forward-Backward algorithm.
2. The estimation problem: given O and an HMM λ, recover the most likely state sequence that generated O. This problem is solved by the Viterbi algorithm, a dynamic programming algorithm that computes the most likely sequence of hidden states in O(N²T) time.
3. The optimization problem: find the parameters of the HMM λ which maximize P(O|λ) for some output sequence O. A local maximum likelihood can be derived efficiently using the Baum-Welch algorithm.
For a more detailed introduction to HMM theory, we refer to [21].
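For concreteness, a minimal log-space implementation of the Viterbi recursion (the estimation problem above) is sketched below in Python/NumPy; it is a generic textbook version, not the authors' MATLAB implementation, and the small constant added before taking logarithms is only there to avoid log(0).

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden state sequence for an HMM lambda = (pi, A, B).

    pi: (N,) initial probabilities, A: (N, N) transition matrix,
    B: (N, K) emission matrix, obs: list of observed symbol indices.
    Runs in O(N^2 T) time, as stated above."""
    eps = 1e-12                               # avoid log(0)
    logA, logB = np.log(A + eps), np.log(B + eps)
    N, T = len(pi), len(obs)
    delta = np.log(pi + eps) + logB[:, obs[0]]
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA        # score of (previous, current) state pairs
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```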
4.2 Types of Patterns
In our experimental setup we are interested in four patterns shown by the green alga Ulva linza, as depicted in Figure 1(b): Orientation (1), Wobbling (2), Gyration (3) and intensive surface probing or Spinning (4). These characteristic swimming patterns are highly similar to the patterns observed before in [43] for the brown alga Hincksia irregularis.
Orientation. Trajectory 1 in Figure 1(b) is an example of the Orientation pattern. This pattern typically occurs in solution and far away from surfaces. The most important characteristics of the pattern are the high swimming speed (a mean of 150 μm/s) and a straight swimming motion with moderate turning angles.
Wobbling. Pattern 2 is called the Wobbling pattern and its main characteristic is a much slower mean velocity of around 50 μm/s. The spores assigned to this pattern often change their direction of movement and only swim in straight lines for very short distances. Compared to the Orientation pattern this leads to less smooth trajectories.
Gyration. Trajectory 3 is an example of the Gyration pattern. This pattern is extremely important for the exploration of surfaces, as occasional surface contacts are observable. The behavior in solution is similar to the Orientation pattern. Since in this pattern spores often switch between swimming towards and away from the surfaces, it can be interpreted as a pre-stage to surface probing.
Intensive Surface Probing and Spinning. Pattern 4 involves swimming in circles close to the surface within a very limited region. After a certain exploration time, the spores can either permanently attach or leave the surface for the next position and start swimming in circular patterns again. This motion is characterized by decreased mean velocities of about 30 μm/s in combination with a higher tendency to change direction (see Figure 1(b), case 4).
4.3 Features Used for Classification
An analysis of the features used for classification is presented in this section. Most of the features are generally used in motion analysis problems. An intrinsic characteristic of digital in-line holographic microscopy is the lower resolution of the Z position compared to the X, Y resolution [31]. Since many of the following features depend on the depth value, we average the measurements within 5 frames in order to reduce the noise of such features. The four characteristic features used are:
– v, velocity: the speed of the particles is an important descriptive feature, as we can see in Figure 1(b). We use only the magnitude of the speed vector, since the direction is described by the next two parameters. Range is [0, maxSpeed], where maxSpeed is the maximum speed of the particles as found experimentally in [32].
– α, angle between velocities: it measures the change in direction, distinguishing stable patterns from random ones. Range is [0, 180].
– β, angle to the normal of the surface: it measures how the particle approaches the surface or how it swims above it. Range is [0, 180].
– D, distance to the surface: this can be a key feature to differentiate surface-induced movements from general movements. Range is (mz, Mz], where mz and Mz are the z limits of the volume under study.
In order to work with Hidden Markov Models, we need to represent the features for each pattern with a fixed set of symbols. The total number of symbols depends on the number of symbols used to represent each feature, Nsymbols = Nv Nα Nβ ND. In order to convert the symbols of the individual features into a unique symbol for the HMM, we use Equation (5), where J is the final symbol we are looking for, J1..4 are the symbols for each of the features, in the range [1..NJ1..4], and NJ1..4 are the numbers of symbols per feature:

J = J1 + (J2 − 1) NJ1 + (J3 − 1) NJ1 NJ2 + (J4 − 1) NJ1 NJ2 NJ3    (5)
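The following sketch shows how the four features could be quantized and folded into a single observation symbol according to Eq. (5); the number of bins per feature (four each here) is an assumption for illustration, since the text does not fix Nv, Nα, Nβ and ND at this point.

```python
import numpy as np

def quantize(value, low, high, n_bins):
    """Map a feature value to a 1-based symbol in [1 .. n_bins]."""
    frac = np.clip((value - low) / float(high - low), 0.0, 1.0 - 1e-9)
    return int(frac * n_bins) + 1

def combined_symbol(v, alpha, beta, d, max_speed, m_z, M_z,
                    n_v=4, n_alpha=4, n_beta=4, n_d=4):
    """Fold the four per-feature symbols into one observation, Eq. (5)."""
    J1 = quantize(v, 0.0, max_speed, n_v)       # velocity magnitude
    J2 = quantize(alpha, 0.0, 180.0, n_alpha)   # angle between velocities
    J3 = quantize(beta, 0.0, 180.0, n_beta)     # angle to the surface normal
    J4 = quantize(d, m_z, M_z, n_d)             # distance to the surface
    return (J1 + (J2 - 1) * n_v
               + (J3 - 1) * n_v * n_alpha
               + (J4 - 1) * n_v * n_alpha * n_beta)
```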
In the next sections we present how to use the resulting symbols to train the HMMs. The symbols are the observations of the HMM; therefore, the training process gives us the probability of emitting each symbol for each of the states.
4.4 Building and Training the HMMs
In speech recognition, an HMM is trained for each of the phonemes of a language. Later, words are constructed by concatenating several HMMs of the phonemes that form the word; HMMs for sentences can even be created by concatenating HMMs of words, etc. We take a similar hierarchical approach in this paper. We train one HMM for each of the patterns and then we combine them into a unique Markov chain with a simple yet effective design that is able to describe any pattern or combination of patterns. This approach can be used in any problem where multiple motion patterns are present.
Individual HMM Per Pattern. In order to represent each pattern, we build a Markov chain with N states and we only allow the model to stay in the same state or move one state forward. Finally, from state N we can also go back to state 1. The number of states N is found empirically using the training data (we use N = 4 for all the experiments, see Section 5.4). The HMM is trained using the Baum-Welch algorithm to obtain the transition and emission matrices.
Complete HMM. The idea of having a complete HMM that represents all the patterns is that we can classify not only sequences where a single pattern is present, but also sequences where the particle makes transitions between different patterns. In Figure 11(a) we can see a representation of the complete model, while the design of the transition matrix is depicted in Figure 11(b). The four individual HMMs for each of the patterns are placed in parallel (blue). In order to deal with the transitions we create two special states: the START and the SWITCH state. The START state is created only to allow the system to begin at any pattern (orange). We define Pstart = PSwitchToModel = (1 − Pswitch)/NP, where NP is the number of patterns. As START does not contain any information about the pattern, it does not emit any symbol.
Fig. 11. (a) Complete HMM created to include changes between patterns within one trajectory. (b) Transition matrix of the complete HMM.
The purpose of the new state SWITCH is to make transitions easier. Imagine a given trajectory which makes a transition from Pattern 1 to Pattern 2. While transitioning, the features create a symbol that belongs to neither Pattern 1 nor Pattern 2. The system can then go to state SWITCH to emit that symbol and continue to Pattern 2. Therefore, all SWITCH emission probabilities are $1/N_{symbols}$. Since SWITCH is such a convenient state, we need to impose restrictive conditions so that the system does not enter or stay in SWITCH too often. This is controlled by the parameter $P_{switch}$, set at the minimum value of all the $P_{model}$ minus a small ε. This way, we ensure that $P_{switch}$ is the lowest transition probability in the system. Finally, the sequence of states given by the Viterbi algorithm determines the motion pattern observed. Our implementation uses the standard MATLAB HMM functions.
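The assembly of the complete transition matrix of Figure 11(b) can be sketched as follows. This is a NumPy illustration of the design under the stated rules (a START state, a SWITCH state and per-pattern chains rescaled by 1 − P_switch), not the authors' MATLAB implementation; the handling of the non-emitting START state is a simplification.

```python
import numpy as np

def build_complete_hmm(As, Bs, n_symbols, eps=1e-3):
    """Assemble the complete chain of Fig. 11 from the per-pattern HMMs.

    As : list of (N, N) transition matrices, one Baum-Welch-trained HMM per pattern
    Bs : list of (N, n_symbols) emission matrices
    Returns (A, B) with state order [START, SWITCH, pattern 1 states, ..., pattern N_P states].
    """
    n_pat, N = len(As), As[0].shape[0]
    S = 2 + n_pat * N
    START, SWITCH = 0, 1

    # P_switch is forced below every trained transition probability (minus a small eps).
    p_switch = min(A[A > 0].min() for A in As) - eps
    p_enter = (1.0 - p_switch) / n_pat            # P_start and P_SwitchToModel

    A = np.zeros((S, S))
    B = np.zeros((S, n_symbols))
    for p, (Ap, Bp) in enumerate(zip(As, Bs)):
        s0 = 2 + p * N                            # first state of pattern p
        A[START, s0] = p_enter                    # START may begin in any pattern
        A[SWITCH, s0] = p_enter                   # SWITCH may continue with any pattern
        A[s0:s0 + N, s0:s0 + N] = Ap * (1.0 - p_switch)   # model transitions, rescaled
        A[s0:s0 + N, SWITCH] = p_switch           # every model state may jump to SWITCH
        B[s0:s0 + N] = Bp

    A[SWITCH, SWITCH] = p_switch                  # staying in SWITCH is discouraged
    A[START, SWITCH] = p_switch                   # keeps the START row stochastic (a sketch choice)
    B[SWITCH] = 1.0 / n_symbols                   # SWITCH emits every symbol uniformly
    B[START] = 1.0 / n_symbols                    # placeholder: START is non-emitting in the paper
    return A, B
```

Decoding an observation sequence with the Viterbi algorithm on (A, B) then yields, for every frame, the pattern block the path traverses, with excursions through SWITCH marking the transitions.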
5 Experimental Results

In order to test our algorithm we use 6 sequences (labeled S1 to S6) in which the swimming motion of Ulva linza spores is observed [3]. All the sequences have some particle positions which have been semi-automatically reconstructed and manually labeled and inspected (our ground truth) for later comparison with our fully-automatic results.

5.1 Performance of the Standard Hungarian

First of all, we want to show the performance of the final standard Hungarian described in Section 3.3. For this, we use the ground truth particle positions and apply the Hungarian algorithm to determine the complete trajectories of the microorganisms. Comparing the automatic matches to the ground truth, we can see that in 67% of all the sequences, the total number of particles in the sequence is correctly detected, while in the remaining 33%, there is just a 5% difference in the number of particles. The average accuracy of the matchings reaches 96.61%.
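For reference, the frame-to-frame assignment step underlying these experiments can be reproduced with an off-the-shelf Hungarian solver; the following sketch is only a simplified stand-in (the gating distance is an assumed value and the multi-level extensions described in Section 3 are not shown):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_frames(prev_xyz, next_xyz, max_dist=50.0):
    """One-to-one matching of particle positions between two consecutive frames.

    prev_xyz, next_xyz : (M, 3) and (K, 3) arrays of particle positions.
    Returns a list of (i, j) index pairs; pairs farther apart than max_dist are
    rejected (max_dist is an illustrative gating value, not the paper's).
    """
    cost = cdist(prev_xyz, next_xyz)            # Euclidean distance matrix
    rows, cols = linear_sum_assignment(cost)    # Hungarian / Munkres assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_dist]
```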
To further test the robustness of the Hungarian algorithm, we add random noise to each position of our particles. The added noise is in the same order as the noise intrinsically present in the reconstructed images, determined experimentally in [32]. N = 100 experiments are performed on each of the sequences and the accuracy is recorded. Results show that the average accuracy of the matching is only reduced from 96.61% to 93.51%, making the Hungarian algorithm very robust to the noise present in the holographic images and therefore well suited to find the trajectories of the particles.

5.2 Performance of the Multi-level Hungarian

To test the performance of the multi-level Hungarian we apply the method to three sets of particles:

– Set A: particles determined by the threshold (pre multi-level Hungarian)
– Set B: particles corrected after the multi-level Hungarian
– Set C: ground truth particles, containing all the manually labeled particles

We start by comparing the number of particles detected, as shown in Table 2. The number of particles detected in Set A is drastically reduced in Set B, after applying the multi-level Hungarian, demonstrating its ability to compensate for missing data and to merge trajectories. If we compare it to Set C, we see that
Fig. 12. (a) 3 separate trajectories are detected with the standard Hungarian (blue dashed line). Merged trajectory detected with our method (with a smoothing term, red line). Missing data spots marked by arrows. (b), (c) Ground truth trajectories (blue dashed line). Trajectories automatically detected with our method (red line).
Table 2. Comparison of the number of particles detected by thresholding, by the multi-level Hungarian and the ground truth

        S1     S2     S3     S4     S5     S6
Set A   1599   1110   579    668    1148   2336
Set B   236    163    130    142    189    830
Set C   40     143    44     54     49     48
Table 3. Comparison of the trajectories' average length

        S1    S2    S3    S4    S5    S6
Set A   3     5     5     4     6     7
Set B   19    31    27    23    38    23
Set C   58    54    54    70    126   105
the number is still too high, indicating possible tracks which were not merged and were therefore detected as independent. Nonetheless, as we do not know exactly the number of particles present in a volume (not all particle positions have been labeled), it is of great value for us to compare the average length of the trajectories, defined as the number of frames in which the same particle is present. The results are shown in Table 3, where we can clearly see that the average length of a trajectory is greatly improved with the multi-level Hungarian. This is crucial, since long trajectories give us more information on the behavior of the particles. Let us now consider only the trajectories useful for particle analysis, that is, trajectories longer than 25 frames, which are the ones used later for motion pattern classification. Tracking with the standard Hungarian returns 20.7% of useful trajectories from a volume, while the multi-level Hungarian allows us to extract 30.1%. In the end, this means that we can obtain more useful information from each analyzed volume and, ultimately, that fewer volumes have to be analyzed in order to have enough information to draw conclusions about the behavior of a microorganism.

5.3 Performance of the Complete Algorithm

Finally, we are interested in determining the performance of the complete algorithm, including detection and tracking. For this comparison, we are going to present two values:

– Missing: percentage of ground truth particles which are not present in the automatic determination
– Extra: percentage of automatic particles that do not appear in the ground truth data

In Table 4 we show the detailed results for each surface. Our automatic algorithm detects between 76% and 91% of the particles present in the volume. This gives us a measure of how reliable our method is, since it is able to
Table 4. Missing labeled and extra automatic particles

              S1     S2     S3     S4     S5     S6
Missing (%)   8.9    20.7   19.1   23.6   11.5   12.9
Extra (%)     54.9   34.1   46.5   13.3   25.8   74.6
detect most of our verified particle positions. Putting this information together with the percentage of particles detected by our algorithm but not labeled, we can see that our method extracts much more information from the volume of study. This is clear in the case of S6, where we have a volume with many crossing particles which are difficult to label manually and where our algorithm gives us almost 75% more information. We now consider the actual trajectories and particle positions and measure the position error of our method. The error is measured as the Euclidean distance between each point of the ground truth and the automatic trajectories, both at time t. In Figure 12(a) we can see the 3 independent trajectories found with the standard Hungarian and the final merged trajectory, which demonstrates the ability of our algorithm to fill in the gaps (indicated by arrows). In Figure 12(b) we can see that the automatic trajectory is much shorter (there is a length difference of 105 frames), although the common part is very similar, with an error of just 4.2 μm. Figure 12(c), on the other hand, shows a perfectly matched trajectory with a length difference of 8 frames and an error of 6.4 μm for the whole trajectory, which is around twice the diameter of the spore body. This proves that the determination of the particle position is accurate but the merging of trajectories can be improved. The next sections are dedicated to several experimental results on the automatic classification of biological motion patterns. All the trajectories used from now on are obtained automatically with the method described in Section 3 and are classified manually by experts, which we refer to as our ground truth classification data.

5.4 Evaluation of the Features Used for Classification

The experiments in this section have the purpose of determining the impact of each feature on the correct classification of each pattern. We perform leave-one-out tests on our training data, which consists of 525 trajectories: 78 for wobbling, 181 for gyration, 202 for orientation and 64 for intensive surface probing. The first experiment that we conduct (see Figure 13) is to determine the effect of each parameter on the classification of all the patterns. The number of symbols and states can only be determined empirically since they depend heavily on the amount of training data. In our experiments, we found the best set of parameters to be N = 4, Nv = 4, Nα = 3, Nβ = 3 and ND = 3, for which we obtain a classification rate of 83.86%. For each test, we set one parameter to 1, which means that the corresponding feature has no effect in the classification process. For example, the first bar in blue labeled "No Depth" is obtained with ND = 1. The classification rate for each pattern (labeled from 1 to 4) as well as the mean for all the patterns (labeled Total) is recorded. As we can see, the angle α and the normal β information are the least relevant features, since the classification rate with and without these features is almost the same.
Fig. 13. Classification rate for parameters N = 4, Nv = 4, Nα = 3, Nβ = 3 and ND = 3. On each experiment, one of the features is not used. In the last experiment all features are used.
The angle information depends on the z component and, as explained in Section 4.3, the lower resolution in z can result in noisy measurements. In this case, the trade-off is between having noisy angle data which can be unreliable, or an average measure which is less discriminative for classification. The most distinguishing feature according to Figure 13 is the speed. Without it, the total classification rate decreases to 55.51% and down to just 11.05% for the orientation pattern. Based on the previous results, we could think of using only the depth and speed information for classification. But if Nα = Nβ = 1, the rate goes down to 79.69%. That means that we need one of the two measures for correct classification. The parameters used are: N = 4, Nv = 4, Nα = 1, Nβ = 3 and ND = 3, for which we obtain a classification rate of 83.5%. This rate is very close to the result with Nα = 3, with the advantage that we now use fewer symbols to represent the same information. Several tests led us to choose N = 4 states. The confusion matrix for these parameters is shown in Figure 14. As we can see, patterns 3 and 4 are correctly classified. The common misclassifications occur when Orientation (1) is classified as Gyration (3), or when Wobbling (2) is classified as Spinning (4). In the next section we discuss these misclassifications in detail.
           1 - Ori   2 - Wob   3 - Gyr   4 - Spin
1 - Ori    0.75      0.09      0.16      –
2 - Wob    0.07      0.68      0.01      0.24
3 - Gyr    0.01      –         0.94      0.05
4 - Spin   –         0.02      –         0.98
Fig. 14. Confusion matrix; parameters N = 4, Nv = 4, Nα = 1, Nβ = 3 and ND = 3
Fig. 15. (a) Wobbling (pattern 2) misclassified as Spinning (4). (b) Gyration (3) misclassified as Orientation (1). Color coded according to speed as in Figure 1(b).
Fig. 16. Sequences containing two patterns within one trajectory. (a) Gyration (3) + Spinning (4). Zoom on the spinning part. Color coded according to speed as in Figure 1(b). (b) Orientation (1, red) + Gyration (3, yellow). Transition marked in blue and indicated by an arrow.
Fig. 17. Complete volume with patterns: Orientation (1, red), Wobbling (2, green), Gyration (3, yellow). The Spinning (4) pattern is not present in this sequence. Patterns which are too short to be classified are plotted in black.
5.5 Classification on Other Sequences

In this section, we present the performance of the algorithm when several patterns appear within one trajectory and also analyze the typical misclassifications. As test data we use four sequences which contain 27, 40, 49 and 11 trajectories, respectively. We obtain classification rates of 100%, 85%, 89.8% and 100%, respectively. Note that for the third sequence, 60% of the misclassifications are only partial, which means that the model detects that there are several patterns but only one of them is misclassified. One of the misclassifications that can occur is that Wobbling (2) is classified as Spinning (4). Both motion patterns have similar speed values and the only truly differentiating characteristics are the depth and the angle α. Since we use 3 symbols for depth, the fact that the microorganism touches the surface or swims near the surface leads to the same classification. That is the case in Figure 15(a), in which the model chooses the Spinning pattern (4) because the speed is very low (dark blue), while the speed in the Wobbling pattern can sometimes be a little higher (light blue). As commented in Section 4.2, Gyration (3) and Orientation (1) are two linked patterns. The behavior of gyration in solution is similar to the orientation pattern, which is why the misclassification shown in Figure 15(b) can happen. In this case, since the microorganism does not interact with the surface and the speed of the pattern is high (red color), the model detects it as an orientation pattern. We note that this pattern is difficult to classify, even for a trained expert, since the transition from orientation into gyration usually occurs gradually as spores swim towards the surface and interrupt the swimming pattern (which is very similar to the orientation pattern) by short surface contacts. In general, the model has proven to handle changes between patterns extremely well. In Figure 16(a), we see the transition between Gyration (3) and Spinning (4).
In Figure 16(b), color coded according to classification, we can see how the model detects the Orientation part (red) and the Gyration part (yellow) perfectly well. The model performs a quick transition (marked in blue) and during this period the model stays in the SWITCH state. We have verified that all the transition periods detected by the model lie within the manually annotated transition boundaries marked by experts, even when there is more than one transition present in a trajectory. The classification results on a full sequence are shown in Figure 17. Finally, we can obtain the probability of each transition (e.g. from Orientation to Spinning) for a given dataset under study. This is extremely useful for experts to understand the behavior of a certain microorganism under varying conditions.
6 Conclusions

In this paper we presented a fully-automatic method to analyze 4D digital in-line holographic microscopy videos of moving microorganisms by detecting the microorganisms, tracking their full trajectories and classifying the obtained trajectories into meaningful motion patterns. The detection of the microorganisms is based on a simple blob detector and can be easily adapted to any microorganism shape. To perform multiple object tracking, we modified the standard Hungarian graph matching algorithm so that it is able to overcome the disadvantages of the classical approach. The new multi-level Hungarian recovers from missing data, discards outliers and is able to incorporate geometrical information in order to account for entering and leaving particles. The automatically determined trajectories are compared with ground truth data, showing that the method detects between 75% and 90% of the labeled particles. For motion pattern classification, we presented a simple yet effective hierarchical design which combines multiple trained Hidden Markov Models (one for each of the patterns) and has proved successful in identifying different patterns within a single trajectory. The experiments performed on four full sequences result in a total classification rate between 83.5% and 100%. Our system has proven to be a helpful tool for biologists and physicists as it provides a vast amount of analyzed data in an easy and fast way. As future work, we plan on further improving the tracking results by using a network flow approach, which will be especially useful for volumes with a high density of microorganisms.

Acknowledgements. This work has been funded by the German Research Foundation, DFG projects RO 2497/7-1 and RO 2524/2-1, and by the Office of Naval Research, grant N00014-08-1-1116.
References 1. Ginger, M., Portman, N., McKean, P.: Swimming with protists: perception, motility and flagellum assembly. Nature Reviews Microbiology 6(11), 838–850 (2008) 2. Stoodley, P., Sauer, K., Davies, D., Costerton, J.: Biofilms as complex differentiated communities. Annual Review of Microbiology 56, 187–209 (2002)
3. Heydt, M., Rosenhahn, A., Grunze, M., Pettitt, M., Callow, M.E., Callow, J.A.: Digital inline holography as a 3d tool to study motile marine organisms during their exploration of surfaces. The Journal of Adhesion 83(5), 417–430 (2007) 4. Rosenhahn, A., Ederth, T., Pettitt, M.: Advanced nanostructures for the control of biofouling: The fp6 eu integrated project ambio. Biointerphases 3(1), IR1–IR5 (2008) 5. Frymier, P., Ford, R., Berg, H., Cummings, P.: 3d tracking of motile bacteria near a solid planar surface. Proc. Natl. Acad. Sci. U.S.A. 92(13), 6195–6199 (1995) 6. Baba, S., Inomata, S., Ooya, M., Mogami, Y., Izumikurotani, A.: 3-dimensional recording and measurement of swimming paths of microorganisms with 2 synchronized monochrome cameras. Review of Scientific Instruments 62(2), 540–541 (1991) 7. Weeks, E., Crocker, J., Levitt, A., Schofield, A., Weitz, D.: 3d direct imaging of structural relaxation near the colloidal glass transition. Science 287(5452), 627–631 (2000) 8. Li, K., Miller, E., Chen, M., Kanade, T., Weiss, L., Campbell, P.: Cell population tracking and lineage construction with spatiotemporal context. Medical Image Analysis 12(5), 546–566 (2008) 9. Miura, K.: Tracking movement in cell biology. Microscopy Techniques, 267–295 (2005) 10. Tsechpenakis, G., Bianchi, L., Metaxas, D., Driscoll, M.: A novel computation approach for simultaneous tracking and feature extraction of c. elegans populations in fluid environments. IEEE Transactions on Biomedical Engineering 55(5), 1539–1549 (2008) 11. Khan, Z., Balch, T., Dellaert, F.: Mcmc-based particle filtering for tracking a variable number of interacting targets. TPAMI (2005) 12. Nillius, P., Sullivan, J., Carlsson, S.: Multi-target tracking - linking identities using bayesian network inference. In: CVPR (2006) 13. Yang, M., Yu, T., Wu, Y.: Game-theoretic multiple target tracking. In: ICCV (2007) 14. Betke, M., Hirsh, D., Bagchi, A., Hristov, N., Makris, N., Kunz, T.: Tracking large variable number of objects in clutter. In: CVPR (2007) 15. Lu, J., Fugal, J., Nordsiek, H., Saw, E., Shaw, R., Yang, W.: Lagrangian particle tracking in three dimensions via single-camera in-line digital holography. New J. Phys. 10 (2008) 16. Berg, H.: Random walks in biology. Princeton University Press, Princeton (1993) 17. Hoyle, D., Rattay, M.: Pca learning for sparse high-dimensional data. Europhysics Letters 62(1) (2003) 18. Wang, X., Grimson, E.: Trajectory analysis and semantic region modeling using a nonparametric bayesian model. In: CVPR (2008) 19. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3), 389–442 (2004) 20. Sbalzariniy, I., Theriot, J., Koumoutsakos, P.: Machine learning for biological trajectory classification applications. Center for Turbulence Research, 305–316 (2002) 21. Rabiner, L.: A tutorial on hidden markov models and selected applications in speech recognition. Proc. IEEE 77(2) (1989) 22. Chen, M., Kundu, A., Zhou, J.: Off-line handwritten word recognition using a hidden markov model type stochastic network. TPAMI 16 (1994) 23. Nefian, A., Hayes, M.H.: Hidden markov models for face recognition. In: ICASSP (1998) 24. Yamato, J., Ohya, J., Ishii, K.: Recognizing human action in time-sequential images using hidden markov model. In: CVPR (1992) 25. Brand, M., Kettnaker, V.: Discovery and segmentation of activities in video. TPAMI 22(8), 844–851 (2000) 26. 
Leal-Taixé, L., Heydt, M., Rosenhahn, A., Rosenhahn, B.: Automatic tracking of swimming microorganisms in 4d digital in-line holography data. In: IEEE WMVC (2009)
27. Leal-Taixé, L., Heydt, M., Weisse, S., Rosenhahn, A., Rosenhahn, B.: Classification of swimming microorganisms motion patterns in 4D digital in-line holography data. In: Goesele, M., Roth, S., Kuijper, A., Schiele, B., Schindler, K. (eds.) Pattern Recognition. LNCS, vol. 6376, pp. 283–292. Springer, Heidelberg (2010) 28. Gabor, D.: A new microscopic principle. Nature 161(8), 777 (1948) 29. Xu, W., Jericho, M., Meinertzhagen, I., Kreuzer, H.: Digital in-line holography for biological applications. Proc. Natl. Acad. Sci. U.S.A. 98(20), 11301–11305 (2001) 30. Raupach, S., Vossing, H., Curtius, J., Borrman, S.: Digital crossed-beam holography for in situ imaging of atmospheric particles. J. Opt. A: Pure Appl. Opt. 8, 796–806 (2006) 31. Fugal, J., Schulz, T., Shaw, R.: Practical methods for automated reconstruction and characterization of particles in digital in-line holograms. Meas. Sci. Technol. 20, 75501 (2009) 32. Heydt, M., Divós, P., Grunze, M., Rosenhahn, A.: Analysis of holographic microscopy data to quantitatively investigate three dimensional settlement dynamics of algal zoospores in the vicinity of surfaces. Eur. Phys. J. E: Soft Matter and Biological Physics (2009) 33. Garcia-Sucerquia, J., Xu, W., Jericho, S., Jericho, M.H., Tamblyn, I., Kreuzer, H.: Digital in-line holography: 4d imaging and tracking of microstructures and organisms in microfluidics and biology. In: Proc. SPIE, vol. 6026, pp. 267–275 (2006) 34. Lewis, N.I., Xu, W., Jericho, S., Kreuzer, H., Jericho, M., Cembella, A.: Swimming speed of three species of alexandrium (dinophyceae) as determined by digital in-line holography. Phycologia 45(1), 61–70 (2006) 35. Sheng, J., Malkiel, E., Katz, J., Adolf, J., Belas, R., Place, A.: Digital holographic microscopy reveals prey-induced changes in swimming behavior of predatory dinoflagellates. Proc. Natl. Acad. Sci. U.S.A. 104(44), 17512–17517 (2007) 36. Sheng, J., Malkiel, E., Katz, J., Adolf, J., Place, A.: A dinoflagellate exploits toxins to immobilize prey prior to ingestion. Proc. Natl. Acad. Sci. U.S.A. 107(5), 2082–2087 (2010) 37. Sun, H., Hendry, D., Player, M., Watson, J.: In situ underwater electronic holographic camera for studies of plankton. IEEE Journal of Oceanic Engineering 32(2), 373–382 (2007) 38. Lindeberg, T.: Scale-space theory in computer vision. Springer, Heidelberg (1994) 39. Kuhn, H.: The hungarian method for the assignment problem. Nav. Res. Logist. 2, 83–87 (1955) 40. Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society of Industrial and Applied Mathematics 5(1), 32–38 (1957) 41. Pilgrim, R.: Munkres' assignment algorithm; modified for rectangular matrices. Course Notes, Murray State University, http://csclab.murraystate.edu/bob.pilgrim/445/munkres.html 42. Masuda, N., Ito, T., Kayama, K., Kono, H., Satake, S., Kunugi, T., Sato, K.: Special purpose computer for digital holographic particle tracking velocimetry. Optics Express 14, 587–592 (2006) 43. Iken, K., Amsler, C., Greer, S., McClintock, J.: Qualitative and quantitative studies of the swimming behaviour of hincksia irregularis (phaeophyceae) spores: ecological implications and parameters for quantitative swimming assays. Phycologia 40, 359–366 (2001)
3D Reconstruction and Video-Based Rendering of Casually Captured Videos

Aparna Taneja1, Luca Ballan1, Jens Puwein1, Gabriel J. Brostow2, and Marc Pollefeys1

1 Computer Vision and Geometry Group, ETH Zurich, Switzerland
2 Department of Computer Science, University College London, UK
Abstract. In this chapter we explore the possibility of interactively navigating a collection of casually captured videos of a performance: real-world footage captured on hand held cameras by a few members of the audience. The aim is to navigate the video collection in 3D by generating video based renderings of the performance using an offline pre-computed reconstruction of the event. We propose two different techniques to obtain this reconstruction, considering that the video collection may have been recorded in complex, uncontrolled outdoor environments. One approach recovers the event geometry by exploring the temporal domain of each video independently, while the other explores the spatial domain of the video collection at each time instant independently. The pros and cons of the two methods and their applicability to the addressed navigation problem are also discussed. In the end, we propose an interactive GPU-accelerated viewing tool to navigate the video collection. Keywords: Video-Based Rendering, Dynamic scene reconstruction, Free Viewpoint Video, 3D Video, 3D reconstruction.
1 Introduction
Photo and video collections exist online with copious amounts of footage. Community-contributed photos of scenery can already be registered together offline, allowing for navigation of specific landmarks using a fast Image-Based Rendering (IBR) representation [1]. We propose that similar capabilities should exist for videos of performances or events, filmed by members of the audience, for instance, with hand held cameras or mobile phones. In particular, we want to give the user the ability to replay the event by seamlessly navigating around a performer, using the collection of videos. One can only make weak assumptions about this footage because the audience members doing the filming may have various video-recording devices, they could sit or move about far apart from each other, they may be indoors or outdoors, and they may have a partially obstructed view of the action. Due to all these possibilities, a full 3D reconstruction of the dynamic scene observed by the videos in the collection is very challenging. In this chapter we propose two different techniques to obtain such a reconstruction. Both these techniques first recover the geometry of the static elements
of the scene and then exploit this information to infer the geometry of the dynamic elements by exploring the video collection in either the temporal or the spatial domain. We also propose a full pipeline to generate a hybrid representation of the video collection which can be navigated using an interactive GPU-accelerated viewing tool. The chapter is organized as follows: Section 2 discusses the related work. Section 3 describes the offline procedure that aims to compute the hybrid representation of the video collection. In particular, Section 3.3.1 and Section 3.3.2 present the two methods to recover the geometry of the dynamic elements of the scene, and Section 3.3.3 draws a comparison between the two methods. Section 4 describes how the video collection can be navigated interactively. Section 5 discusses the experimental results and Section 6 draws the conclusions.
2 Related Work
Research in the area of image based rendering has culminated in the Photo Tourism work of [1,2] and the commercially supported online PhotoSynth community. One of their main contributions was the pivotal insight that instead of stitching many people’s disparate photos together into a panorama, it is possible and useful to compute a 3D point-cloud from the 2D features that the photos have in common. The point-cloud in turn serves as a scaffold and a non-photorealistic backdrop that provides a spatial context. While a “visitor” navigates the original photos, they see the point-cloud and hints of other photos in a way that reflects the real spatial layout of, for example, the Trevi Fountain. The recent work of [3] extends the view interpolation to scenarios with erroneous and incomplete 3D scene geometry. However all these approaches treat the environment as static and do not deal with moving foreground elements (e.g. people), excluding them from the visualization. On the contrary, dynamic environments with moving foregrounds were widely studied in both the video based rendering community and the surface capture community. Surface capture techniques have exploited the usage of silhouette [4,5,6,7,8], photo consistency [9,10,11], shadows and shading [12,13,14], and motion [15] to recover the geometries of the dynamic elements of a scene. Multimodal techniques have also been explored combining multiple information at the same time to make the reconstruction robust to inaccurate input data. Relevant examples are [16,17,18] which combine silhouette and photo consistency, [14] which combines silhouette, shadow and shading, and [19] which combines narrow baseline stereo with wide baseline stereo. Prior knowledge on the foreground objects of the scene has also been exploited. For instance [20,21,22,23,24] assume a prior on the possible shapes of the foreground objects, specifically, they are assumed to be human. Most of these works however, focus on indoor controlled environments where the cameras are typically static and both the lighting and the background controlled. Approaches that have dealt with outdoor uncontrolled scenarios have resorted to a large and a dense arrangement of cameras, as in [25,26], to priors on the scene as in [27,28,29], or to priors on the foreground objects [30].
In situations where the inter-camera baseline is small, some methods have been proposed to generate video based rendering content. As an example, [25] used narrow baseline stereo and spanned a total of 30◦ of viewing angle using a chain of eight cameras, and could tolerate 100 pixels disparity by focusing special computations on depth discontinuities. The method proposed in [31] demonstrates that under conditions of even 15◦ angular separation, it can be sufficient to model the whole scene with homography transformations of 2D superpixels whose correspondence is computed as an alternative to per-pixel optical flow. In a studio setting but starting with crude geometry of a performer, [32] shows how good optical flow can fix texture-assignment problems that occur where views of some geometry overlap. Previously, view interpolation based on epipolar constraints was demonstrated in [33], where correspondences were specified manually. However normally, these view interpolation algorithms rely heavily on correlation based stereo and nearby cameras. To deal with a casually captured video collection, we need to consider the fact that these videos may have been captured in outdoor environments where the background can be complex and its appearance may change over time, the cameras may move and their internal settings may change during the recording. No assumptions can be made on the arrangement or density of the cameras and also, on the dynamic structure of the environment.
3 Offline Processing
The aim of this stage is to synthesize a hybrid representation of the video collection that will subsequently be navigated. More precisely, the aim is to i) recover the geometry and the appearance of all the static elements of the scene, ii) calibrate the video collection spatially, temporally and photometrically, and, in the end, iii) recover an approximate representation of the shape of the dynamic elements of the scene.

3.1 Static Elements Reconstruction
A 3D reconstruction of all the static elements of the scene is necessary to: i) provide context while rendering transitions, ii) calibrate the camera poses for each video frame, and iii) refine each camera’s video matte. A variety of methods exist for static scene reconstruction [34,35,36,37,38,39,40]. Aside from photos and videos of a specific event, one could also use online photo collections of specific places to build dense 3D models [38]. For the sake of simplicity, we refer to the static elements of the scene as the background and to the dynamic elements as the foreground or the middleground. In particular, we refer to the foreground as the main object of interest in the footage (e.g., the main performer), while we refer to the middleground as the remaining dynamic elements of the scene. We follow the same Structure from Motion (SfM) strategy as in [1], matching SIFT features [41] between photos, estimating initial camera poses and 3D points, and refining the 3D solutions via bundle adjustment. We then proceed
Fig. 1. (a) Collection of images of the filming location. (b) Geometry of the static elements of the scene. (c) Textured Geometry.
Fig. 2. Refining the camera poses by minimizing the error between the actual video and the synthesized video
by computing a depth map for each photo using standard multi-view plane-sweep stereo (MVS) based on normalized cross-correlation [42]. The final polygonal surface mesh is generated using the robust range image fusion presented in [43]. A static texture for the background geometry is also extracted from the photos and baked on (see Figure 1), using a wavelet-based pyramidal fusion technique as presented in [44,45]. Since the background scene is fairly dynamic in places, much of that texture will be replaced during the interactive stage of the system, by sampling the view-dependent colors opportunistically from each camera's video.

3.2 Spatial, Temporal and Photometric Calibration
The camera poses for all the video frames are computed relative to the reconstructed background geometry. We refer to the real image seen by camera A at time t as $I_t^A$. The intrinsics $K_t^A \in \mathbb{R}^{3\times 3}$ and the extrinsics $E_t^A \in \mathbb{R}^{3\times 4}$ for each image $I_t^A$ are estimated as follows. First, the SIFT features found in images that had been used to reconstruct the background are searched for potential matches
to features found in each video frame. These matches generate correspondences between 2D points in the current frame and 3D points in the background geometry. The pose of that camera at that specific time is recovered by applying the Direct Linear Transform (DLT) [46] and a refinement step based on the reprojection error. This approach, however, does not guarantee that the poses of different cameras are recovered with the same 3D accuracy. Similar reprojection errors of sparse features, as measured in pixels, could indicate very different qualities of pose estimation, especially when depths and resolutions vary greatly. The key is to achieve a calibration that looks correct when the textured geometry is rendered in conjunction with the performer during the interaction stage, even if it is off by a few meters. Therefore, we perform a second optimization of the camera poses. We use particle filtering [47] to minimize the sum of squared differences between each $I_t^A$ and the image obtained by rendering the texture of the geometry at the current calibration estimate (see Figure 2). In this case, the texture is obtained as the median reprojected texture from a temporal window of 1000 frames of the same camera A (subsampled for efficiency). The video collection is synchronized, as proposed in [30], by performing correlation of the audio signals. We silence the quieter 90% of each video, align on the rest, and still need to manually timeshift about one in four videos. Sound travels slowly, so video-only synchronization [48,49] may be preferable despite being more costly computationally. To account for different settings in the cameras, like different exposure time, gain and white balancing, the video streams are also calibrated photometrically with respect to each other and with respect to the textured static geometry. More specifically, we used the method proposed in [50] to compute a color transfer function mapping the color space of one camera into the color space of another. This mapping is computed for all the pairs of cameras and also between each camera and the textured geometry.
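A rough prototype of the audio-based synchronization is a plain cross-correlation of the two (pre-resampled) audio tracks; the quantile-based silencing below only loosely mimics the described "silence the quieter 90%" step, and all parameter values are assumptions:

```python
import numpy as np
from scipy.signal import correlate

def estimate_offset(audio_a, audio_b, rate, keep_quantile=0.9):
    """Estimate the time shift (in seconds) of track a relative to track b.

    Samples below the keep_quantile amplitude are zeroed (keeping only the loudest
    10%); both tracks must already share the sample rate `rate`.
    """
    def loudest_only(x):
        x = x - np.mean(x)
        thr = np.quantile(np.abs(x), keep_quantile)
        return np.where(np.abs(x) >= thr, x, 0.0)

    a, b = loudest_only(np.asarray(audio_a, float)), loudest_only(np.asarray(audio_b, float))
    xc = correlate(a, b, mode="full")          # cross-correlation over all lags
    lag = np.argmax(xc) - (len(b) - 1)         # lag in samples; positive => a lags behind b
    return lag / float(rate)
```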
3.3 Dynamic Elements Reconstruction
As discussed before in the related work section, a lot of techniques have been developed to recover the shapes of dynamic elements in a scene, exploiting all the possible kinds of depth cues, like multi-view stereo and silhouettes, and their possible combinations. However, in the considered scenario of casually captured videos, only few assumptions can be made about the scene and the way it was recorded. As a consequence, algorithms like multi-view stereo are not applicable in general since they rely on a dense arrangement of cameras close to the object of interest, which might not always be the case. On the contrary, the usage of silhouette as a depth cue does not suffer from such constraints, but estimating silhouette in an uncontrolled outdoor environment is not trivial. In fact, this procedure relies on an accurate knowledge of the appearance of the background. This appearance is easy to infer when the cameras are static and a few frames representing the empty background are available.
However, in a casually captured outdoor scenario, the cameras may move and the background appearance may change over time due to changes in illumination, exposure or white balance compensations, and moving objects in the background. While variations in illumination and camera settings can be partially dealt with using the photometric calibration data, the changes in appearance due to moving objects remain a big issue. For instance, in the case of the Rothman sequence (Figure 3), the audience was standing throughout the event and was therefore modeled by the system as static geometry. While the motion of some people within the crowd did not influence the crowd geometry, it does influence the crowd/background appearance. In the past, some approaches have already explored segmentation in a moving-camera scenario, but mainly resorting to priors on the shape of the dynamic elements of the scene, as in [30,51,52], to priors on their appearance, as in [53], or to priors on their motions, as in [54]. We choose to use silhouette information for reconstruction and do not resort to any of these priors, but instead exploit the known information about the geometry of the static background. Unlike Video SnapCut [55], Video Cutout [56] and Background Cut [57], we prefer a more automatic segmentation technique which requires only a little user interaction at the cost of segmentation accuracy. This is motivated by the fact that, for a video collection where each video contains thousands of frames, the above-mentioned techniques would demand a lot of time and effort from the user. Our downstream rendering process is designed specifically to cope with our lower-quality segmentation. In the next sections, we propose two methods to recover the appearance of the background, namely the ones presented in [58] and [59]. The former estimates the background appearance for all the videos individually, looking back and forth in their time-line, while the latter recovers the background appearance by transferring information about the same time instant across different cameras. While the latter approach provides a full 3D volumetric representation of the dynamic elements of the scene, the former approach approximates this geometry as a set of billboards. Section 3.3.3 compares these two approaches and the corresponding representations for the purpose of video based rendering.

3.3.1 Reconstruction Using Temporal Information. In this approach, each dynamic element is represented as a set of billboards where each billboard corresponds to a specific camera. All the billboards related to a dynamic object are centered on the object's actual center and their normal is aligned perpendicular to the ground plane. Each billboard faces the related camera at each time instant and has a texture and transparency map provided by the corresponding camera (see Figure 10). While the texture is obtained by trivially projecting the corresponding video frame onto the billboard, the transparency map, implicitly representing the silhouette of the dynamic element, is inferred by exploring the temporal domain of each video, independently. This procedure is performed as follows.
Fig. 3. (a) One of the input frames of the Rothman sequence. (b) The obtained initial segmentation. (c) The mean of the obtained per-pixel color distribution of the background for that specific frame. (d) The final segmentation.
A user is first required to paint the pixels of two random images from each video with the binary labels Ω ∈ {1, 0}, to indicate foreground pixels that belong to the performer vs. background pixels that do not. With multiple videos, each lasting potentially thousands of frames, all subsequent segmentation is computed automatically, despite the obvious complications for our video based rendering approach. Even using a primitive paint program, the user effort does not exceed 10 min. per input video. The user-labeled training pixels define a foreground and a background color model. We simply use a k-nearest-neighbor classifier (k = 60) in RGB space, so the pixel-wise independent posterior probability is $k_\Omega / k$, amounting to the fraction of a pixel's color neighbors that had been labeled Ω. To compute a conservative foreground mask $\gamma_t^A$ efficiently, we store the class-conditional likelihood ratio of foreground to background in a discretized $256^3$ color-cube lookup table. The table usually takes 5 min. to compute, and each frame is then segmented in 2-3 sec., using 0.6 as the necessary distance ratio to label a pixel as foreground (see Figure 3). To get a conservative foreground mask, mean-shift tracking [60] was used to predict the area of the foreground pixels. Only pixels labeled as foreground and belonging to that area are considered as foreground objects. This decreases the number of false positive foreground pixels. The quality of this initial segmentation is, however, insufficient for our rendering purposes, as shown in Figure 3(b). To improve it, we use a new background color model, the same foreground color model as above, and graph cuts [61] to optimize the boundary. Each image $I_t$ is treated as a moving foreground $f_t$ over a changing background $b_t$ via the compositing equation $I_t = \alpha_t f_t + (1 - \alpha_t)\, b_t$, where $\alpha_t$ is the per-pixel alpha matte. With a binary initial segmentation $\gamma_t$ in hand, we now seek to estimate f, b and a refined α for each frame. A per-pixel color model for the background $b_t$ of each video frame is estimated first. Dilation of the initial segmentation $\gamma_t$ by 10 pixels gives a conservative background mask, removing the need for a manually specified traveling garbage matte. Knowing both the background geometry and the calibration parameters,
we can render the "empty" scene seen at time t from camera A using the colors from elsewhere in A's timeline (see Figure 3(c)). In one sense our approach is similar to that of [62], where a model of the background is generated and textured using the input video. Here, much like Chuang et al., we determine the probability distribution of $b_t$ by sampling from temporally proximate frames. Our algorithm collects samples of $b_t(m)$ for m's which are not labeled as foreground at time t, i.e., those where $\gamma_t(m) = 0$. Further samples are collected by searching backward and forward in time with increasing Δ, projecting the images $I_{t\pm\Delta}$, with their related $\gamma_{t\pm\Delta}$, onto the scene, according to A's calibrations. Once 10 samples for the same pixel m have been collected, a Gaussian is fitted to model $b_t(m)$, though we save the medians instead of the means. This procedure has been parallelized and runs with GPU acceleration. We first solve the compositing equation assuming α's are binary, leading to a trimap that is ready for further processing. Graph cuts are applied to maximize the conditional probability $P(\alpha_t \mid I_t)$, which is proportional to $P(I_t \mid \alpha_t)P(\alpha_t)$. Applying the logarithm and under the usual assumptions of conditional independence, $\log(P(\alpha_t))$ represents the binary potential, while $\log(P(I_t \mid \alpha_t))$ represents the unary potential. For each pixel m in $I_t$,

$P(I_t(m) \mid \alpha_t) = P(f_t)^{\alpha_t(m)} \, P(b_t(m))^{(1 - \alpha_t(m))}$    (1)
where $P(f_t)$ is the foreground color model estimated above and $P(b_t(m))$ is the aforementioned Gaussian distribution. Due to the inevitable presence of small calibration and background geometry errors, the projection of $I_{t\pm\Delta}$ can be off by some small local transformations. To account for this, $P(b_t(m))$ is actually taken as the maximum over all the pixels in a 5 × 5 neighborhood. The binary potential is formulated as the standard smoothness term, but modified to take into account both spatial and temporal gradients in the video. Once this discrete solution for $\alpha_t$ is found, a trimap is automatically generated by erosion and dilation (3 and 1 pixels, respectively). For all grey pixels in the trimap, we apply the matting technique proposed in [63]. An example result is shown in Figure 3(d). The presented segmentation procedure extracts both the foreground and the middleground elements of the scene. The mean-shift tracker is able to distinguish between the dynamic elements so that, during rendering, those elements are modeled as separate sets of billboards. When, instead, the 3D position of a middleground element cannot be triangulated, as happens when it appears in only one camera, our system makes it disappear before a transition starts and reappear as the transition concludes. This situation can be observed in the Magician and the Juggler sequences, when people stand in front of somebody's cameras.

3.3.2 Reconstruction Using Spatial Information. The method proposed in the previous section explores the temporal domain of the videos to segment the dynamic elements in each video independently. An alternative solution would be to exploit the spatial domain of these videos and retrieve the appearance of
Fig. 4. (Top row) Source images acquired respectively by camera 1, 2 and 3. (Middle row) Images Rtij computed by projecting the previous images into camera 2 (black pixels indicate missing color information, i.e., β = 0). (Bottom row) Difference images Dtij .
the background from the images provided by other cameras at the same time instant. In other words, we propose to jointly segment the dynamic elements by using the known 3D geometry of the background to transfer color information across multiple views. This will result in a full volumetric reconstruction of the dynamic elements of the scene. The proposed technique is explained in detail in the following paragraphs. Given some images captured at the same time instant t, we aim to project each image onto the other images and exploit their differences. Let i and j be two cameras given at time t and let $\pi_t^i$ be the projection function mapping 3D points in the world coordinate system to 2D points in the image coordinate system of camera i according to both the intrinsic and the extrinsic parameters. Since both the background geometry and the projection function $\pi_t^i$ are known, the depth map of the background geometry seen by camera i can be computed. Let us denote this depth map by $Z_t^i$. The value stored in each of its pixels represents the depth of the closest 3D point of the background geometry that projects to that pixel using $\pi_t^i$. In practice, $Z_t^i$ can be easily computed on the GPU by rendering the background geometry from the point of view of camera i and extracting the resulting Z-buffer. Let $R_t^{ij}$ denote the image obtained by projecting the image $I_t^j$ into camera i, i.e., the image obtained by rendering the background geometry from the point of view of camera i using the color information of camera j and taking into account the color transfer function between i and j. More formally, for each pixel p in $R_t^{ij}$, we know that $(\pi_t^i)^{-1}([p, Z_t^i(p)]^T)$ represents the coordinates of the closest 3D point in the background geometry projecting to p. Note that $(\pi_t^i)^{-1}$ is the
Fig. 5. Image formation process for a reprojection image Rtij . Since the scene element γ is not a part of the background geometry, it generates a ghost image on camera i which is far away from the region it should ideally project to if it were a part of the background geometry.
inverse of the projection function $\pi_t^i$ where the depth is assumed to be known and equal to $Z_t^i(p)$. Therefore, the coordinates of pixel p in image j are equal to

$\pi_t^j\big((\pi_t^i)^{-1}([p, Z_t^i(p)]^T)\big)$    (2)

In the end, the color of the pixel p in $R_t^{ij}$ is defined as follows:

$R_t^{ij}(p) = I_t^j\big(\pi_t^j\big((\pi_t^i)^{-1}([p, Z_t^i(p)]^T)\big)\big)$    (3)
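A CPU sketch of this reprojection step (Equations 2-3), assuming pinhole cameras given by K, R, t and a z-buffer depth map, is shown below; the photometric color transfer between the cameras is omitted, and the returned validity flags correspond to the mask β introduced next:

```python
import numpy as np

def reproject(img_j, K_i, R_i, t_i, P_j, Z_i):
    """Sketch of R^{ij}: render camera i's view with camera j's colors (Eqs. 2-3).

    img_j : (Hj, Wj, 3) frame of camera j
    K_i, R_i, t_i : intrinsics (3,3), rotation (3,3), translation (3,) of camera i,
                    assuming X_cam = R_i @ X_world + t_i
    P_j : (3, 4) projection matrix of camera j
    Z_i : (Hi, Wi) z-buffer depth of the background geometry seen by camera i
    Returns (R_ij, valid); `valid` plays the role of the binary mask beta^{ij}.
    """
    Hi, Wi = Z_i.shape
    u, v = np.meshgrid(np.arange(Wi), np.arange(Hi))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)

    # Back-project every pixel of camera i onto the background surface (world coordinates).
    X_cam = np.linalg.inv(K_i) @ pix * Z_i.reshape(1, -1)
    X_world = R_i.T @ (X_cam - t_i.reshape(3, 1))

    # Project those 3D points into camera j and sample its colors.
    x = P_j @ np.vstack([X_world, np.ones((1, X_world.shape[1]))])
    z = np.where(x[2] != 0, x[2], 1.0)           # avoid division by zero; such pixels are masked below
    uj = np.round(x[0] / z).astype(np.int64)
    vj = np.round(x[1] / z).astype(np.int64)
    valid = (x[2] > 0) & (uj >= 0) & (uj < img_j.shape[1]) \
            & (vj >= 0) & (vj < img_j.shape[0]) & (Z_i.reshape(-1) > 0)

    R_ij = np.zeros((Hi * Wi, 3), dtype=img_j.dtype)
    R_ij[valid] = img_j[vj[valid], uj[valid]]
    return R_ij.reshape(Hi, Wi, 3), valid.reshape(Hi, Wi)
```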
Let us note that no color information can be retrieved for pixels of $R_t^{ij}$ that map outside the field of view of camera j, and also for those which have no depth information in $Z_t^i$, e.g., those projecting onto regions not modeled by the background geometry. We keep track of such pixels by defining a binary mask $\beta_t^{ij}$ such that $\beta_t^{ij}(p) = 0$ indicates the absence of color information at pixel p in $R_t^{ij}$. The procedure of computing $R_t^{ij}$ is performed on the GPU using shaders. Figure 4 shows some example images $R_t^{ij}$ obtained by projecting the images captured by three different cameras, namely #1, #2 and #3, into camera #2. The reader can notice that, when the background geometry matches the current scene geometry, the captured image $I_t^i$ and the image $R_t^{ij}$ look alike in all the pixels with $\beta_t^{ij}(p)$ equal to one. On the contrary, in the presence of a dynamic object which was not present in the background geometry, this object gets projected onto the background points behind it. This reprojection is referred to as the ghost of the foreground object in the image $R_t^{ij}$. Figure 5 explains this concept visually. In Figure 4, the ghost of the juggler can be observed in both images $R_t^{21}$ and $R_t^{23}$, while it is not visible in $R_t^{22}$ since the image is projected onto itself. Let us call $D_t^{ij}$ the image obtained by a per-pixel comparison between the image $I_t^i$ and $R_t^{ij}$. In order to make our comparison method robust to errors that may be
Fig. 6. Results obtained by applying different color similarity measures to compare the two images Itj and Rtij in order to build the image Dtij. (a) Result obtained by applying Equation 4. (b) Result obtained by applying Equation 4 with Itj and Rtij swapped. (c) Result obtained by applying Equation 5.
present in either the calibration or the background geometry, the similarity measure used to compare these two images accounts for local affine transformations in the image space. We propose to compute $D_t^{ij}$ as

$D_t^{ij}(p) = \min_{q \in W_p} \| I_t^i(p) - R_t^{ij}(q) \|$    (4)

where $W_p$ is a window around p and $\|\cdot\|$ is the L1 norm in the RGB color space. This similarity measure proved to be more robust but, unfortunately, some details around the ghost borders are lost. This can be seen in Figure 6(a), where the ghost of the foreground object gets shrunk by half the window size used. In order to avoid these artifacts, the same approach is repeated by comparing, this time, the pixel p in $R_t^{ij}$ to a corresponding window $W_p$ in $I_t^i$. A result obtained by using this second approach is shown in Figure 6(b) where, this time, the silhouette of the foreground object gets shrunk by half the window size. In the end we chose to use the following metric, which combines the advantages of both the previous metrics:

$D_t^{ij}(p) = \max\Big( \min_{q \in W_p} \| I_t^i(p) - R_t^{ij}(q) \|,\; \min_{q \in W_p} \| R_t^{ij}(p) - I_t^i(q) \| \Big)$    (5)
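A direct NumPy version of this window-tolerant, symmetric comparison (Equations 4-5) can sweep the window offsets explicitly; the window radius is an assumed parameter and border wrap-around is ignored in this sketch:

```python
import numpy as np

def difference_image(I_i, R_ij, radius=2):
    """Compute the per-pixel difference image D^{ij} of Eq. (5).

    I_i, R_ij : aligned (H, W, 3) arrays (camera i's frame and the reprojection
                of camera j into camera i)
    radius    : half-size of the search window W_p (an assumed value)
    Pixels without color information (beta == 0) are not treated specially here;
    the paper handles them later in the probabilistic model.
    """
    I_i = I_i.astype(float)
    R_ij = R_ij.astype(float)

    def min_l1_over_window(ref, cand):
        # min over q in W_p of the L1 color distance ||ref(p) - cand(q)||_1
        best = np.full(ref.shape[:2], np.inf)
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                shifted = np.roll(cand, (dy, dx), axis=(0, 1))  # cand evaluated at p - (dy, dx)
                best = np.minimum(best, np.abs(ref - shifted).sum(axis=2))
        return best

    d_fwd = min_l1_over_window(I_i, R_ij)    # Eq. (4)
    d_bwd = min_l1_over_window(R_ij, I_i)    # Eq. (4) with the roles swapped
    return np.maximum(d_fwd, d_bwd)          # Eq. (5)
```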
A result obtained by applying this new metric can be seen in Figure 6(c). Given the input images $I_t^i$, all the possible images $D_t^{ij}$ for each i > j are computed. This leads to a set of $(n^2 - n)/2$ difference images $D_t^{ij}$ that we will refer to as D. The problem of recovering the 3D geometry of the foreground object is formulated in a probabilistic way using the computed set of images D as observation. The scene to be reconstructed is discretized as a voxel grid. Let G be the random vector representing the occupancy state of all the voxels inside this grid, where $G_t^k = 1$ indicates that the voxel k is full (and empty otherwise). The aim is to find a labeling $L^*$ for G which maximizes the posterior probability $P(G = L \mid D)$, i.e.,

$L^* = \arg\max_L P(G = L \mid D)$    (6)
By Bayes' rule, this is equivalent to

$L^* = \arg\max_L \big( \log P(D \mid G = L) + \log P(G = L) \big)$    (7)
We first describe how the probability $P(D \mid G = L)$ is computed for a given labeling of the voxel grid, while $P(G = L)$ is described later. Let $\phi_t^i(k)$ denote the footprint of the voxel k in camera i, i.e., the projection of all the 3D points belonging to k onto the image plane of camera i. Furthermore, denote with $\chi_t^{ij}(k)$ the set of the ghost pixels of voxel k in the image $R_t^{ij}$. Since these pixels are the ones corresponding to the background geometry points occluded by the foreground object in camera j, i.e., $(\pi_t^j)^{-1}([\phi_t^j(k), Z_t^j(\phi_t^j(k))]^T)$, $\chi_t^{ij}(k)$ can be computed as follows:

$\chi_t^{ij}(k) = \pi_t^i\big((\pi_t^j)^{-1}([\phi_t^j(k), Z_t^j(\phi_t^j(k))]^T)\big)$    (8)
i.e., by projecting those background points into camera i (see Figure 5). We make three conditional independence assumptions for computing the probability $P(D \mid G = L)$: first, the states of the voxels are assumed to be conditionally independent; second, the image formation process is assumed to be independent for all images; and third, the color of a pixel in an image is independent of the others. Using these assumptions, the probability $P(D \mid G = L)$ can be expressed as

$P(D \mid G = L) = \prod_k P(D \mid G_t^k = L_t^k)$    (9)

where

$P(D \mid G_t^k) = \prod_{i,j,p} P(D_t^{ij}(p) \mid G_t^k), \quad \forall p \in \phi_t^i(k) \cup \chi_t^{ij}(k)$    (10)
Let us now introduce another random variable $C_t^{ij}$ representing the consensus between the pixels in image $I_t^i$ and the ones in image $R_t^{ij}$. $C_t^{ij}(p) = 1$ indicates that the color information at pixel p in $I_t^i$ agrees with the color information at p in $R_t^{ij}$. Clearly, this variable strongly depends on the image $D_t^{ij}$. Specifically, $P(D_t^{ij}(p) \mid G_t^k)$ is modeled using a formulation similar to the one proposed by Franco and Boyer in [6], i.e.,

$P(D_t^{ij}(p) \mid G_t^k) = P(D_t^{ij}(p) \mid C_t^{ij}(p) = 1)\, P(C_t^{ij}(p) = 1 \mid G_t^k) + P(D_t^{ij}(p) \mid C_t^{ij}(p) = 0)\, P(C_t^{ij}(p) = 0 \mid G_t^k)$    (11)
While in their work they used background images to determine $P(D_t^{ij}(p) \mid C_t^{ij}(p))$, we assume the following: in case of consensus ($C_t^{ij}(p) = 1$) the probability of $D_t^{ij}(p)$ being high is low and vice versa. Therefore $P(D_t^{ij}(p) \mid C_t^{ij}(p) = 1)$ is chosen to be a Gaussian distribution centered at zero and truncated for values less than zero. Concerning the pixels with no color information, i.e., the ones with $\beta_t^{ij}(p) = 0$, we assume this probability to be uniform. Therefore,

$P(D_t^{ij}(p) \mid C_t^{ij}(p) = 1) = \begin{cases} \kappa \, e^{-\frac{(D_t^{ij}(p))^2}{2\sigma^2}} & \beta_t^{ij}(p) = 1 \\ U & \beta_t^{ij}(p) = 0 \end{cases}$    (12)
where κ is the normalization factor for the Gaussian distribution, and U the uniform distribution. On the contrary, when there is no consensus ($C_t^{ij}(p) = 0$) no information can be stated for $D_t^{ij}(p)$ and therefore $P(D_t^{ij}(p) \mid C_t^{ij}(p) = 0)$ is set to the uniform distribution. $P(C_t^{ij}(p) = 1 \mid G_t^k)$ and $P(C_t^{ij}(p) = 0 \mid G_t^k)$ are defined in a similar way as in [6], but while in their formulation the state of the voxel k is influenced only by the background state of the pixels in $\phi_t^i(k)$, in our formulation its state is also influenced by the pixels in $\chi_t^{ij}(k)$. While this property adds additional dependence between the voxels, it provides more information on the state of each voxel. In fact, we not only rely on the consensus observed in the voxel's footprint $\phi_t^i(k)$ but also on the consensus observed in $\chi_t^{ij}(k)$. This allows us to recover from two kinds of situations, namely: when the colors of the foreground object are similar to the colors of the actual background points behind it, and when the information corresponding to the foreground object in the image $R_t^{ij}$ is missing. However, our approach will not help if the colors of the actual background points in $\chi_t^{ij}(k)$ are also similar to the colors of the foreground element. Concerning $P(G = L)$, we assume dependency only between neighboring voxels (26-neighborhood). In this way, Equation 7 can be entirely solved using graph cuts [64,65,61]. More precisely, the pairwise potential $\log(P(G_t^a = L_t^a, G_t^b = L_t^b))$ between two neighboring voxels a and b is defined considering that, if these voxels project to pixels lying on edges of the original images $I_t^i$, there should be a low cost for cutting across these voxels, and vice versa. To account for this, in our implementation we compute the projection of the centers of each pair of neighboring voxels a and b on each image $I_t^i$. Subsequently we check all the pixels on the line connecting these two projections looking for an edge. If an edge is not found then the pairwise potential is increased. To account for temporal continuity in the final mesh, the voxel state prior takes into account the labeling computed in the previous frame according to $P(G_t^a = 1) = 0.3 + \xi((L^*)_{t-1}^a)$, where ξ defines the temporal smoothness. Once graph cuts provide a grid labeling $L^*$ as a solution for Equation 7, marching cubes [66] can be applied to obtain a continuous mesh of the dynamic object.

3.3.3 Comparison. In the previous sections we presented two techniques to segment the dynamic elements of a scene. These two methods differ from each other in the way they infer the appearance of the background. The former searches for this information over time, looking back and forth inside a time window of a single video. The latter, instead, recovers the background appearance by transferring information about the same time instant across different cameras. While the former provides pixel-level segmentation accuracy, the latter is limited by the voxel resolution, which cannot be pushed over a certain limit without considering calibration errors. However, since the latter approach models the background appearance independently for each time instant, it is more robust to abrupt changes in the background appearance. On the contrary, the
Fig. 7. (Left and Right) Example of segmentation obtained on the Rothman sequence using the method presented in Section 3.3.1. (Center) Reconstruction obtained using a deterministic visual hull on these segmentations. It is evident that segmentation errors in even a single image significantly influence the reconstruction.
former approach would have serious problems in scenarios, such as a concert, where moving spotlights can change the background appearance quickly. Concerning the shape representations proposed by the two methods, what we can observe is that a billboard can only make a simple planar approximation of the shape of a 3D object. Perspective artifacts will appear if the observer’s viewing angle is larger than 10◦ . Merging multiple billboards together tends to approximate the actual geometry of the object, hence reducing the visible artifacts significantly. However, using the temporal domain method of Section 3.3.1, this would require a lot of cameras viewing the same object. An accurate volumetric or mesh representation would also provide a nice visualization of an object. However, extracting these representations using shape from silhouette approaches involves the fusion of information coming from multiple cameras. This fusion is sensitive to the inevitable errors inherent in the calibration and the segmentation. Since one cannot assume that the obtained segmentations would have the same accuracy, this becomes a significant issue in our scenario. For instance, an error in the segmentation or calibration in even a single camera can corrupt the entire reconstruction (e.g., see Figure 7). Probabilistic approaches, like the one proposed in Section 3.3.2, can deal with such issues since its formulation is more robust to misleading information. However the usage of this technique results in visible quantization artifacts since the scene has to be discretized into voxels whose size is bound by the available information. For the purpose of video based rendering, we chose to use the billboards representation since the quantization was a big issue. The following sections describe how the visual artifacts generated by the planar approximations, introduced by this representation, can be minimized.
4 Online Navigation
Once a hybrid representation has been computed, the video collection can be navigated. In the following sections, we present our online navigation tool which allows a user to interactively explore the event from multiple viewpoints.
Fig. 8. Interactive Navigation Interface: Regular Mode (left) is a live preview of the content being rendered to the final output video. Orbit Mode (right) has the same functionality, but also depicts the scenery, performer, and moving cameras. Users can switch between the two modes, and always have jog/shuttle control over the timeline of the input footage.
4.1 User-Interface
The largely GPU-driven user interface of the system lets the user smoothly navigate the video collection in both space and time. The GUI can be operated in two different modes, Regular Mode and Orbit Mode, where the same jog/shuttle and camera-transition commands are available by keyboard or mouse at all times. Those commands can be recorded and used as an edit list for more elaborate postprocessing of an output video. The Regular Mode is essentially a rendering of the event from either a real camera's perspective, or the virtual camera's transition when the user clicks on the navigation arrows (see Figure 8). The navigator icon in the lower right corner of the interface indicates the possible directions the user can go (up, down, left, right, forward and backward, depending on the availability of nearby videos). Each camera's neighbors are determined relative to its image plane. Orbit Mode has a live preview window to the side, and serves primarily as a digital production control room, where the scenery, performer, and all the moving cameras are depicted as elements of a dynamic 3D world. Orbit Mode also has a video wall option where inset views of each camera are fixed in place on the screen, but some users preferred these individual videos played as moving screens inside the scene.
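The text does not spell out how the up/down/left/right/forward/backward assignment is made beyond being "relative to its image plane". The following sketch illustrates one plausible rule, projecting the other camera centers into the reference camera's coordinate frame and binning them by dominant axis; the function name and the binning rule are assumptions, not the authors' implementation.

```python
import numpy as np

def camera_neighbors(R_ref, C_ref, centers):
    """Assign each other camera a rough direction (left/right, up/down,
    forward/backward) in the reference camera's frame.  R_ref is the 3x3
    world-to-camera rotation, C_ref the camera center, and `centers` maps
    camera names to 3D centers.  Assumes the usual x-right, y-down,
    z-forward camera convention."""
    labels = [("left", "right"), ("up", "down"), ("backward", "forward")]
    neighbors = {}
    for name, C in centers.items():
        d = R_ref @ (np.asarray(C) - np.asarray(C_ref))  # offset in camera frame
        axis = int(np.argmax(np.abs(d)))                 # dominant axis
        neighbors[name] = labels[axis][int(d[axis] > 0)]
    return neighbors
```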
4.2 Video-Based Rendering
The user always watches the scene from the point of view of the virtual camera V. The virtual camera's intrinsic parameters K^V are assumed to be fixed and equal to those of one of the cameras recording the scene. Its extrinsic parameters E_t^V are always locked to one of the cameras of the collection and are unlocked only during a transition from one camera to another. When a camera change is requested, the virtual camera V performs the view interpolation from a starting camera A to an ending camera B over a period of time [t0, t1], with E_{t0}^V = E_{t0}^A at the start of the transition and E_{t1}^V = E_{t1}^B by the end.
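The interpolation actually used in the chapter, described next and detailed in [58], constrains the image of the actor's barycenter to move linearly between the two views. As a simpler, generic stand-in, the sketch below slerps the rotation and lerps the camera center to obtain intermediate extrinsics; the function name and parametrization are illustrative only.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_extrinsics(R_A, C_A, R_B, C_B, s):
    """Naive pose interpolation between cameras A and B for s in [0, 1]:
    spherical interpolation of the rotations, linear interpolation of the
    camera centers.  Returns the 3x4 extrinsic matrix [R | t] with t = -R C."""
    slerp = Slerp([0.0, 1.0], Rotation.from_matrix(np.stack([R_A, R_B])))
    R_s = slerp([s]).as_matrix()[0]
    C_s = (1.0 - s) * np.asarray(C_A) + s * np.asarray(C_B)
    t_s = -R_s @ C_s
    return np.hstack([R_s, t_s[:, None]])
```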
In order to compute the intermediate rotation and translation matrices for camera V, we use a camera interpolation technique aimed at maintaining a constant and linear translation of the image of the actor. Formally, given the 3D point representing the barycenter of the actor, we force the image of this point in the virtual camera, at any time t ∈ [t0, t1], to be the exact linear interpolation between what it was at time t0 (in view A) and what it will be at t1 (in view B). For more details, please refer to [58]. As mentioned in Section 3.3, the shape of a dynamic element of the scene can either be represented using a set of billboards or using a volumetric representation. In the next sections, we will explain how the scene can be rendered using both representations.

4.2.1 Volumetric Representation. A volumetric representation of the dynamic elements of the scene can be computed for each time instant t as described in Section 3.3.2. Once a transition is requested, the dynamic elements are rendered as part of the background geometry using an approach similar to the Unstructured Lumigraph [67]. A detailed description of this procedure is presented in Section 4.2.3. Figure 9 shows a reconstruction result obtained on the Juggler sequence. The voxel grid used was of resolution 140 × 140 × 140, covering the
Fig. 9. Results obtained on one frame of the juggler sequence. (a) Volumetric reconstruction. (b) One frame of the videos used for the reconstruction. (c) Reconstructed volume projected back to the previous image. (d) Reconstruction rendered from another viewpoint.
entire extent of the scene where the action took place. Figures 9(a) and 9(d) show one reconstructed frame in which two people, one in the foreground and one in the middle ground, are present in the scene. While the volumetric reconstruction is robust enough to deal with segmentation inaccuracies and to recover the small balls being juggled by the performer, the quantization introduced by the voxelization makes this representation inappropriate for video-based rendering purposes, since the resolution of the voxel grid cannot be increased indefinitely without considering calibration and background geometry errors.

4.2.2 Billboard Representation and Transition Optimization. In this case, when a transition is requested from camera A to camera B, each dynamic element of the scene is modeled by the proxy shape of the two billboards ζ^A and ζ^B related to cameras A and B, respectively (see Figure 10). Each billboard approximates the actor's geometry using a planar proxy, and therefore it can introduce significant artifacts while one navigates between cameras. However, billboards can actually be quite effective, as long as we use them in tandem with a good measure of the expected visual disturbance. Ideally, while V is traveling along its path between A and B, V would cross-fade imperceptibly from rendering mostly the billboard ζ^A to showing mostly ζ^B. We have observed that a well-placed billboard is a convincing enough proxy shape for viewing-angle changes of around 10°, but the illusion can quickly be lost when the second billboard comes into view. The enhanced Cross Dissolve presented in [68] could help, but we have found that, if timed correctly, a cut from one billboard to the next can be almost unnoticeable. [69] made a similar observation. Preferably, the user confuses the sharper appearance change-over with the performer's natural ongoing motions. The best time for the appearance change-over is when the action is at its most fronto-parallel to the two cameras. Choosing
Fig. 10. As the virtual camera transitions from view A to view B, the foreground object is represented by two video sprites on planar billboards, one for each view. The video footage from each camera is rendered onto the respective billboard with the segmentation mask applied.
a bad time will reveal the actor's current 3D shape as non-planar. We will explain later the simple strategy that finds the best change-over time, but first we introduce the error measure to be optimized, namely the Inter-Billboard Distance. The Inter-Billboard Distance at time t for camera V is computed using the following procedure. Each billboard, ζ^A and ζ^B, is first rendered separately from the viewpoint of the virtual camera V at time t, using the masks α_t^A and α_t^B as textures. Those two images are then thresholded, producing two silhouette images, S1 and S2. Overlaying S2 on S1, as seen in camera V in Figure 10, allows one to evaluate how much change a user would perceive if the two billboards were suddenly swapped during the transition. The more these two images agree, the less perceptible the change is. Mathematically, the distance measure used is

    D(S1, S2) = (1/#S1) Σ_{m ∈ S1} d(m, S2) + (1/#S2) Σ_{m ∈ S2} d(m, S1),        (13)

where m represents a pixel inside the silhouette, and d(m, S) is the ℓ2-distance between this point and a silhouette S. #S1 and #S2 represent the number of points in S1 and S2, respectively. This error can be quickly computed in a fragment shader using the distance transform [70]. We also tried a correlation-based distance, but found it less effective at matching the perceptual differences observed by the user. In fact, changes of appearance within the silhouette that occur during a change-over are often perceptually confused with subject motion. The Inter-Billboard Distance (Equation 13) largely dictates the right moment to switch billboards.
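For reference, the same measure can be computed offline with a CPU distance transform (the chapter evaluates it on the GPU using the jump-flooding distance transform of [70]). The sketch below assumes S1 and S2 are boolean silhouette images and that both are non-empty.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def inter_billboard_distance(S1, S2):
    """Symmetric mean silhouette distance of Eq. (13).
    distance_transform_edt(~S) yields, at every pixel, the Euclidean
    distance to the nearest pixel belonging to S."""
    d_to_S2 = distance_transform_edt(~S2)
    d_to_S1 = distance_transform_edt(~S1)
    return d_to_S2[S1].mean() + d_to_S1[S2].mean()
```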
Fig. 11. (Top) Background rendered from left, right, and view-independent texture, (center) corresponding suitability maps, (bottom) final rendered background and generated motion blurred background
We have found that also including the start time as a parameter can lead to a much better optimum. Thus we optimize over two variables: ρ, which is the fraction of the transition interval at which the billboard transition occurs, and Δ, which is the transition delay, the time between the user's request and its actual start. This search is similar in spirit to the approach [71] proposed for combining motion capture data. The start time is delayed by no more than 3 sec., and the transition time was set to 1.5 sec. Since the search domain is limited and known, a fast grid search optimizes both parameters in a separate thread to preserve real-time playback. Once the user requests a transition, the exact timing is optimized as described, such that the transition is performed to minimize disturbing visual artifacts.

4.2.3 Rendering the Virtual Camera. During normal playback of a video in Regular Mode, the virtual camera position is locked to the real camera's extrinsics. Depending on camera A's intrinsic parameters K_t^A, the original video is played at a different size in relation to the virtual camera intrinsics K^V. Black borders are added to the video if its size is smaller than that of the virtual camera. This happens, for instance, when one camera has landscape orientation while the other is in portrait mode, or if the zooms are different. While it is possible to adapt the intrinsic camera parameters of the virtual camera to those of the real one, that can create perceptually undesirable effects (i.e. the Vertigo effect). During the first 20% of the transition, the virtual camera remains locked to the original viewpoint, but the scene rendering fades from the original video to the synthetically rendered scene (at which point the black borders disappear). Then the virtual camera starts moving along the computed transition path while the video is still playing. Like the start of the transition, the virtual camera is locked to the target camera position for the last 20% of the transition, when the video of the target camera fades in. During the entire transition, the video is rendered using the color space of the original camera. This is done by using precomputed 3 × 3 color transformations, approximately mapping the appearance between videos, and also from the view-independent texture to the videos. Only during the last 20% is the appearance gradually transformed to that of the target video. Next, the middle of the synthetically rendered video transition is created. Although a very large amount of footage is available for rendering, a real-time rendering application must take bandwidth and other system hardware limitations into account. Using all the available videos, masks, and background videos simultaneously would require far too many resources to render the scene interactively. To render a transition from camera A to camera B, we chose to load and use only data extracted from videos A and B, and the static information of the scene. These two cameras are normally also the closest to the virtual camera path, and the benefits of using more videos are often limited. This tradeoff is similar to the one made for IBR of static scenes by [72], where at most three views were used to texture each scene element. We adapt the Unstructured Lumigraph Rendering framework [67] to cope with the fact that some parts of the background scene are occluded by the foreground and that we can only afford to use two videos. At each time t, the images I_t^A and
I_t^B are used to color the geometry of the background scene, as in [67]. The generated α-masks, α_t^A and α_t^B, are used to mask the foreground pixel elements of I_t^A and I_t^B, respectively. Three images of the scene from the point of view V are generated: the first one uses only the color information from I_t^A, the second uses colors from I_t^B, while the last one uses the view-independent texture extracted in the pre-processing stage. The view-independent texture is necessary because, on the path between A and B, the virtual camera can see parts of the scene that are hidden in both A and B. For each generated image, a per-pixel suitability mask is generated in parallel, taking into account the α-masks (i.e. whether a pixel is background or not), occlusions, and viewing angles. Occlusions are handled by rendering the depth maps of both I_t^A and I_t^B. We use the angle differences with respect to the surface normals to weight each pixel from the two sources. This is important in the presence of miscalibrations and geometry errors. The suitability mask of the image generated using the static texture is given a constant low value so that its colors are used only where neither of the other images can provide useful information. After the suitability has been computed, a dilation/erosion and smoothing filter is applied to ensure a spatially smooth transition between the textures during blending, and to account for discontinuities and blobs that can appear due to occlusion handling and matting errors. The entire procedure is implemented on the GPU using a 3-pass rendering. As a final step, a motion blur filter is applied to all the pixels belonging to the background scene, which makes the foreground object stand out. This is a user-controllable option in the software, and we found that, when enabled, the user's attention is focused on the performer, i.e. the center of the action, while the motion blur gives peripheral cues about the direction and the speed of the transition. The whole background rendering approach is illustrated in Figure 11. The foreground elements of the scene are then rendered using a similar technique with the images I_t^A and I_t^B, and the appropriate alpha masks α_t^A and α_t^B. In the case of billboards, only the two billboards ζ^A and ζ^B are rendered for each dynamic element, and the transition from ζ^A to ζ^B is decided as described in Section 3.3.1.
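A rough CPU sketch of the final blending step is given below, assuming the two view-dependent renderings, the view-independent texture rendering, and their angle/occlusion-based suitability maps are already available. The constant low suitability for the static texture and the smoothing amount are illustrative values, not taken from the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blend_background(img_A, img_B, img_static, suit_A, suit_B,
                     static_suit=0.05, smooth_sigma=2.0):
    """Blend three candidate renderings (H x W x 3) with per-pixel
    suitability maps (H x W); the Gaussian smoothing stands in for the
    dilation/erosion and smoothing filter applied on the GPU."""
    w_static = np.full_like(suit_A, static_suit)
    weights = [gaussian_filter(w, smooth_sigma)
               for w in (suit_A, suit_B, w_static)]
    total = np.clip(sum(weights), 1e-6, None)
    blended = sum(w[..., None] * img
                  for w, img in zip(weights, (img_A, img_B, img_static)))
    return blended / total[..., None]
```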
5 Experiments
There are many casually filmed events, but multi-view footage that is public-domain is so far readily available only when citizen journalists are provided with a specified portal for submissions, e.g. after a U2 concert. We obtained the Climber and Dancer datasets from [30] and INRIA Grenoble Rhone-Alpes, respectively, and the Juggler, Magician, and Rothman data by attending real events, handing out cameras to members of the public with instructions to play with the settings, and, where needed, obtaining signatures allowing for use and dissemination of the footage. These events were chosen because, together, they explore a variety of challenges in terms of inter-camera distance, large out-of-plane motion, fast performances of skill, complicated outdoor and indoor lighting conditions, and intrusive objects in the field of view. We performed the manual
part of the process ourselves, labeling the performer in two frames per video, and locating ∼ 40 12MP photos of each new environment. The Climber videos are 720 × 544, the Dancer videos are 780 × 582, and the new footage measures 960×544 pixels, with people filming in landscape or portrait mode (or switching), with different settings for zoom, automatic gain, and white balance. Some people adjusted these manually at times. Naturally, the results of this interactive system are best evaluated in video, so please see our web site [73], where a fully working prototype of our video based rendering player is also downloadable. Among the videos available on our web site, several videos demonstrate specific stages of the algorithm such as rendering-for-matting, and several videos show events produced by volunteer test-subjects. Similar colors on the performer and the background are inevitable, which our initial segmentation confirms repeatedly. Even drastically increasing the amount of training data had no effect. The γ masks are frequently exaggerated in size, but that being only an intermediate stage, simply meant that the adaptive scene renderer had to seek further out in the timeline to obtain enough samples. With our new form of background subtraction, even significant imperfections in the reconstructed scene geometry did not hinder us from pulling a useful matte, probably because those imperfections coincided with textureless areas. The bigger segmentation problems occur when the subject exhibits significant motion blur, because mixed pixels can match the rendered background quite well. Clutter in the scene is caused by both objects that change and people who move around the performer. For the juggler sequence, there are sufficient moving cameras to obtain a reconstruction of all the dynamic elements. However, while scenes like Magician and Rothman have enough cameras in positions to triangulate billboards (of the performer and the clutter), their coverage is sparse and their calibrations and segmentations are off by too much to yield acceptable 3D shapes. We also experimented with computing heightfields to augment our billboards, but without structured lights like those of [20], the results were disappointing. These findings seem consistent with [74]. The modicum of clutter in the scenes we tested was handled with relatively few artifacts because elements that were rejected from the background model either ended up as middleground billboards due to their 3D separation from the performer, or when incorrectly merged with the performer in one view, were deemed too costly by the Transition Optimization. The current prototype is real-time, running at 25 fps on an Intel i7 2.93Ghz Quad-core with 8GB of memory, an nVidia GTX280 GPU, and a normal 7200rpm hard drive. Even events filmed with at least six cameras can be explored without impacting performance, because videos are streamed locally, and can be subsampled if HD footage were available. The information necessary for the next frames is preloaded by a separate thread to allow undisturbed real-time rendering. Each scene’s geometry takes ∼1hr to reconstruct and is automatic except that part way through, the pipeline presented in [43] requires the user to designate a
Fig. 12. Examples of different transitions for the Juggler and the Rothman sequence. In the Juggler sequence, the three consecutive frames span the best changeover (i.e. switch between billboards) found within a given timeframe. The optimization is successful if this changeover is hard to perceive. The background can be motion blurred or not.
bounding box for the volume reconstruction, and afterwards fit a plane for the ground. After this user effort, the automatic processing takes multiple hours. In the left column of Figure 12, three of our many example transitions are shown between different cameras for the Juggler sequence. The right column shows an example of transition from the Rothman sequence.
6 Conclusions
In this chapter we explored the possibility of interactively navigating a casually captured video collection of a performance. We proposed an offline pre-processing stage to recover the geometry of all the static elements of the scene and to calibrate the video collection spatially, temporally and photometrically relative to this geometry.
We proposed two different approaches to obtain the reconstruction of the dynamic elements of the scene, both exploiting the previously computed information. These methods differ from each other in the way they infer the appearance of the static geometry. The first method searches for this information over time, looking back and forth inside a time window of each video independently. The second, instead, recovers the background appearance by transferring color information about the same time instant across different videos. While the first method provides pixel-level segmentation accuracy, the second is limited by the voxel resolution, which is bound to be more than one pixel. Since the second approach models the background appearance independently for each time instant, the final segmentation is more robust to abrupt changes in the background appearance. For the purpose of interactively navigating a video collection, we chose to use the first method to represent the shape of the dynamic elements of the scene, since it offers a better quality of visualization during the rendering process. To overcome the limitations introduced by the billboard representation during the rendering process, we presented an optimization technique aimed at reducing the visual artifacts. In the future, the two methods can be combined to benefit from the strengths of each depending on the scenario. In summary, we presented an approach to convert a collection of hand-held videos into a digital performance that can easily be navigated. We invite the reader to try the visualization tool available on our web site [73].

Acknowledgements and Credits. The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement #210806, and from the Packard Foundation. We are grateful for the support of the SNF R'Equip program. Some parts of this chapter contain a revised and readapted version of two of our previously published works, namely [58] and [59]. In particular, Figures 8, 10, 11, 12 were taken from [58] (© 2010 Association for Computing Machinery, Inc. Reprinted by permission) and Figures 4, 5, 6, 9 were taken from [59] with kind permission from Springer Science+Business Media.
References
1. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: Exploring photo collections in 3d. In: SIGGRAPH Conference Proceedings, pp. 835–846 (2006) 2. Snavely, N., Garg, R., Seitz, S.M., Szeliski, R.: Finding paths through the world's photos. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2008) 27, 11–21 (2008) 3. Goesele, M., Ackermann, J., Fuhrmann, S., Haubold, C., Klowsky, R., Steedly, D., Szeliski, R.: Ambient point clouds for view interpolation. In: SIGGRAPH (2010) 4. Kim, H., Sarim, M., Takai, T., Guillemaut, J.Y., Hilton, A.: Dynamic 3d scene reconstruction in outdoor environments. In: 3DPVT (2010) 5. Guan, L., Franco, J.S., Pollefeys, M.: Multi-object shape estimation and tracking from silhouette cues. In: CVPR (2008)
6. Franco, J.-S., Boyer, E.: Fusion of multi-view silhouette cues using a space occupancy grid. In: ICCV, pp. 1747–1753 (2005) 7. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based visual hulls. In: Proceedings of ACM SIGGRAPH, pp. 369–374 (2000) 8. Sarim, M., Hilton, A., Guillemaut, J.Y., Kim, H., Takai, T.: Multiple view widebaseline trimap propagation for natural video matting. In: 2010 Conference on Visual Media Production (CVMP), pp. 82–91 (2010) 9. Seitz, S., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR (2006) 10. Seitz, S.M., Dyer, C.R.: Photorealistic scene reconstruction by voxel coloring. In: CVPR, p. 1067 (1997) 11. Furukawa, Y., Ponce, J.: Dense 3d motion capture for human faces. In: CVPR, pp. 1674–1681 (2009) 12. Vlasic, D., Peers, P., Baran, I., Debevec, P., Popovi´c, J., Rusinkiewicz, S., Matusik, W.: Dynamic shape capture using multi-view photometric stereo. In: SIGGRAPH Asia (2009) 13. Ahmed, N., Theobalt, C., Dobrev, P., Seidel, H.P., Thrun, S.: Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In: CVPR (2008) 14. Hern´ andez, C., Vogiatzis, G., Cipolla, R.: Shadows in three-source photometric stereo. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 290–303. Springer, Heidelberg (2008) 15. Vedula, S., Baker, S., Seitz, S., Kanade, T.: Shape and motion carving in 6d. In: CVPR (2000) 16. Goldlucke, B., Ihrke, I., Linz, C., Magnor, M.: Weighted minimal hypersurface reconstruction. PAMI, 1194–1208 (2007) 17. Hilton, A., Starck, J.: Multiple view reconstruction of people. In: 3DPVT (2004) 18. Sinha, S.N., Pollefeys, M.: Multi-view reconstruction using photo-consistency and exact silhouette constraints: A maximum-flow formulation. In: ICCV, pp. 349–356 (2005) 19. Tung, T., Nobuhara, S., Matsuyama, T.: Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In: ICCV (2009) 20. Waschb¨ usch, M., W¨ urmlin, S., Gross, M.H.: 3d video billboard clouds. Computer Graphics Forum (Proc. Eurographics EG 2007) 26, 561–569 (2007) 21. Ballan, L., Cortelazzo, G.M.: Marker-less motion capture of skinned models in a four camera set-up using optical flow and silhouettes. In: 3DPVT (June 2008) 22. Carranza, J., Theobalt, C., Magnor, M.A., Peter Seidel, H.: Free-viewpoint video of human actors. ACM Transactions on Graphics, 569–577 (2003) 23. Vlasic, D., Baran, I., Matusik, W., Popovi´c, J.: Articulated mesh animation from multi-view silhouettes. ACM Transactions on Graphics 27, 97:1–97:9 (2008) 24. de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. ACM Trans. Graph. 27, 1–10 (2008) 25. Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High-quality video view interpolation using a layered representation. ACM Transactions on Graphics 23, 600–608 (2004) 26. Kanade, T.: Carnegie mellon goes to the superbowl (2001), http://www.ri.cmu.edu/events/sb35/tksuperbowl.html 27. W¨ urmlin, S., Niederberger, C.: Realistic virtual replays for sports broadcasts (2010), http://www.liberovision.com/
28. Guillemaut, J.-Y., Kilner, J., Hilton, A.: Robust graph-cut scene segmentation and reconstruction for free-viewpoint video of complex dynamic scenes. In: ICCV (2009) 29. Hayashi, K., Saito, H.: Synthesizing free-viewpoint images from multiple view videos in soccer stadium. In: CGIV, pp. 220–225 (2006) 30. Hasler, N., Rosenhahn, B., Thorm¨ ahlen, T., Wand, M., Gall, J., Seidel, H.P.: Markerless motion capture with unsynchronized moving cameras. In: CVPR, pp. 224– 231 (2009) 31. Lipski, C., Linz, C., Berger, K., Sellent, A., Magnor, M.: Virtual video camera: Image-based viewpoint navigation through space and time. Computer Graphics Forum 29, 2555–2568 (2010) 32. Eisemann, M., Decker, B.D., Magnor, M., Bekaert, P., de Aguiar, E., Ahmed, N., Theobalt, C., Sellent, A.: Floating Textures. Computer Graphics Forum (Proc. Eurographics EG 2008) 27, 409–418 (2008) 33. Seitz, S.M., Dyer, C.R.: View morphing. In: Proceedings of ACM SIGGRAPH, pp. 21–30 (1996) 34. Pollefeys, M., Van Gool, L., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual modeling with a hand-held camera. IJCV 59, 207–232 (2004) 35. Lhuillier, M., Quan, L.: A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Trans. Pattern Anal. Mach. Intell. 27, 418–433 (2005) 36. Ballan, L., Cortelazzo, G.M.: Multimodal 3D shape recovery from texture, silhouette and shadow information. In: 3DPVT. Chapel Hill, USA (2006) 37. Campbell, N.D., Vogiatzis, G., Hern´ andez, C., Cipolla, R.: Automatic 3d object segmentation in multiple views using volumetric graph-cuts. In: 18th British Machine Vision Conference, vol. 1, pp. 530–539 (2007) 38. Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M.: Multi-view stereo for community photo collections. In: ICCV, pp. 1–8 (2007) 39. Ballan, L., Brusco, N., Cortelazzo, G.M.: 3D Content Creation by Passive Optical Methods. In: 3D Online Multimedia and Games: Processing, Visualization and Transmission. World Scientific Publishing, Singapore (2008) 40. Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR, pp. 519–528 (2006) 41. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004) 42. Gallup, D., Frahm, J.M., Mordohai, P., Yang, Q., Pollefeys, M.: Real-time planesweeping stereo with multiple sweeping directions. In: CVPR (2007) 43. Zach, C., Pock, T., Bischof, H.: A globally optimal algorithm for robust tv-l1 range image integration. In: ICCV (2007) 44. Sheffer, A., Praun, E., Rose, K.: Mesh parameterization methods and their applications. Foundations and Trends in Computer Graphics and Vision 2, 105–171 (2006) 45. Brusco, N., Ballan, L., Cortelazzo, G.M.: Passive reconstruction of high quality textured 3D models of works of art. In: 6th International Symposium on Virtual Reality, Archeology and Cultural Heritage, VAST (2005) 46. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000); ISBN: 0521623049 47. Arulampalam, M.S., Maskell, S., Gordon, N.: A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. IEEE Trans. Signal Processing 50, 174– 188 (2002)
48. Sinha, S.N., Pollefeys, M.: Synchronization and calibration of camera networks from silhouettes. In: ICPR 2004: Proceedings of the Pattern Recognition, 17th International Conference on (ICPR 2004), vol. 1, pp. 116–119 (2004) 49. Tuytelaars, T., Van Gool, L.: Synchronizing video sequences. In: CVPR, vol. 1, pp. 762–768 (2004) 50. Reinhard, E., Adhikhmin, M., Gooch, B., Shirley, P.: Color transfer between images. IEEE Computer Graphics and Applications 21, 34–41 (2001) 51. Baumberg, A., Hogg, D.: An efficient method for contour tracking using active shape models. In: Motion of Non-Rigid and Articulated Objects, pp. 194–199 (1994) 52. Leibe, B., Cornelis, N., Cornelis, K., Gool, L.V.: Dynamic 3d scene analysis from a moving vehicle. In: CVPR (2007) 53. Elgammal, A., Duraiswami, R., Harwood, D., Davis, L.S.: Background and foreground modeling using nonparametric kernel density estimation for visual surveillance. Proceedings of the IEEE 90, 1151–1163 (2002) 54. Sheikh, Y., Javed, O., Kanade, T.: Background subtraction for freely moving cameras. In: ICCV (2009) 55. Bai, X., Wang, J., Simons, D., Sapiro, G.: Video snapcut: robust video object cutout using localized classifiers. ACM Trans. Graph. 28 (2009) 56. Wang, J., Bhat, P., Colburn, R.A., Agrawala, M., Cohen, M.F.: Interactive video cutout. ACM Trans. Graph. 24, 585–594 (2005) 57. Sun, J., Zhang, W., Tang, X., Shum, H.Y.: Background cut. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 628–641. Springer, Heidelberg (2006) 58. Ballan, L., Brostow, G.J., Puwein, J., Pollefeys, M.: Unstructured video-based rendering: Interactive exploration of casually captured videos. ACM Transactions on Graphics (Proceedings of SIGGRAPH), 1–11 (2010), http://doi.acm.org/10.1145/1833349.1778824 59. Taneja, A., Ballan, L., Pollefeys, M.: Modeling dynamic scenes recorded with freely moving cameras. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part III. LNCS, vol. 6494, pp. 613–626. Springer, Heidelberg (2011) 60. Cheng, Y.: Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 790–799 (1995) 61. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI 26, 1124–1137 (2004) 62. Rav-Acha, A., Kohli, P., Rother, C., Fitzgibbon, A.: Unwrap mosaics: A new representation for video editing. ACM Transactions on Graphics (SIGGRAPH 2008) (2008) 63. Chuang, Y.Y., Curless, B., Salesin, D.H., Szeliski, R.: A bayesian approach to digital matting. In: Proceedings of IEEE CVPR 2001, Kauai, Hawaii, vol. 2, pp. 264–271 (2001) 64. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. PAMI 23, 1222–1239 (2001) 65. Kolmogorov, V., Zabin, R.: What energy functions can be minimized via graph cuts? PAMI 26, 147–159 (2004) 66. Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3d surface construction algorithm. SIGGRAPH 21, 163–169 (1987) 67. Buehler, C., Bosse, M., McMillan, L., Gortler, S.J., Cohen, M.F.: Unstructured lumigraph rendering. In: SIGGRAPH, pp. 425–432 (2001)
68. Grundland, M., Vohra, R., Williams, G.P., Dodgson, N.A.: Cross dissolve without cross fade: Preserving contrast, color and salience in image compositing. In: Proceedings of EUROGRAPHICS, Computer Graphics Forum, pp. 577–586 (2006) 69. Sch¨ odl, A., Szeliski, R., Salesin, D.H., Essa, I.: Video textures. In: SIGGRAPH 2000: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 489–498 (2000) 70. Rong, G., Tan, T.S.: Jump flooding in gpu with applications to voronoi diagram and distance transform. In: ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), pp. 109–116. ACM, New York (2006) 71. Wang, J., Bodenheimer, B.: Synthesis and evaluation of linear motion transitions. ACM Trans. Graph 27, 1–15 (2008) 72. Debevec, P., Borshukov, G., Yu, Y.: Efficient view-dependent image-based rendering with projective texture-mapping. In: 9th Eurographics Workshop on Rendering (1998) 73. Unstructured VBR, http://www.cvg.ethz.ch/research/unstructured-vbr/ 74. Kilner, J., Starck, J., Hilton, A.: A comparative study of free-viewpoint video techniques for sports events. In: CVMP (2006)
Silhouette-Based Variational Methods for Single View Reconstruction
Eno Töppe1,2, Martin R. Oswald1, Daniel Cremers1, and Carsten Rother2
1 Technische Universität München, Germany
2 Microsoft Research, Cambridge, UK
Abstract. We explore the 3D reconstruction of objects from a single view within an interactive framework by using silhouette information. In order to deal with the highly ill-posed nature of the problem we propose two different reconstruction priors: a shape and a volume prior and cast them into a variational problem formulation. For both priors we show that the corresponding relaxed optimization problem is convex. This leads to unique solutions which are independent of initialization and which are either globally optimal (shape prior) or can be shown to lie within bounds from the optimal solution (volume prior). We analyze properties of the proposed priors with regard to the reconstruction results as well as their impact on the minimization problem. By employing an implicit volumetric representation our reconstructions enjoy complete topological freedom. Being parameter-based, our interactive reconstruction tool allows for intuitive and easy to use modeling of the reconstruction result. Keywords: Single View Reconstruction, Image-Based Modeling, Convex Optimization.
1 Introduction
1.1 Single View Reconstruction
The general problem of 3D reconstruction has been considered in a plethora of works in computer vision, at least in the case where multiple views are given and in stereo vision. With the help of multi-view concepts like point correspondences and photo-consistency it has been shown that high-quality reconstructions can be inferred from a set of photographs of a single object. However, there are relatively few works on single view reconstruction, although its underlying problem may be considered one of the most fundamental in vision. This may to a great extent be due to the high ill-posedness of the corresponding mathematical problem, but it is nevertheless astonishing, as humans excel at solving the task in everyday life. The main difficulty in inferring 3D geometry from a single image lies in the fact that the problem is inherently ill-posed. The process of image formation is not invertible and it is impossible to retrieve exact depth values from a single image. Thus, we have to make use of a strong prior. Such priors can either be obtained by
statistical learning of shape or by restraining the solution space, e.g. by assuming smoothness and compactness. In addition to a prior, user input can be incorporated into the reconstruction process, which can be realized as a modeling tool. While the growing amount of image data on the Internet increases the availability of multiple views for certain scenes, this only applies to a few places of strong public interest, such as tourist hot spots. Single view reconstruction becomes particularly important in situations where a rough estimate of object geometry is desired rather than an exact reconstruction. This is the case when generating an alternate view of a single photograph or changing the illumination of the depicted scene. In this work we follow the idea of modeling an object from a single view gradually by user input, but with the ultimate goal of keeping the process simple for the user. Instead of an involved modeling stage that amounts to the specification of absolute vertex positions and normal directions, we rather rely on user-provided global and local constraints that, together with a strong prior, lead to a reconstruction estimate. This work recapitulates two different priors for single view reconstruction which were proposed in [1] and [2]. One is based on a shape prior formulation, the other amounts to a global constraint on the volume of the reconstruction. We evaluate both approaches and compare them to each other.

1.2 Issues and Related Work
Existing work on single view reconstruction and on interactive 3D modeling can be roughly classified into the categories planar versus curved and implicit versus parametric approaches. Many approaches such as that of Horry et al. [3] aim to reconstruct planar surfaces by evaluating user defined vanishing points and lines. This has been extended by Liebowitz [4] and Criminisi [5]. This process has been completely automated by Hoiem et al. [6], yielding appealing results on a limited number of input images. Sturm et al. [7] make use of user-specified constraints such as coplanarity, parallelism and perpendicularity in order to reconstruct piecewise planar surfaces. An early work for the reconstruction of curved objects is Terzopoulos et al. [8] in which symmetry seeking models are reconstructed from a user defined silhouette and symmetry axis using snakes. However, this approach is restricted to the class of tube-like shapes. Moreover, reconstructions are merely locally optimal. The work of Zhang et al. [9] addresses this problem and proposes a model which globally optimizes a smoothness criterion. However, it concentrates on estimating height fields rather than reconstructing real 3D representations. Moreover, it requires a huge amount of user interaction in order to obtain appealing reconstructions. Also related to the field are easy-to-use tools like Teddy [10] and FiberMesh [11] that have pioneered sketch based modeling but are not image-based.
All of the cited works use explicit surface representations. While surface manipulation is often straightforward and a variety of cues are easily integrated, leading to respective forces or constraints on the surface, there are two major limitations: first, numerical solutions are generally not independent of the choice of parameterization; and second, parametric representations are not easily extended to objects of varying topology. While Prasad et al. [12] were able to extend their approach to surfaces with one or two holes, the generalization to objects of arbitrary topology is by no means straightforward and quite involved for the user. Similarly, topology-changing interaction in the FiberMesh system requires a complex remeshing of the modeled object, leading to computationally challenging numerical optimization schemes. For these reasons, in this work we pursue an implicit representation of the reconstructed object. Joshi et al. [13] also suggest a silhouette-based surface inflation method and minimize a similar energy as [9] or [12] in order to obtain a smooth surface. However, like Zhang et al. [9], Joshi et al. aim to reconstruct depth maps rather than full 3D objects. Another problem of all existing works is the fact that they revert to inflation heuristics in order to avoid surface collapse. These techniques boil down to fixing absolute depth values, which undesirably restricts the solution space. We show that a prior on the volume of the reconstruction solves this problem. A precursor to volume constraints is the volume inflation term pioneered for deformable models by Cohen and Cohen [14]. However, no constant volume constraints were considered and no implicit representations were used.

1.3 Contribution
In this paper, we focus on the reconstruction of curved objects of arbitrary topology with a minimum of user input in an interactive and intuitive framework. We propose a convex variational method which generates a 3D object in a matter of seconds using silhouette information only. To this end, we revert to an implicit representation of the surface given by the indicator function of its interior (sometimes referred to as voxel occupancy). In this representation, the weighted minimal surface problem is a convex functional and relaxation of the binary function leads to an overall convex problem. Two approaches are presented to overcome the ambiguity in the reconstruction process: in the first one we formulate a shape prior which determines the basic shape and at the same time inflates the reconstruction geometry. In the second approach we introduce a constraint on the volume of the reconstruction. We discuss advantages and shortcomings of both approaches. In both cases we detail how to solve the resulting optimization problem by means of relaxation. This leads to a solution to the unrelaxed problem that is globally optimal in the case of a shape prior and that we show to be within a bound of the optimum in the case of a volume prior.
(Fig. 1 panels, left to right: User Strokes for Segmentation, Silhouette, First Estimate, Final Result)
Fig. 1. The basic workflow of the single view reconstruction process: The user marks the input image with scribbles (left) from which a silhouette is generated by segmentation (second from left). A first reconstruction estimate is generated automatically from the silhouette (third from left). The user can then iteratively adapt the model in an interactive manner (right).
2 Reconstruction Workflow
A good silhouette is the main prerequisite for a reasonable reconstruction result with the algorithms proposed in Sections 4 and 5. The number of holes in the segmentation of the target object determines the topology of the reconstructed surface. Notably, the proposed reconstruction methods can also cope with disconnected regions of the object silhouette. The segmentation is obtained by utilizing an interactive graph cuts scheme similar to the ones described by [15] and [16]. The algorithm calculates two distinct regions based on respective color histograms which are defined by representational pen strokes given by the user (see Fig. 1). From the input image and silhouette a first reconstruction is generated automatically, which - depending on the complexity and the class of the object - can already be satisfactory. However, for some object classes and due to the general over-smoothing of the resulting mesh (see Section 3), the user can subsequently adapt the reconstruction by specifying intuitive and simple global and local constraints. These editing tools are completely parameter-based. The editing stage can be reiterated by the user until the desired result is obtained.
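As a rough stand-in for the interactive graph-cut segmentation of [15,16] (which builds color histograms from the user's pen strokes), the sketch below uses OpenCV's GrabCut initialized in mask mode from two boolean scribble masks. It illustrates the workflow only and is not the authors' implementation.

```python
import numpy as np
import cv2

def segment_from_scribbles(image, fg_scribble, bg_scribble, iters=5):
    """Binary silhouette from user pen strokes: pixels covered by the
    foreground/background scribbles are fixed, the rest starts as
    'probably background' and is resolved by GrabCut's graph cut."""
    mask = np.full(image.shape[:2], cv2.GC_PR_BGD, np.uint8)
    mask[bg_scribble] = cv2.GC_BGD
    mask[fg_scribble] = cv2.GC_FGD
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, mask, None, bgd_model, fgd_model, iters,
                cv2.GC_INIT_WITH_MASK)
    return np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
```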
3 Implicit Variational Surfaces
Assume we are given the silhouette of an object in an image as returned by an interactive segmentation tool. The goal is then to obtain a smooth 3D model of the object which is consistent with the silhouette. How should we select the correct 3D model among the infinitely many that match the silhouette? Clearly, we need to impose additional information, at the same time we want to keep this information at a minimum since user interaction is always tedious and slow. Formally, we are given an image plane Ω which contains the input image and lies in R3 . As part of the image we also have an object silhouette Σ ⊂ Ω. Now, we are seeking to compute reconstructions as minimal weighted surfaces S ⊂ R3 that are compliant with the object silhouette Σ:
    min_S ∫_S g(s) ds     subject to     π(S) = Σ        (1)
where π : R³ → Ω is the orthographic projection onto the image plane Ω, g : R³ → R+ is a smoothness weighting function and s ∈ S is an element of the surface S. We now introduce an implicit representation by replacing the surface S with its implicit binary indicator function u ∈ BV(R³; {0, 1}) representing the voxel occupancy (0 = exterior, 1 = interior), where BV denotes the functions of bounded variation [17]. The desired minimal weighted surface area is then given by minimizing the total variation (TV) over a suitable set U_Σ of feasible functions u:

    min_{u ∈ U_Σ} ∫ g(x) |∇u(x)| d³x        (2)
where ∇u denotes the derivative in the distributional sense. Eq. (2) favors smooth solutions. However, smoothness is locally affected by the function g(x) : R³ → R+, which will be used later for modeling. Without any modeling, g is the identity mapping by default, i.e. g(x) ≡ 1. What does the set U_Σ of feasible functions look like? For simplicity, we assume the silhouette to be enclosed by the surface. Then all surface functions that are consistent with the silhouette Σ must be in the set

    U_Σ = { u ∈ BV(R³; {0, 1}) :  u(x) = 0 if π(x) ∉ Σ,  u(x) = 1 if x ∈ Σ,  u(x) arbitrary otherwise }        (3)

Obviously, solving problem (1)/(2) results in the silhouette itself. Therefore we need further assumptions in order to rule out trivial solutions. In the subsequent sections we propose two different approaches to the problem.

Using the Weighting Function for Modeling
The weight g(x) of the TV-norm in Eq. (2) can be used to locally control the smoothness of the reconstruction: with a low value 0 ≤ g(x) < 1, the smoothness condition on the surface is locally relaxed, allowing for creases and sharp edges to form. Conversely, higher values g(x) > 1 locally enforce surface smoothness. For controlling the weighting function we employ a user scribble interface. The parameter associated to each scribble sets the local smoothness g(x) within the respective scribble area and is propagated through the volume along the projection direction. This approach of parametric local smoothness adaptation can be applied in the case of a data term (Section 4) as well as in the case of a constant volume constraint (Section 5).
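On a discrete voxel grid, the two ingredients above amount to a weighted total variation and a projection onto the silhouette-consistent set. A minimal sketch follows; the boolean masks encoding which voxels project inside the silhouette and which lie on the silhouette itself are assumed to be precomputed.

```python
import numpy as np

def weighted_tv(u, g, spacing=1.0):
    """Discrete weighted TV energy, sum_x g(x) |grad u(x)|, on a voxel grid
    (central differences via np.gradient)."""
    grads = np.gradient(u, spacing)
    return float(np.sum(g * np.sqrt(sum(d ** 2 for d in grads))))

def project_to_silhouette_set(u, inside_ray, on_silhouette):
    """Enforce the constraints of Eq. (3): u = 0 for voxels projecting
    outside the silhouette, u = 1 for voxels lying on the silhouette,
    and leave the remaining voxels untouched."""
    u = u.copy()
    u[~inside_ray] = 0.0
    u[on_silhouette] = 1.0
    return u
```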
4 Inflation via Shape Prior
By introducing a data term, we realize two objectives: volume inflation and determination of the basic reconstructed shape. Since there is no inherent data term in the single view setting, we have to define one heuristically. We choose a term of the following form:

    ∫ u(x) φ(x) d³x        (4)

φ : R³ → R can be adapted to achieve the desired object shape and may also be adapted by user interaction later on. Adding this term to the energy in Equation (2) amounts to the following problem:

    min_{u ∈ U_Σ} ∫ u(x) φ(x) d³x + λ ∫ g(x) |∇u(x)| d³x        (5)
where λ is a weighting parameter that determines the relative smoothness of the solution. In order to fix a definition for φ we make the simple assumption that the thickness of the observed object increases as we move inward from its silhouette. For any point p ∈ V let

    dist(p, ∂S) = min_{s ∈ ∂S} ||p − s||        (6)

denote its distance to the silhouette contour ∂S ⊂ Ω. Then we set

    φ(x) = −1   if dist(x, Ω) ≤ h(π(x))
           +1   otherwise        (7)

where the height map h : Ω → R depends on the distance of the projected 3D point to the silhouette according to the function

    h(p) = min( μ_cutoff , μ_offset + μ_factor · dist(p, ∂S)^k )        (8)

with four parameters k, μ_offset, μ_factor, μ_cutoff ∈ R+ affecting the shape of the function φ. How the user can employ these parameters to modify the computed 3D shape will be discussed in the following paragraph. Note that this choice of φ implies symmetry of the resulting model with respect to the image plane. Since the backside of the object is unobservable, it will be reconstructed properly for plane-symmetric objects.

Data Term Parameters. By altering the parameters μ_offset, μ_factor, μ_cutoff and the exponent k of the height map function (8), users can intuitively change the data term (4) and thus the overall shape of the reconstruction. Note that the impact of these parameters is attenuated with increasing importance of the smoothness term. The effects of the offset, factor and cutoff parameters on the height map are shown in Fig. 2 and are quite intuitive to grasp. The exponent k of the distance function in (8) mainly influences the object's curvature in the proximity of the silhouette contour. This can be observed in Fig. 2, showing an evolution from a cone to a cylinder just by decreasing k.
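A minimal sketch of how the height map (8) and the data term (7) could be assembled on a voxel grid, assuming an orthographic projection along the third grid axis with the image plane at depth zero (only one symmetric half of the volume is built); the grid layout, voxel size and parameter defaults are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def shape_prior_phi(silhouette, depth, voxel_size=1.0, k=1.0,
                    mu_offset=0.0, mu_factor=1.0, mu_cutoff=np.inf):
    """Data term phi of Eqs. (7)-(8) on an (H, W, depth) half-volume.
    'silhouette' is a boolean H x W mask; dist(p, dS) is approximated by
    the Euclidean distance transform of the silhouette."""
    dist_to_contour = distance_transform_edt(silhouette)
    h = np.minimum(mu_cutoff, mu_offset + mu_factor * dist_to_contour ** k)
    z = (np.arange(depth) + 0.5) * voxel_size   # distance to the image plane
    inside = z[None, None, :] <= h[:, :, None]
    # restricting to silhouette columns mirrors the constraint set U_Sigma
    return np.where(inside & silhouette[:, :, None], -1.0, 1.0)
```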
(Fig. 2 panels: effect of the cutoff, factor, and offset parameters; height map plots for k = 2, k = 1, and k = 1/100)
Fig. 2. Effect of μoffset , μfactor , μcutoff (left) and various values of parameter k and resulting (scaled) height map plots for a circular silhouette
Altering the Data Term Locally. Due to the incorporation of a distance transform in the data term, the reconstruction will always become flat at the silhouette border. However, this is not always desired, for instance for the bottom and top of the vase in Fig. 3. A simple remedy to this problem is to ignore parts of the contour during the calculation of the distance function. The user indicates the sections of the silhouette contour to be ignored. To keep user interaction simple, we approximate the object contour by a polygon which is laid over the input image. By clicking on an edge, the user indicates that the corresponding contour pixels should be ignored during distance map calculation (see Fig. 3, top right).

4.1 Optimization via Convex Relaxation
To minimize energy (2) plus the data term (4) we follow the framework developed in [18]. To this end, we relax the binary problem, looking for functions u : V → [0, 1] instead. We can globally minimize the resulting convex functional by solving the corresponding Euler-Lagrange equation

    0 = φ − λ div( g ∇u / |∇u| )        (9)

using a fixed-point iteration in combination with Successive Over-Relaxation (SOR). A global optimum of the original binary labeling problem is then obtained by simple thresholding of the solution of the relaxed problem – see [18] for details. In [19] it was shown that such relaxation techniques have several advantages over graph cut methods. In this work, the two main advantages are the lack of metrication errors and the parallelization potential. These two aspects allow us to compute smooth single view reconstructions with no grid bias within a few seconds using standard graphics hardware.
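For illustration, the relaxed problem can also be attacked with a plain projected gradient descent on energy (5) followed by thresholding. This is a simpler (and slower) alternative to the fixed-point/SOR scheme of [18] used in the chapter; the step size, iteration count and ε-regularization of |∇u| below are arbitrary choices.

```python
import numpy as np

def relaxed_shape_prior_solve(phi, g, inside_ray, on_silhouette,
                              lam=1.0, tau=0.1, iters=500, eps=1e-6):
    """Projected gradient descent on E(u) = <u, phi> + lam * weighted-TV(u),
    whose gradient is phi - lam * div(g grad(u)/|grad(u)|), cf. Eq. (9)."""
    u = np.where(inside_ray, 0.5, 0.0)
    u[on_silhouette] = 1.0
    for _ in range(iters):
        grads = np.gradient(u)
        norm = np.sqrt(sum(d ** 2 for d in grads)) + eps
        flux = [g * d / norm for d in grads]
        div = sum(np.gradient(f, axis=i) for i, f in enumerate(flux))
        u = np.clip(u - tau * (phi - lam * div), 0.0, 1.0)
        u[~inside_ray] = 0.0        # silhouette consistency, Eq. (3)
        u[on_silhouette] = 1.0
    return u > 0.5                  # threshold the relaxed solution
```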
Fig. 3. Top row: height maps and corresponding reconstructions with and without marked sharp contour edges. Bottom row: input image with marked contour edges (blue) and line strokes (red) for local discontinuities which are shown right.
(Fig. 4 panels, left to right: Input Image, Reconstructed Geometry with Input Image, Reconstructed Geometry only, Textured Geometry)
Fig. 4. An example with intricate topology. Due to the implicit representation of the reconstruction surface, the algorithm of Sections 4 and 5 can handle any genus
4.2 Experiments
In the following we apply our reconstruction method with data term to several input images. We show different aspects of the reconstruction process for typical classes of target objects and mention limitations of the approach. The experimental results are shown in Figures 4, 5, 6 and 7. Default values for the data term parameters (8) are k = 1, μoffset = 0, μfactor = 1, μcutoff = ∞. Each row depicts several views of a single object reconstruction starting with the input image.
(Fig. 5 panels, left to right: Input Image, Textured Geometry with Input Image, Reconstructed Geometry only, Textured Geometry)
Fig. 5. For the cockatoo very little additional user input was necessary. The smoothness was reduced locally by a single user scribble (see Section 3).
(Fig. 6 panels, left to right: Input Image, Textured Geometry with Input Image, Reconstructed Geometry with Input Image, Textured Geometry)
Fig. 6. The Cristo statue is composed of smooth and non-smooth parts. The socket part was marked as non-smooth by a user scribble adapting the weight of the minimal surface locally.
The fence in Fig. 4 is an example of the complex topology the algorithm can handle. The reconstruction was automatically generated by the method right after the segmentation stage, i.e. without changing the surface smoothness. Figures 5, 6 and 7 demonstrate the potential of user editing as described in Sections 3 and 4. The reconstructions were edited by adapting the local smoothness and locally editing the data term. It can be seen that elaborate modeling effects can be achieved by these simple operations. Especially for the cockatoo, a single curve suffices to add the characteristic indentation to the beak. No expert knowledge is necessary. For the socket of the Cristo statue, creases help to attain sharp edges while keeping the rest of the statue smooth. It should be stressed that no other post-processing operations were used. The experiments in Figure 7 represent a more complex series of target objects. A closer look reveals that here the algorithm clearly reaches its limit. The structure of the opera building (third row) as well as the elaborate geometry of the bike and its drivers cannot be correctly reconstructed with the proposed method, due to a lack of information and of more sophisticated tools. Yet the results are appealing and could be spiced up with the given tools.
(Fig. 7 panels, left to right: Input Image, Textured Geometry with Input Image, Reconstruction Geometry only, Textured Geometry)
Fig. 7. Reconstruction examples where the algorithm attains its limit. Nevertheless the results are pleasing and could be used for tasks like new view synthesis.
5 Inflation via Volume Prior
Adding a data term to the variational problem (2) delivers reasonable results for single view reconstruction, as shown in the last section. However, we have also seen that a data term imposes a strong bias on the shape. Ideally, we would like a non-heuristic inflation approach that does not restrict the shape variety while at the same time exhibiting the natural compactness seen in the experiments above. As an alternative inflation strategy we propose to use a constraint on the size of the volume enclosed by the minimal surface. We formulate this as a hard constraint by further constraining the feasible set of problem (2):

\min_{u \in U_\Sigma \cap U_V} E(u) \quad \text{where} \quad E(u) = \int g(x)\,|\nabla u(x)|\,d^3x    (10)

and

U_V = \left\{\, u \in BV(\mathbb{R}^3; \{0,1\}) \;\Big|\; \int u(x)\,d^3x = V_t \,\right\}    (11)

where $U_V$ denotes all reconstructions of bounded variation that have the specific target volume $V_t$. Different approaches to finding $V_t$ can be considered. Since in the implementation the optimization domain is naturally bounded, we choose $V_t$ to be a fixed
fraction of the volume of this domain. In a fast interactive framework, the user can then adapt the target volume with the help of instant visual feedback. Most importantly, as opposed to a data-term-driven model, volume constraints do not dictate where inflation takes place.
5.1 Optimization via Convex Relaxation
As in Section 4 we choose to relax the binary problem. This amounts to replacing $U_V$ and $U_\Sigma$ with their respective convex hulls $U_V^r$ and $U_\Sigma^r$. The corresponding optimization problem is then convex:

Proposition 1. The relaxed set $U^r := U_\Sigma^r \cap U_V^r$ is convex.
Proof. The constraint in the definition of $U_V$ is clearly linear in u and therefore $U_V^r$ is convex. The same argument holds for $U_\Sigma$. Being an intersection of two convex sets, $U^r$ is convex as well.

One standard way of finding the globally optimal solution to this problem is gradient descent, which is known to converge very slowly. Since optimization speed is an integral part of an interactive reconstruction framework, we convert our problem to a form for which a primal-dual scheme [20] can be applied. We start by replacing the TV-norm in our minimal surface problem by its weak equivalent:

\min_{u \in U^r} \int g(x)\,|\nabla u|\, d^3x \;=\; \min_{u \in U^r}\; \max_{|\xi(x)|_2 \le g(x)} \int -u\, \mathrm{div}\,\xi \; d^3x    (12)

where $\xi \in C_c^1(\mathbb{R}^3, \mathbb{R}^3)$. The main problem is that we are dealing with an optimization problem over a constrained set: u needs to fulfill three constraints, namely silhouette consistency, constant volume, and $u \in [0,1]$. In order to maintain silhouette consistency (3) of the solution, we simply restrict updates to those voxels which project onto the silhouette interior, excluding the silhouette itself. Furthermore, we reformulate the volume constraint as a Lagrange multiplier $\lambda$, which together with Equation (12) leads to the following Lagrangian dual problem [21]:

\max_{\lambda,\; |\xi(x)|_2 \le g(x)} \;\min_{u \in U_\Sigma^r} \int -u\, \mathrm{div}\,\xi \; d^3x \;+\; \lambda \left( \int u \; d^3x - V_t \right)    (13)
Equation (13) is a saddle point problem. In [20] it was shown how to solve saddle point problems of this special form with a primal-dual scheme. We employ this scheme, which is fast and provably convergent. It consists of alternating a gradient descent with respect to the function u and a gradient ascent for the dual variables ξ and λ, interlaced with an over-relaxation step on the primal variable:

\xi^{k+1} = \Pi_{|\xi(x)|_2 \le g(x)}\left(\xi^k + \tau\,\nabla\bar{u}^k\right)
\lambda^{k+1} = \lambda^k + \tau\left(\int \bar{u}^k\, dx - V_t\right)
u^{k+1} = \Pi_{u \in [0,1]}\left(u^k - \sigma\,(-\mathrm{div}\,\xi^{k+1} + \lambda^{k+1})\right)
\bar{u}^{k+1} = 2u^{k+1} - u^k    (14)
where $\Pi_A$ denotes the projection onto the set A (see [20] for details). Note that the projection for the primal variable u now reduces to a clipping operation; the projection of ξ is done by simple clipping as well. The scheme (14) is numerically attractive since it avoids division by the potentially zero-valued gradient norm which appears in the Euler-Lagrange equation of the TV-norm. Moreover, it is parallelizable, and we therefore implemented it on the GPU.
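The updates in (14) translate almost line by line into code. The sketch below assumes a regular voxel grid with unit spacing, omits the silhouette-consistency restriction for brevity, and uses illustrative step sizes; it is a simplified stand-in, not the authors' GPU implementation.

```python
import numpy as np

def primal_dual_volume_prior(g, V_t, tau=0.25, sigma=0.25, iters=300):
    """Sketch of the primal-dual iteration (14) for the volume-constrained
    weighted minimal surface problem. g: 3D array of smoothness weights;
    V_t: target volume in voxels. Silhouette consistency is omitted here."""
    u = np.zeros_like(g)
    u_bar = u.copy()
    xi = np.zeros((3,) + g.shape)                 # dual vector field
    lam = 0.0                                      # volume multiplier
    for _ in range(iters):
        # dual ascent and reprojection onto {|xi(x)|_2 <= g(x)}
        xi += tau * np.stack(np.gradient(u_bar))
        norm = np.maximum(np.sqrt((xi ** 2).sum(axis=0)), 1e-12)
        xi *= np.minimum(1.0, g / norm)
        # dual ascent for the Lagrange multiplier of the volume constraint
        lam += tau * (u_bar.sum() - V_t)
        # primal descent with clipping to [0, 1]
        div = sum(np.gradient(xi[a], axis=a) for a in range(3))
        u_new = np.clip(u - sigma * (lam - div), 0.0, 1.0)
        u_bar = 2.0 * u_new - u                    # over-relaxation step
        u = u_new
    return u
```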
5.2 Optimality Bounds
Having computed a globally optimal solution $u_{opt}$ of Equation (12), the question remains how we obtain a binary solution and how the two solutions relate to one another energetically. Unfortunately, no thresholding theorem holds which would imply energetic equivalence of the relaxed optimum and its thresholded version for arbitrary thresholds. Nevertheless, we can construct a binary solution $u_{bin}$ as follows:

Proposition 2. The relaxed solution can be projected to the set of binary functions in such a way that the resulting binary function preserves the user-specified volume $V_t$.

Proof. It suffices to order the voxels x by decreasing values u(x). Subsequently, one sets the value of the first $V_t$ voxels to 1 and the value of the remaining voxels to 0.

Concerning an optimality bound, the following holds:

Proposition 3. Let $u^r_{opt}$ be the globally optimal solution of the relaxed energy and $u_{opt}$ the globally optimal solution of the binary problem. Then

E(u_{bin}) - E(u_{opt}) \le E(u_{bin}) - E(u^r_{opt}).    (15)
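The projection of Proposition 2 amounts to keeping the $V_t$ voxels with the largest relaxed values; a minimal sketch follows (the function name and the voxel-count convention are illustrative):

```python
import numpy as np

def project_to_binary_with_volume(u, V_t):
    """Set the V_t voxels with the largest relaxed values u(x) to 1 and the rest
    to 0, so the binary result has exactly the user-specified volume (in voxels)."""
    flat = u.ravel()
    order = np.argsort(flat)[::-1]        # voxels sorted by decreasing u(x)
    u_bin = np.zeros_like(flat)
    u_bin[order[:int(V_t)]] = 1.0
    return u_bin.reshape(u.shape)
```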
5.3 Theoretical Analysis of Material Concentration
As we have seen above, the proposed convex relaxation technique does not guarantee global optimality of the binary solution. The thresholding theorem [22] – applicable in the unconstrained problem – no longer applies to the volume-constrained problem. While the relaxation naturally gives rise to posterior optimality bounds, one may take a closer look at the given problem and ask why the relaxed volume labeling u should favor the emergence of solid objects rather than distribute the prescribed volume equally over all voxels. In the following, we prove analytically that the proposed functional has an energetic preference for material concentration. For simplicity, we consider the case that the object silhouette in the image is a disk, and we compare two extreme cases: all of the volume concentrated in a ball (a known solution of the Cheeger problem) versus the same volume distributed equally over the feasible space (namely a cylinder) – see Figure 8.
Fig. 8. The two cases considered in the analysis of the material concentration for the approach in Section 5. On the left hand side we assume a hemi-spherical condensation of the material. On the right hand side the material is distributed evenly over the volume.
Proposition 4. Let $u_{sphere}$ denote the binary solution which is 1 inside the sphere and 0 outside – Fig. 8, left side – and let $u_{cyl}$ denote the solution which is uniformly distributed (i.e., constant) over the entire cylinder – Fig. 8, right side. Then we have

E(u_{sphere}) < E(u_{cyl}),    (16)

independent of the height of the cylinder.

Proof. Let R denote the radius of the disk. Then the energy of $u_{sphere}$ is simply given by the area of the half-sphere:

E(u_{sphere}) = \int |\nabla u_{sphere}|\, d^2x = 2\pi R^2.    (17)

If, instead of being concentrated in the half-sphere, the same volume, i.e. $V = \frac{2\pi}{3}R^3$, is distributed uniformly over the cylinder of height $h \in (0, \infty)$, we have

u_{cyl}(x) = \frac{V}{\pi R^2 h} = \frac{2\pi R^3}{3\pi R^2 h} = \frac{2R}{3h}    (18)

inside the entire cylinder, and $u_{cyl}(x) = 0$ outside the cylinder. The respective surface energy of $u_{cyl}$ is given by the area of the cylinder weighted by the respective jump size:

E(u_{cyl}) = \int |\nabla u_{cyl}|\, d^2x = \frac{2R}{3h}\left(\pi R^2 + 2\pi R h\right) + \left(1 - \frac{2R}{3h}\right)\pi R^2 = \frac{7}{3}\pi R^2 > E(u_{sphere}).    (19)
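A quick numerical check of the proof (a hypothetical helper, not part of the paper) confirms that the cylinder energy equals (7/3)πR² for every admissible height h and always exceeds the half-sphere energy 2πR²:

```python
import numpy as np

def energies(R, h):
    """Evaluate Eqs. (17)-(19): half-sphere energy and weighted-jump energy of
    the uniformly filled cylinder of height h."""
    e_sphere = 2.0 * np.pi * R ** 2                       # Eq. (17)
    jump = 2.0 * R / (3.0 * h)                            # Eq. (18)
    e_cyl = jump * (np.pi * R ** 2 + 2.0 * np.pi * R * h) \
            + (1.0 - jump) * np.pi * R ** 2               # Eq. (19)
    return e_sphere, e_cyl

for h in (1.0, 2.0, 10.0, 1000.0):
    e_s, e_c = energies(R=1.0, h=h)
    assert np.isclose(e_c, 7.0 / 3.0 * np.pi) and e_c > e_s
```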
5.4 Experiments
In this section we study the properties of constant volume weighted minimal surfaces again within an interactive reconstruction environment. We show that appealing and realistic 3D models can be generated with minimal user input.
Fig. 9. By increasing the target volume with the help of a slider, the reconstruction is intuitively inflated. In this example the initial rendering of the volume with 175×135×80 voxels took 3.3 seconds. Starting from there, each subsequent volume adaptation took only about 1 second. (Panels, left to right: input image; reconstruction; +30% volume; +40% volume.)
Fig. 10. The constant volume approach favors minimal surfaces for a user-specified volume. This amounts to solving a Cheeger set problem. (Panels, left to right: input image; reconstructed geometry; textured geometry.)
Cheeger Sets and Single View Reconstruction. Solutions to the problem in Eq. (10) are so-called Cheeger sets, i.e., minimal surfaces for a fixed volume. In the simplest case of a circle-shaped silhouette the corresponding Cheeger set is a ball. Fig. 10 demonstrates that in fact round silhouette boundaries (in the unweighted case g(x) ≡ 1) result in round shapes. In the example of the balloon it also becomes apparent that thinner structures in the silhouette are inflated less than compact parts: going from the top of the balloon toward the basket at the bottom, the inflation gradually decreases along with the silhouette width. Varying the Volume. In the constant volume formalism presented in this section the only parameter we have to determine for our reconstruction is the target volume $V_t$ (apart from the weighting function g(x) of the TV-norm in Eq. (12)). The effect of changing this scalar parameter on the appearance of the reconstruction surface can be seen in Fig. 9. One can see that the adaptation of the target volume has an intuitive effect on the resulting shape. This is important for an interactive, user-driven reconstruction process, which is made possible by the small computation times we gain through a parallel implementation of the algorithm in Eq. (14).
Fig. 11. The proposed approach makes it possible to generate 3D models with sharp edges, marked by the user as locations of low smoothness (see Section 3). Along the red user strokes (second from left) the local smoothness weighting is decreased. (Panels, left to right: image with user input; reconstructions; pure geometry.)
Fig. 12. Volume inflation dominates where the silhouette area is large (bird), whereas thin structures (twigs) are inflated less. (Panels, left to right: input; reconstruction; different view; geometry.)
Sharp Edges. Similar to Section 4, we examine the effects of adapting the smoothness locally, but now using the volume prior instead of the shape prior. Fig. 11 shows that, by adapting the weighting function g(x) of Eq. (12), not only round but also other very characteristic shapes can be modeled with minimal user interaction. The 2D user input is shown alongside the reconstruction results. More experiments with smoothness adaptation for the constant volume case are presented in Section 6.
6 Comparison
In this section we compare the two proposed priors with respect to their reconstruction results, usability and runtime. Comparison of Experimental Results. We have already indicated that the data term acts as a strong prior on the resulting shape of the reconstruction. This can be verified in Fig. 14: The left reconstruction was done with the shape prior as described in Section 4. Clearly the silhouette distance transform dominates
Fig. 13. Using a silhouette distance transform as shape prior, the relation between data term (second from left) and reconstruction (third from left) is not easy to assess for a user. The Cheeger set approach of Section 5 behaves more naturally in this respect (right). (Panels, left to right: input image; data term as shape prior; reconstruction with shape prior; reconstruction with volume prior.)
the shape in the resulting reconstruction. This might of course be advantageous for particular shapes like the examples shown in Fig. 2. Still, often a Cheeger set (right of Fig. 14) is a better guess for natural shapes. Increasing the smoothness parameter in the data term approach will mitigate the influence of the distance transform. However, with higher smoothness the result tends to be less voluminous, making it hard to achieve ball-like shapes (see Fig. 13). In both approaches thin structures in the silhouette are less inflated than more compact parts. This is a basic property of minimal surfaces. Nevertheless, in the data term approach thin structures tend to be too flat, especially in the presence of a high smoothness parameter (see Fig. 14). In principle, a data term limits the flexibility of the reconstructions. The airplane in Fig. 15 represents an example in which a parametric shape prior (just like the proposed data term of Section 4) would fail to offer the flexibility required for modeling protrusions. Since our fixed-volume approach does not impose points of inflation, user input can influence the reconstruction more freely: marking the wings as highly non-smooth (i.e., low weights 0 < g(x) < 0.3) effectively makes them pop out. From a user perspective, the shape prior approach is much more involved in terms of the amount of user input. The shape prior consists of four parameters to offer reasonable but still limited flexibility to adapt its shape. On the other hand, for the volume prior approach only one parameter needs to be specified by the user. In Figs. 14 and 13 we make use of the same input images that were used in [12]. Comparing both results with the ones in their work reveals that we get comparable results with a significantly lower amount of user input. As opposed to their work, our method exhibits complete topological freedom. Figure 16 depicts a direct comparison of the two proposed priors on several reconstruction results. Again one can observe that the volume prior generally yields more roundish shapes, while the distance function dominates the results with the shape prior. In sum, both priors yield comparable reconstruction results. Although for the shape prior approach the user has slightly more possibilities to adapt the final shape of the reconstruction, this freedom is paid for by a significant amount of additional user input, as it is not always simple to find the right combination of
Fig. 14. In contrast to a solution with shape prior (Eq. (5)) (center), the solution with volume prior (right) does not favor a specific shape and generates more natural looking results. Although in the center reconstruction the dominating shape prior can be mitigated by a higher smoothness (λ in Eq. (5)), this ultimately leads to the flattening of thin structures like the handle. (Panels, left to right: input image; reconstruction with data term as shape prior; reconstruction with volume prior.)
Fig. 15. An example of a minimal surface with prescribed volume and local smoothness adaptation. Colored lines in the input image mark user input, which locally alters the surface smoothness. Red marks low, yellow marks high smoothness (see Section 3 for details). (Panels, left to right: image with user input; reconstructed geometry; textured geometry.)
parameters. In contrast, the volume prior has only a single parameter, making it simpler to adapt the shape of the reconstruction. A limitation of both methods is the implicit assumption that the plane of object symmetry should be approximately parallel to the image plane, since the topology of the reconstructed object is directly inferred from the topology of the object's silhouette. Runtime Comparison. The two priors lead to different optimization problems and we solved them with different optimization schemes. Both have been implemented in a parallel manner using the NVIDIA CUDA framework. All computation times refer to a PC with a 2.27GHz Intel Xeon CPU and an NVIDIA GeForce GTX580 graphics device running a recent Linux distribution. The computation times for the shape prior approach are slightly lower than those of the volume prior approach. For instance, for the teapot example (Fig. 14) with a resolution of 189 × 139 × 83, the method with shape prior needs 2 seconds while the method with volume prior needs 4.7 seconds. However, the computation times depend mainly on the volume resolution and also on the
Fig. 16. Direct comparison of the methods with shape and volume prior for several examples. (Panels, left to right: input image; reconstruction with shape prior; reconstruction with volume prior; geometry with shape prior; geometry with volume prior.)
object to be reconstructed. For a reconstruction of reasonable quality, lower volume resolutions may also be sufficient. When using, e.g., 63 × 47 × 32, the computation times drop to 0.03s and 0.13s for the methods with shape and volume prior, respectively.
7 Conclusion
In this work we considered a variational approach to the problem of 3D reconstruction from a single view by searching for a weighted minimal surface that is consistent with the given silhouette. A major part of our contribution is to show
that this can be done in an interactive way by providing a tool that is intuitive and computes solutions within seconds. Two paradigms were proposed in order to deal with the highly ill-posed task. In the first one we introduce a shape prior that is incorporated as a data term in order to avoid flat solutions. This approach is along the lines of other works, as it boils down to fixing depth values of the reconstruction in order to inflate it. In the other proposed method we search for a weighted minimal surface that complies with a fixed, user-given volume. The resulting Cheeger set problem does not require specifying expected depth of any sort, thus providing more geometric flexibility in the result. In the former case we compute globally optimal solutions to the variational problem. In the latter case we showed that the solution lies within a bound of the optimum and exactly fulfills the prescribed volume. We compared both priors and found that the volume prior is more flexible and thus better suited for the task of single view reconstruction. On a variety of challenging real-world images, we showed that the proposed method compares favorably with existing approaches, that volume variations lead to families of realistic reconstructions, and that additional user scribbles allow the user to locally reduce smoothness so as to easily create protrusions.
References
1. Oswald, M.R., Töppe, E., Kolev, K., Cremers, D.: Non-parametric single view reconstruction of curved objects using convex optimization. In: Denzler, J., Notni, G., Süße, H. (eds.) Pattern Recognition. LNCS, vol. 5748, pp. 171–180. Springer, Heidelberg (2009)
2. Toeppe, E., Oswald, M.R., Cremers, D., Rother, C.: Image-based 3D modeling via Cheeger sets. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010, Part I. LNCS, vol. 6492, pp. 53–64. Springer, Heidelberg (2011)
3. Horry, Y., Anjyo, K.I., Arai, K.: Tour into the picture: using a spidery mesh interface to make animation from a single image. In: SIGGRAPH 1997: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pp. 225–232. ACM Press/Addison-Wesley, New York (1997)
4. Liebowitz, D., Criminisi, A., Zisserman, A.: Creating architectural models from images. In: Proc. EuroGraphics, vol. 18, pp. 39–50 (1999)
5. Criminisi, A., Reid, I., Zisserman, A.: Single view metrology. Int. J. Comput. Vision 40, 123–148 (2000)
6. Hoiem, D., Efros, A.A., Hebert, M.: Automatic photo pop-up. ACM Trans. Graph. 24, 577–584 (2005)
7. Sturm, P.F., Maybank, S.J.: A method for interactive 3D reconstruction of piecewise planar objects from single images. In: Proc. BMVC, pp. 265–274 (1999)
8. Terzopoulos, D., Witkin, A., Kass, M.: Symmetry-seeking models and 3D object reconstruction. IJCV 1, 211–221 (1987)
9. Zhang, L., Dugas-Phocion, G., Samson, J.S., Seitz, S.M.: Single view modeling of free-form scenes. In: Proc. of CVPR, pp. 990–997 (2001)
10. Igarashi, T., Matsuoka, S., Tanaka, H.: Teddy: a sketching interface for 3D freeform design. In: SIGGRAPH 1999, pp. 409–416. ACM Press/Addison-Wesley, New York (1999)
11. Nealen, A., Igarashi, T., Sorkine, O., Alexa, M.: FiberMesh: designing freeform surfaces with 3D curves. ACM Trans. Graph. 26, 41 (2007)
12. Prasad, M., Zisserman, A., Fitzgibbon, A.W.: Single view reconstruction of curved surfaces. In: CVPR, pp. 1345–1354 (2006)
13. Joshi, P., Carr, N.: Repoussé: Automatic inflation of 2D art. In: Eurographics Workshop on Sketch-Based Modeling (2008)
14. Cohen, L.D., Cohen, I.: Finite-element methods for active contour models and balloons for 2-D and 3-D images. IEEE Trans. on Patt. Anal. and Mach. Intell. 15, 1131–1147 (1993)
15. Boykov, Y.Y., Jolly, M.P.: Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: ICCV, vol. 1, pp. 105–112 (2001)
16. Rother, C., Kolmogorov, V., Blake, A.: GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23, 309–314 (2004)
17. Ambrosio, L., Fusco, N., Pallara, D.: Functions of Bounded Variation and Free Discontinuity Problems. The Clarendon Press/Oxford University Press, New York (2000)
18. Kolev, K., Klodt, M., Brox, T., Cremers, D.: Continuous global optimization in multiview 3D reconstruction. International Journal of Computer Vision (2009)
19. Klodt, M., Schoenemann, T., Kolev, K., Schikora, M., Cremers, D.: An experimental comparison of discrete and continuous shape optimization methods. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 332–345. Springer, Heidelberg (2008)
20. Pock, T., Cremers, D., Bischof, H., Chambolle, A.: An algorithm for minimizing the piecewise smooth Mumford-Shah functional. In: IEEE Int. Conf. on Computer Vision, Kyoto, Japan (2009)
21. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
22. Chan, T., Esedoğlu, S., Nikolova, M.: Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal on Applied Mathematics 66, 1632–1648 (2006)
Single Image Blind Deconvolution with Higher-Order Texture Statistics
Manuel Martinello and Paolo Favaro
Heriot-Watt University, School of EPS, Edinburgh EH14 4AS, UK
Abstract. We present a novel method for solving blind deconvolution, i.e., the task of recovering a sharp image given a blurry one. We focus on blurry images obtained from a coded aperture camera, where both the camera and the scene are static, and allow blur to vary across the image domain. Like most methods for blind deconvolution, we solve the problem in two steps: First, we estimate the coded blur scale at each pixel; second, we deconvolve the blurry image given the estimated blur. Our approach is to use linear high-order priors for texture and second-order priors for the blur scale map, i.e., constraints involving two pixels at a time. We show that by incorporating the texture priors in a least-squares energy minimization we can transform the initial blind deconvolution task into a simpler optimization problem. One of the striking features of the simplified optimization problem is that the parameters that define the functional can be learned offline directly from natural images via singular value decomposition. We also show a geometrical interpretation of image blurring and explain our method from this viewpoint. In doing so we devise a novel technique to design optimally coded apertures. Finally, our coded blur identification results in computing convolutions, rather than deconvolutions, which are stable operations. We will demonstrate in several experiments that this additional stability allows the method to deal with large blur. We also compare our method to existing algorithms in the literature and show that we achieve state-of-the-art performance with both synthetic and real data. Keywords: coded aperture, single image, image deblurring, depth estimation.
1 Introduction
Recently there has been enormous progress in image deblurring from a single image. Perhaps one of the most remarkable results is to have shown that it is possible to extend the depth of field of a camera by modifying the camera optical response [1,2,3,4,5,6,7]. Moreover, techniques based on applying a mask at the lens aperture have demonstrated the ability to recover a coarse depth of the
This research was partly supported by SELEX Galileo grant SELEX/HWU/2010/SOW3.
Fig. 1. Results on an outdoor scene [exposure time 1/200s]. (a) Blurry coded image captured with mask b (see Fig. 4). (b) Sharp image reconstructed with our method.
scene [4,5,8]. Depth has then been used for digital refocusing [9] and advanced image editing. In this paper we present a novel method for image deblurring and demonstrate it on blurred images obtained from a coded aperture camera. Our algorithm uses as input a single blurred image (see Fig. 1 (a)) and automatically returns the corresponding sharp one (see Fig. 1 (b)). Our main contribution is to provide a computationally efficient method that achieves state-of-the-art performance in terms of depth and image reconstruction with coded aperture cameras. We demonstrate experimentally that our algorithm can deal with larger amounts of blur than previous coded aperture methods. One of the leading approaches in the literature [5] recovers a sharp image by sequentially testing a deconvolution method for several given hypotheses for the blur scale. Then, the blur scale that yields a sharp image that is consistent with both the model and the texture priors is chosen. In contrast, in our approach we show that one can identify the blur scale by computing convolutions, rather than deconvolutions, of the blurry image with a finite set of filters. As a consequence, our method is numerically stable, especially when dealing with large blur scales. In the next sections, we present all the steps needed to define our algorithm for image deblurring. The task is split into two steps: first, the blur scale is identified; second, the coded image is deblurred with the estimated blur scale. We present an algorithm for blur scale identification in section 3.1. Image deblurring is then solved iteratively in section 3.2. A discussion on mask selection is then presented in section 4.1. Comparisons to existing methods are shown in section 5.
1.1 Prior Work
This work relates to several fields ranging from computer vision to image and signal processing, and from optics to astronomy and computer graphics. For simplicity, we group past work based on the technique being employed.
Coded Imaging: Early work in coded imaging appears in the field of astronomy. One of the most interesting pattern designs is the Modified Uniformly Redundant Arrays (MURA) [10], for which a simple coding and decoding procedure was devised (see one such pattern in Fig. 4). In our tests the MURA pattern seems very well behaved, but too sensitive to noise (see Fig. 5). Coded patterns have also been used to design lensless systems, but these systems require either long exposures or are sensitive to noise [11]. More recently, coding of the exposure [12] or of the aperture [4] has been used to preserve high spatial frequencies in blurred images so that deblurring is well-posed. We test the mask proposed in [4] and find that it works well for image deblurring, but not for blur scale identification. A mask that we have tested and that has yielded good performance is the four-hole mask of Hiura and Matsuyama [13]. In [13], however, the authors used multiple images. A study on good apertures for deblurring multiple coded images via Wiener filtering has instead led to two novel designs [14,15]. Although the masks were designed to be used together, we have tested each of them independently for comparison purposes. We found, as predicted by the authors, that the masks are quite robust to noise and quite well designed for image deblurring. Image deblurring and depth estimation with a coded aperture camera has also been demonstrated by Levin et al. [5]. One of their main contributions is the design of an optimal mask. We indeed find this mask quite effective on both synthetic data and real data. However, as already noticed in [16], we have found that the coded aperture technique, if approached as in [5], fails when dealing with large blur amounts. The method we propose in this paper, instead, overcomes this limitation, especially when using the four-hole mask. Finally, a design based on annular masks has also been proposed in [17] and has been exploited for depth estimation in [3]. We also tested this mask in our experiments, but, contrary to our expectations, we did not find its performance superior to the other masks. 3D Point Spread Functions: While there are several techniques to extract depth from images, we briefly mention some recent work by Greengard et al. [18] because their optical design included and exploited diffraction effects. They investigated 3D point spread functions (PSF) whose transverse cross sections rotate as a result of diffraction, and showed that such PSFs yield an order of magnitude increase in the sensitivity with respect to depth variations. The main drawback, however, is that the depth range and resolution are limited due to the angular resolution of the reconstructed PSF. Depth-Invariant Blur: An alternative approach to coded imaging is wavefront coding. The key idea is to use aspheric lenses to render the lens point spread function (PSF) depth-invariant. Then, shift-invariant deblurring with a fixed known blur can be applied to sharpen the image [19,20]. However, while the results are quite promising, the PSF is not fully depth-invariant and artifacts are still present in the reconstructed image. Other techniques based on depth-invariant PSFs exploit the chromatic aberrations of lenses [7] or use diffusion [21]. However, in the first case, as the focal sweep is across the spectrum, the method is mostly designed for grayscale imaging. While the results shown in
these recent works are stunning, there are two inherent limitations: 1) depth is lost in the imaging process; 2) in general, as methods based on focal sweep are not exactly depth-invariant, the deblurring performance decays for objects that are too close or too far away from the camera. Multiple Viewpoint: The extension of the depth of field can also be achieved by using multiple images and/or multiple viewpoints. One technique is to obtain multiple viewpoints by capturing multiple coded images [8,13,22] or by capturing a single image by using a plenoptic camera [9,6,23,24]. These methods, however, exploit multiple images or require a more costly optical design (e.g., a calibrated microlens array). Motion Deblurring and Blind Deconvolution: This work also relates to work in blind deconvolution, and in particular on motion deblurring. There has been quite steady progress in uniform motion deblurring [25,26,27,28,29] thanks to the modeling and exploitation of texture statistics. Although these methods deal with an unknown and general blur pattern, they assume that blur is not changing across the image domain. More recently, the space-varying case has been studied [30,31,32], albeit with some restrictions on the type of motion or the scene depth structure. Blurred Face Recognition: Work in the recognition of blurred faces [33] is also related to our method. Their approach extracts features from motion-blurred images of faces and then uses the subspace distance to identify the blur. In contrast, our method can be applied to space-varying blur and our analysis provides a novel method to evaluate (and design) masks.
2 Single Image Blind Deconvolution
Blind deconvolution from a single image is a very challenging problem: we need to recover more unknowns than there are observations. This challenge will be illustrated in the next section, where we present the image formation model of a blurred image obtained from a coded aperture camera. To make the problem feasible and well-behaved, one can introduce additional constraints on the solution. In particular, we constrain the higher-order statistics of sharp texture (sec. 2.2) and impose that the blur scale be piecewise smooth across the image pixels (sec. 2.3).
2.1 Image Model
In the simplest instance, a blurred image of a plane facing the camera can be described via the convolution of a sharp image with the blur kernel. However, the convolutional model breaks down with more general surfaces and, in particular, at occlusion boundaries. In this case, one can describe a blurred image with a linear model. For the sake of notational simplicity, we write images as column vectors, where all pixels are sorted in lexicographical order. Thus, a blurred
image with N pixels is a column vector $g \in \mathbb{R}^N$. Similarly, a sharp image with M pixels is a column vector $f \in \mathbb{R}^M$. Then, g satisfies

g = H_d f,    (1)
where the N × M matrix $H_d$ represents the coded blur. d is a column vector with M pixels and collects the blur scale corresponding to each pixel of f. The i-th column of $H_d$ is an image, rearranged as a vector, of the coded blur with scale $d_i$ generated by the i-th pixel of f. Notice that this model is indeed a generalization of the convolutional case: in the convolutional model, $H_d$ reduces to a Toeplitz matrix. Our task is to recover the unknown sharp image f given the blurred image g. To achieve this goal it is necessary to recover the blur scale d at each pixel. The theory of linear algebra tells us that if N = M, the equations in eq. (1) are not linearly dependent, and we are given both g and $H_d$, then we can recover the sharp image f. However, in our case we are not given the matrix $H_d$ and the blurred image g is affected by noise. This introduces two challenges: first, to obtain $H_d$ we need to retrieve the blur scale d; second, because of noise in g and of the ill-conditioning of the linear system in eq. (1), the estimation of f might be unstable. The first challenge implies that we do not have a unique solution. The second challenge implies that even if the solution were unique, its estimation would not be reliable. However, not all is lost. It is possible to add more equations to eq. (1) until a unique, reliable solution can be obtained. This technique is based on observing that, typically, one expects the unknown sharp image and blur scale map to have some regularity. For instance, both sharp textures and blur scale maps are not likely to look like noise. In the next two sections we will present and illustrate our sharp image and blur scale priors.
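For intuition, the convolutional special case mentioned above can be written out explicitly: when the blur is the same at every pixel, H_d is a Toeplitz (convolution) matrix whose columns all contain the same shifted kernel. The sketch below builds such a matrix for a 1D signal with an illustrative box kernel; in the space-varying coded case each column would instead hold the coded pattern at its own scale d_i.

```python
import numpy as np

def convolution_matrix(kernel, M):
    """Toeplitz matrix H such that H @ f equals the 'same'-size convolution of a
    1D signal f (length M) with a fixed blur kernel of odd length."""
    K = len(kernel)
    H = np.zeros((M, M))
    for i in range(M):                       # i-th column: kernel centered at pixel i
        for k in range(K):
            row = i + k - K // 2
            if 0 <= row < M:
                H[row, i] = kernel[k]
    return H

# toy usage: blur a random 1D signal with a normalized box kernel
f = np.random.rand(32)
kernel = np.ones(5) / 5.0
g = convolution_matrix(kernel, 32) @ f
assert np.allclose(g, np.convolve(f, kernel, mode="same"))
```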
2.2 Sharp Image Prior
Images of the real world exhibit statistical regularities that have been studied intensively in the past 20 years and have been linked to the human visual system and its evolution [34]. For the purpose of image deblurring, the most important aspect of this study is that natural images form a much smaller subset of all possible images. In general, the characterization of the statistical properties of natural images is done by applying a given transform, typically related to a component of human vision. Among the most common statistics used in image processing are the second order statistics, i.e., relations between pairs of pixels. For instance, this category includes the distributions of image gradients [35,36]. However, a more accurate account of the image structure can be captured with high-order statistics, i.e., relations between several pixels. In this work, we consider this general case, but restrict the relations to linear ones of the form

\Sigma f \approx 0    (2)
where Σ is a rectangular matrix. Eq. (2) implies that all sharp images live approximately on a subspace. Despite their crude simplicity, these linear constraints allow for some flexibility. For example, the case of second-order statistics
results in rows of Σ with only two nonzero values. Also, by designing Σ one can selectively apply the constraints only to some of the pixels. Another example is to choose each row of Σ as a Haar feature applied to some pixels. Notice that in our approach we do not make any of these choices. Rather, we estimate Σ directly from natural images. Natural image statistics, such as gradients, typically exhibit a peaked distribution. However, performing inference on such distributions results in minimizations of non-convex functionals for which we do not have provably optimal algorithms. Furthermore, we are interested in simplifying the optimization task as much as possible to gain in computational efficiency. This has led us to enforce the linear relation above by minimizing the convex cost

\|\Sigma f\|_2^2.    (3)
As we do not have an analytical expression for Σ that satisfies eq. (2), we need to learn it directly from the data. We will see later that this step is necessary only when performing the deconvolution step given the estimated blur. Instead, when estimating the blur scale, our method allows us to use Σ implicitly, i.e., without ever recovering it.
2.3 Blur Scale Prior
The statistics of range images can be characterized with an approach similar to that for optical images [37]. The study in [37] verified the random collage model, i.e., that a scene is a collection of piecewise constant surfaces. This has been observed in the distributions of Haar filter responses on the logarithm of the range data, which showed strong cusps in the isoprobability contours. Unfortunately, a prior following these distributions faithfully would result in non-convex energy minimization. A practical convex solution to enforce the piecewise constant model is to use total variation [38]. Common choices are the isotropic and anisotropic total variation. In our algorithm we have implemented the latter: we minimize $\|\nabla d\|_1$, i.e., the sum of the absolute values of the components of the gradient of d.
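As a small illustration, the anisotropic total variation of a discrete blur scale map is just the sum of absolute forward differences along the two image axes (a sketch with a hypothetical function name):

```python
import numpy as np

def anisotropic_tv(d):
    """Anisotropic TV of a 2D blur scale map d: sum of |d(x+1,y)-d(x,y)| and
    |d(x,y+1)-d(x,y)| over the whole image."""
    return np.abs(np.diff(d, axis=0)).sum() + np.abs(np.diff(d, axis=1)).sum()
```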
3 Blur Scale Identification and Image Deblurring
We can combine the image model introduced in sec. 2.1 with the priors in sec. 2.2 and 2.3 and formulate the following energy minimization problem:

\hat{d}, \hat{f} = \arg\min_{d,f} \|g - H_d f\|_2^2 + \alpha\|\Sigma f\|_2^2 + \beta\|\nabla d\|_1,    (4)
where the parameters α, β > 0 determine the amount of regularization for texture and blur scale respectively. Notice that the formulation above is common to many approaches including, in particular, [5]. Our approach, however, in addition to using a more accurate blur matrix Hd , considers different priors and a different depth identification procedure.
Our next step is to notice that, given d, the proposed cost is simply a least-squares problem in the unknown sharp texture f. Hence, it is possible to compute f in closed form and plug it back into the cost functional. The result is a much simpler problem to solve. We summarize all the steps in the following theorem:

Theorem 1. The set of extrema of the minimization (4) coincides with the set of extrema of the minimization

\hat{d} = \arg\min_d \|H_d^\perp g\|_2^2 + \beta\|\nabla d\|_1
\hat{f} = \left(\alpha\Sigma^T\Sigma + H_{\hat{d}}^T H_{\hat{d}}\right)^{-1} H_{\hat{d}}^T g    (5)

where $H_d^\perp = I - H_d\left(\alpha\Sigma^T\Sigma + H_d^T H_d\right)^{-1} H_d^T$, and I is the identity matrix.

Proof. See Appendix.

Notice that the new formulation requires the definition of a square and symmetric matrix $H_d^\perp$. This matrix depends on the parameter α and the prior matrix Σ, both of which are unknown. However, for the purpose of estimating the unknown blur scale map d, it is possible to bypass the estimation of α and Σ by learning the matrix $H_d^\perp$ directly from data.
3.1 Learning Procedure and Blur Scale Identification
We break down the complexity of solving eq. (5) by using local blur uniformity, i.e., by assuming that blur is constant within a small region of pixels. Then, we further simplify the problem by considering only a finite set of L blur sizes $d_1, \ldots, d_L$. In practice, we find that both assumptions work well. The local blur uniformity holds reasonably well except at occluding boundaries, which form a small subset of the image domain. At occluding boundaries the solution tends to favor small blur estimates. We also found experimentally that the discretization is not a limiting factor in our method. The number of blur sizes L can be set to a value that matches the level of accuracy of the method without reaching a prohibitive computational load. Now, by combining the assumptions we find that eq. (5) at one pixel x,

\hat{d}(x) = \arg\min_{d(x)} \|H_d^\perp(x)\, g\|_2^2 + \beta\|\nabla d(x)\|_1,    (6)

can be approximated by

\hat{d}(x) = \arg\min_{d(x)} \|H_{d(x)}^\perp\, g_x\|_2^2    (7)
where $g_x$ is a column vector of $\delta^2$ pixels extracted from a δ × δ patch centered at the pixel x of g. Experimentally, we find that the size δ of the patch should not be smaller than the maximum scale of the coded blur in the captured image
g. $H_{d(x)}^\perp$ is a $\delta^2 \times \delta^2$ matrix that depends on the blur size $d(x) \in \{d_1, \ldots, d_L\}$. Thus, we assume that $H_d^\perp(x, y) \approx 0$ for y such that $\|y - x\|_1 > \delta/2$. Notice that the term $\beta\|\nabla d\|_1$ drops because of the local blur uniformity assumption.

The next step is to explicitly compute $H_{d(x)}^\perp$. Since the blur size d(x) is one of L values, we only need to compute the matrices $H_{d_1}^\perp, \ldots, H_{d_L}^\perp$. As each $H_{d_i}^\perp$ depends on α and the local Σ, we propose to learn each $H_{d_i}^\perp$ directly from data. Suppose that we are given a set of T column vectors $g_{x_1}, \ldots, g_{x_T}$ extracted from blurry images of a plane parallel to the camera image plane. The column vectors will all share the same blur scale $d_i$. Hence, we can rewrite the cost functional in eq. (7) for all x as

\|H_{d_i}^\perp G_i\|_2^2    (8)

where $G_i \doteq [g_{x_1} \cdots g_{x_T}]$. By definition of $G_i$, $\|H_{d_i}^\perp G_i\|_2^2 \approx 0$. Hence, we find that $H_{d_i}^\perp$ can be computed via the singular value decomposition of $G_i = U_i S_i V_i^T$: if $U_i = [U_{d_i}\; Q_{d_i}]$, where $Q_{d_i}$ corresponds to the singular values of $S_i$ that are zero (or negligible), then $H_{d_i}^\perp = Q_{d_i} Q_{d_i}^T$. The procedure is then repeated for each blur scale $d_i$ with $i = 1, \ldots, L$. Next, we can use the estimated matrices $H_{d_1}^\perp, \ldots, H_{d_L}^\perp$ on a new image g and optimize with respect to d:

\hat{d} = \arg\min_d \sum_x \|H_{d(x)}^\perp g_x\|_2^2 + \beta\|\nabla d(x)\|_1.    (9)
The first term represents unitary terms, i.e., terms that are defined on single pixels; the second term represents binary terms, i.e., terms that are defined on pairs of pixels. The minimization problem (9) can then be solved efficiently via graph cuts [39]. Notice that the procedure above can be applied to other surfaces as well, so that instead of a collection of parallel planes, one can consider, for example, a collection of quadratic surfaces. Also, notice that there are no restrictions on the size of a patch. In particular, the same procedure can be applied to a patch of the size of the input image. In our experiments for depth estimation, however, we consider only small patches and parallel planes as local surfaces.
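The offline part of this procedure is essentially one SVD per blur scale. The following sketch assumes the training patches have already been extracted and stacked per scale; names and the rank threshold are illustrative, not from the authors' code.

```python
import numpy as np

def learn_blur_subspaces(patches_per_scale, rank_tol=1e-3):
    """For every blur scale d_i, stack training patches (delta^2-dimensional
    column vectors) into G_i, compute its SVD, and split U_i = [U_{d_i} Q_{d_i}]:
    U_{d_i} spans the coded-patch subspace, and H_{d_i}^perp = Q_{d_i} Q_{d_i}^T
    (nearly) annihilates patches blurred with scale d_i.
    patches_per_scale: list of (delta^2 x T) arrays, one per blur scale."""
    subspaces, projectors = [], []
    for G in patches_per_scale:
        U, S, _ = np.linalg.svd(G, full_matrices=True)
        k = int((S > rank_tol * S[0]).sum())      # effective rank of the subspace
        subspaces.append(U[:, :k])                # U_{d_i}
        projectors.append(U[:, k:] @ U[:, k:].T)  # H_{d_i}^perp
    return subspaces, projectors
```

The projectors provide the unary costs of eq. (9), while the subspaces are reused in Section 4 for the convolution-based formulation.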
3.2 Image Deblurring
In the previous section we have devised a procedure to compute the blur scale d at each pixel. In this section we assume that d is given and devise a procedure to compute the image f. In principle, one could use the closed-form solution

f = \left(\alpha\Sigma^T\Sigma + H_{\hat{d}}^T H_{\hat{d}}\right)^{-1} H_{\hat{d}}^T g.    (10)

However, notice that computing this equation entails solving a large matrix inversion, which is not practical even for moderate image dimensions. A simpler approach is to solve the least-squares problem (4) in f via an iterative method. Therefore, we consider solving the problem

\hat{f} = \arg\min_f \|g - H_{\hat{d}} f\|_2^2 + \alpha\|\Sigma f\|_2^2    (11)
by using a least-squares conjugate gradient descent algorithm in f [40]. The main component of the iteration in f is the gradient $\nabla E_f$ of the cost (11) with respect to f:

\nabla E_f = \left(\alpha\Sigma^T\Sigma + H_{\hat{d}}^T H_{\hat{d}}\right) f - H_{\hat{d}}^T g.    (12)

The descent algorithm iterates until $\nabla E_f \approx 0$. Because of the convexity of the cost functional with respect to f, the solution is also a global minimum. To compute Σ we use a database of sharp images $F = [f_1 \cdots f_T]$, where $\{f_i\}_{i=1,\ldots,T}$ are sharp images rearranged as column vectors, and compute the singular value decomposition $F = U_F \Sigma_F V_F^T$. Then, we partition $U_F = [U_{F,1}\; U_{F,2}]$ such that $U_{F,2}$ corresponds to the smallest singular values of $\Sigma_F$. The high-order prior is defined as $\Sigma \doteq U_{F,2} U_{F,2}^T$, such that we have $\Sigma f_i \approx 0$. The regularization parameter α is instead manually tuned. The matrix $H_{\hat{d}}$ is computed as described in Section 2.1.
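Both ingredients of this section can be sketched compactly: Σ is learned from a matrix of sharp training patches by SVD, and the quadratic cost (11) is minimized with a basic conjugate gradient iteration whose residual is exactly the negative gradient of eq. (12). The dense matrices and the fraction `keep` are illustrative simplifications; the paper's solver is a least-squares conjugate gradient method [40], which would typically be implemented matrix-free.

```python
import numpy as np

def learn_sigma(F, keep=0.9):
    """Build the high-order prior Sigma = U_{F,2} U_{F,2}^T from the left singular
    vectors of F (columns = sharp patches) with the smallest singular values."""
    U, S, _ = np.linalg.svd(F, full_matrices=False)
    k = int(keep * len(S))
    U2 = U[:, k:]                   # directions where sharp patches have ~no energy
    return U2 @ U2.T

def deblur_cg(g, H, Sigma, alpha=0.1, iters=100, tol=1e-8):
    """Minimize eq. (11) via conjugate gradients on A f = b, with
    A = alpha*Sigma^T Sigma + H^T H and b = H^T g (so r = -grad E_f, eq. (12))."""
    A = alpha * Sigma.T @ Sigma + H.T @ H
    b = H.T @ g
    f = np.zeros(H.shape[1])
    r = b - A @ f
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        a = rs / (p @ Ap)
        f += a * p
        r -= a * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return f
```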
4 A Geometric Viewpoint on Blur Scale Identification
In the previous sections we have seen that the blur scale at each pixel can be obtained by minimizing eq. (9): we search among the matrices $H_{d_1}^\perp, \ldots, H_{d_L}^\perp$ for the one that yields the minimum $\ell_2$ norm when applied to the vector $g_x$. We show that this has a geometrical interpretation: each matrix $H_{d_i}^\perp$ defines a subspace, and $\|H_{d_i}^\perp g_x\|_2^2$ is the distance of the vector $g_x$ from that subspace. Recall that $H_{d_i}^\perp = Q_{d_i} Q_{d_i}^T$ and that $U_i = [U_{d_i}\; Q_{d_i}]$ is an orthonormal matrix. Then, we obtain that

\|H_{d_i}^\perp g_x\|_2^2 = \|Q_{d_i} Q_{d_i}^T g_x\|_2^2 = \|Q_{d_i}^T g_x\|_2^2 = \|g_x\|_2^2 - \|U_{d_i}^T g_x\|_2^2.

If we now divide by the scalar $\|g_x\|_2^2$, we obtain exactly the square of the subspace distance [41]

M(g, U_{d_i}) = \sqrt{\,1 - \sum_{j=1}^{K}\left( U_{d_i,j}^T \frac{g}{\|g\|} \right)^{2} }    (13)

where K is the rank of the subspace $U_{d_i}$, $U_{d_i} = [U_{d_i,1} \ldots U_{d_i,K}]$, and $U_{d_i,j}$, $j = 1, \ldots, K$, are orthonormal vectors.

The geometrical interpretation brings a fresh look to image blurring and deblurring. Consider the image model (1). Let us take the singular value decomposition of the blur matrix $H_d$,

H_d = U_d S_d V_d^T,    (14)
where Sd is a diagonal matrix with positive entries, and both Ud and Vd are orthonormal matrices. Formally, the vector f undergoes a rotation (VdT ), then a scaling (Sd ), and then again another rotation (Ud ). This means that if f lives in a subspace, the initial subspace is mapped to another rotated subspace, possibly of smaller dimension (see Fig. 2, middle). Notice that as we change the blur scale, the rotations and scaling are also changing and may result in yet a different subspace (see Fig. 2, right).
Fig. 2. Coded image subspaces. (a) Image patches on a subspace. (b) Subspace containing images blurred with $H_{d_1}$; blurring has the effect of rotating and possibly reducing the dimensionality of the original subspace. (c) Subspace containing images blurred with $H_{d_2}$.
It is important to understand that rotations of the vector f can result in blurring. To clarify this, consider blurred and sharp images with only 3 pixels (we cannot visualize the case of more than 3 pixels), i.e., $g_1 = [g_{1,x}\; g_{1,y}\; g_{1,z}]^T$ and $f_1 = [f_{1,x}\; f_{1,y}\; f_{1,z}]^T$. Then, we can plot the vectors $g_1$ and $f_1$ as 3D points (see Fig. 2). Let $\|g_1\| = 1$ and $\|f_1\| = 1$. Then, we can rotate $f_1$ about the origin and overlap it exactly on $g_1$. In this case rotation corresponded to blurring. The opposite is also true: we can rotate the vector $g_1$ onto the vector $f_1$ and thus perform deblurring. Furthermore, notice that in this simple example the most blurred images are vectors with identical entries. Such blurred images lie along the diagonal direction $[1\; 1\; 1]^T$. In general, blurry images tend to have entries with similar values and hence tend to cluster around the diagonal direction. Our ability to discriminate between different blur scales in a blurry image boils down to being able to determine the subspaces where the patches of such a blurry image live. If sharp images do not live on a subspace, but uniformly in the entire space, our only way to distinguish the blur size is that the blurring $H_d$ scales some dimensions of f to zero and that the scaling varies with blur size. This case has links to the zero-sheet approach in the Fourier domain [42]. However, if the sharp images live on a subspace, the blurring $H_d$ may preserve all the directions and blur scale identification is still possible by determining the rotation of the sharp image subspace. This is the principle that we exploit.
Input: A single coded image g and a collection of coded images of L planar scenes.
Output: The blur scale map d of the scene.

Preprocessing (offline)
  Pick an image patch size larger than twice the maximum blur scale;
  for i = 1, . . . , L do
    Compute the singular value decomposition $U_i S_i V_i^T$ of a collection of image patches coded with blur scale $d_i$;
    Calculate the subspace $U_{d_i}$ as the columns of $U_i$ corresponding to nonzero singular values of $S_i$;
  end

Blur identification (online)
  Solve $\hat{d} = \arg\min_{d \in \{d_1,\cdots,d_L\}} \sum_x M^2(g_x, U_{d(x)}) + \frac{\beta}{\|g_x\|_2^2}\,\|\nabla d(x)\|_1$.

Algorithm 1. Blur scale identification from a single coded image via the subspace distance method.
Notice that the evaluation of the subspace distance M involves the calculation of the inner product between a patch and a column of $U_{d_i}$. Hence, this calculation can be done exactly as the convolution of a column of $U_{d_i}$, rearranged as an image patch, with the whole image g. We can conclude that the algorithm requires computing a set of L × K convolutions with the coded image, which is a stable operation of polynomial computational complexity. As we have shown that minimizing eq. (13) is equivalent to minimizing $\|H_{d_i}^\perp g_x\|_2^2$ up to a scalar value, we summarize the blur scale identification procedure in Algorithm 1.
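A sketch of this online step is given below: each column of U_{d_i}, reshaped to a δ×δ patch, is correlated with the image once, and the per-pixel squared subspace distance is assembled from the responses. The smoothness term of Algorithm 1 is omitted (in the paper it is handled by graph cuts), and the input `subspaces` is assumed to come from the offline learning sketch above.

```python
import numpy as np
from scipy.ndimage import convolve

def identify_blur_scales(g, subspaces, eps=1e-8):
    """Per-pixel squared subspace distance M^2(g_x, U_{d_i}) for every blur scale,
    computed with one convolution per subspace column, followed by an argmin.
    subspaces: list of (delta^2 x K_i) orthonormal matrices U_{d_i}."""
    delta = int(round(np.sqrt(subspaces[0].shape[0])))
    g2 = convolve(g ** 2, np.ones((delta, delta))) + eps     # ||g_x||^2 per pixel
    costs = []
    for U in subspaces:
        proj = np.zeros_like(g)
        for j in range(U.shape[1]):
            filt = U[:, j].reshape(delta, delta)[::-1, ::-1] # flip: correlation
            proj += convolve(g, filt) ** 2                   # (U_{d_i,j}^T g_x)^2
        costs.append(1.0 - proj / g2)                        # squared distance
    return np.argmin(np.stack(costs), axis=0)                # index of best d_i
```

The resulting per-pixel costs could also be passed as unary terms to the graph cut of eq. (9).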
4.1 Coded Aperture Selection
In this section we discuss how to obtain an optimal pattern for the purpose of image deblurring. As pointed out in [19], we identify two main challenges: the first one is that accurate deblurring requires accurate identification of the blur scale; the second one is that accurate deblurring requires little texture loss due to blurring. A first step towards addressing these challenges is to define a metric for blur scale identification and a metric for texture loss. Our metric for blur scale identification can be defined directly from section 4. Indeed, the ability to determine which subspace a coded image patch belongs to can be measured via the distance between the subspaces associated to each blur scale,

\bar{M}(U_{d_1}, U_{d_2}) = \sqrt{\, K - \sum_{i,j} \left( U_{d_1,i}^T\, U_{d_2,j} \right)^{2} }.    (15)
Clearly, the wider apart all the subspaces are, the less prone to noise the subspace association is. We find that a good visual summary of the "spacing" between all the subspaces is a (symmetric) matrix with distances between any
[Fig. 3 panels: (a) Ideal distance matrix; (b) Circular aperture. Axes are indexed by blur scale d (e.g., d10, d20, d25); example entries: $\bar{M}(U_{d_{10}}, U_{d_{20}}) = \sqrt{K}$ and $\bar{M}(U_{d_{25}}, U_{d_{25}}) = 0$.]
Fig. 3. Distance matrix computation. The top-left corner of each matrix is the distance between subspaces corresponding to small blur scales, and, vice versa, the bottom-right corner is the distance between subspaces corresponding to large blur scales. Notice that large subspace distances are bright and small subspace distances are dark. The maximum distance (√K) is achievable when two subspaces are orthogonal to each other.
two subspaces. We compute such a matrix for a conventional camera and show the results in Fig. 3, together with the ideal distance matrix. In each distance matrix, subspaces associated to blur scales ranging from the smallest to the largest ones are arranged along the rows from left to right and along the columns from top to bottom. Along the diagonal the distance is necessarily 0, as we compare identical subspaces. Also, by definition the metric cannot exceed √K, where K is the minimum rank among the subspaces. In Fig. 5 we report the distance matrices computed for each of the apertures we consider in this work (see Fig. 4). Notice that the subspace distance map for a conventional camera (Fig. 3(b)) is overall darker than the matrices for coded aperture cameras (Fig. 5). This shows the poor blur scale identifiability of the circular aperture and the improvement that can be achieved when using a more elaborate pattern. The rank K can be used to address the second challenge, i.e., the definition of a metric for texture loss. So far we have seen that blurring can be interpreted as a combination of rotations and scaling. Deblurring can then be interpreted as a combination of rotations and scaling in the opposite direction. However, when blurring scales some directions to 0, part of the texture content has been lost. This suggests that a simple measure for texture loss is the dimension of the coded subspace: the higher the dimension, the more texture content we can restore. As the (coded image) subspace dimension is K, we can immediately conclude that the subspace distance matrix that most closely resembles the ideal distance matrix (see Fig. 3(a)) is the one that simultaneously achieves the best depth identification and the least texture loss. Finally, we propose to use the average L1 fitting of any distance matrix to the ideal distance matrix scaled by √K, i.e., $|\sqrt{K}(\mathbf{1}\mathbf{1}^T - I) - \bar{M}|$. The fitting yields the values in Table 1. We can also see visually in Fig. 5 that mask 4(b) and mask 4(d) are the coded apertures that we can expect to achieve the best results in texture deblurring.
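The mask-evaluation metric (15) is straightforward to compute once the per-scale subspaces are available; the sketch below builds the full pairwise distance matrix visualized in Figs. 3 and 5, taking K as the minimum subspace rank and clamping at zero for numerical safety (an implementation choice of this sketch, not a detail from the paper).

```python
import numpy as np

def subspace_distance_matrix(subspaces):
    """Pairwise distances of eq. (15) between the subspaces of all blur scales."""
    L = len(subspaces)
    K = min(U.shape[1] for U in subspaces)        # minimum rank among subspaces
    D = np.zeros((L, L))
    for a in range(L):
        for b in range(L):
            C = subspaces[a].T @ subspaces[b]     # inner products U_{d_a,i}^T U_{d_b,j}
            D[a, b] = np.sqrt(max(K - (C ** 2).sum(), 0.0))
    return D
```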
Fig. 4. Coded aperture patterns and PSFs. All the aperture patterns we consider in this work (top row) and their calibrated PSFs for two different blur scales (second and bottom row). (a) and (b) aperture masks used in both [13] and [43]; (c) annular mask used in [17]; (d) pattern proposed by [5]; (e) pattern proposed by [4]; (f) and (g) aperture masks used in [15]; (h) MURA pattern used in [10].
Fig. 5. Subspace distances for the eight masks in Fig. 4. Notice that the subspace rank K determines the maximum distance achievable, and therefore, coded apertures with overall darker subspace distance maps have poor blur scale identifiability (i.e., sensitive to noise).
The quest for the optimal mask is, however, still an open problem. Even if we look for the optimal mask via brute-force search, a single aperture pattern requires the evaluation of eq. (15) and the computation of all the subspaces associated to each blur scale. In particular, the latter process requires about 15 minutes on a QuadCore 2.8GHz with Matlab 7, which makes the evaluation of a large number of masks unfeasible. Devising a fast procedure to determine the optimal mask will be the subject of future work.

Table 1. L1 fitting of each distance matrix to the ideal distance matrix scaled by √K

Masks:      4(a)   4(b)   4(c)   4(d)   4(e)   4(f)    4(g)   4(h)
L1 fitting: 8.24   6.62   8.21   5.63   8.37   16.96   8.17   16.13
5 Experiments
In this section we demonstrate the effectiveness of our approach on both synthetic and real data. We show that the proposed algorithm performs better than previous methods on different coded apertures and different datasets. We also show that the masks proposed in the literature do not always yield the best performance.
5.1 Performance Comparison
Before proceeding with tests on real images, we perform extensive simulations to compare the accuracy and robustness of our algorithm with 4 competing methods, including the current state-of-the-art approach. The methods are all based on the hypothesis plane deconvolution used by [5], as explained in the Introduction. The main difference among the competing methods is that the deconvolution step is performed either using the Lucy-Richardson method [44], or regularized filtering (i.e., with image gradient smoothness), or Wiener filtering [45], or Levin's procedure [5]. We use the 8 masks shown in Fig. 4. All the patterns have been proposed and used by other researchers [4,5,10,13,15,17]. For each mask and a given blur scale map d, we simulate a coded image by using eq. (1), where f is an image of 4,875 × 125 pixels with either random texture or a set of patches from natural images (examples of these patches are shown in Fig. 6). Then, for each algorithm we obtain a blur scale map estimate $\hat{d}$ and compute its discrepancy with the ground truth. The ground-truth blur scale map d that we use is shown in pseudo-colors at the top-left of both Fig. 7 and Fig. 8; it represents a staircase composed of 39 steps at different distances (and thus different blur scales) from the camera. We assume that the focal plane is set between the camera and the first object of interest in the scene. With this setting, the bottom part of the blur scale map (small blur sizes) corresponds to points close to the camera, and the top part (large blur sizes) to points far from the camera. Each step of the staircase is a square of 125 × 125 pixels; we have squeezed the actual illustration along the vertical axis to fit in the paper. The size of the blur ranges from 7 to 30 pixels. Notice that in measuring the errors we consider all pixels, including those at the blur scale discontinuities given by the difference of blur scale between neighboring steps. In Fig. 7 we show, for each mask in Fig. 4, the results of the proposed method (right) together with the results obtained by
Fig. 6. Real texture. Some of the patches extracted from real images that have been used in our tests. The same patches are shown with no noise (top part, image noise level σ = 0) and with Gaussian noise added to them (bottom part, image noise level σ = 0.002).
the current state-of-the-art algorithm (left) on random texture. The same procedure, but with texture from natural images, is reported in Fig. 8. For the three best performing masks (mask 4(a), mask 4(b), and mask 4(d)), we report the results with the same graphical layout in Fig. 9, in order to better appreciate the improvement of our method over previous ones, especially for large blur scales. Every plot shows, for each of the 39 steps we consider, the mean and 3 times the standard deviation of the estimated blur scale values (ordinate axis) against the true blur scale level (abscissa axis). The ideal estimate is the diagonal line where each estimated level corresponds to the correct true blur scale level. If there is no bias in the estimation of the blur scale map, the ideal estimate should lie within 3 times the standard deviation about the mean with probability close to 1. Our method performs consistently well with all the masks and at different blur scale levels. In particular, the best performances are observed for mask 4(b) (Fig. 9(b)) and mask 4(d) (Fig. 9(c)), while the performance of competing methods degenerates rapidly with increasing blur scales. This demonstrates that our method has the potential for restoring objects at a wider range of blur scales and with higher accuracy than previous algorithms. A quantitative comparison among all the methods and masks is given in Table 2 and Table 4 (for random texture) and in Table 3 and Table 5 (for real texture). In each table, the left half reports the average error of the blur scale estimate (measured as ||d − d̂||_1, where d and d̂ are the ground-truth and the estimated blur scale map, respectively); the right half reports the error on the reconstructed sharp image f̂, measured as ||f − f̂||_2^2 + ||∇f − ∇f̂||_2^2, where f is the ground-truth image. The gradient term is added to improve sensitivity to artifacts in the reconstruction.
Fig. 7. Blur scale estimation - random texture. GT: Ground-truth blur scale map. (a-h) Estimated blur scale maps for all the eight masks we consider in the paper. For each mask, the figure reports the blur scale map estimated with both Levin et al.’s method (left) and our method (right).
As one can see from Tables 2–5, several levels of noise have been considered in the performance comparison: σ = 0 (Table 2 and Table 3), σ = 0.001, σ = 0.002, and σ = 0.005 (Table 4 and Table 5). The noise level is, however, adjusted to accommodate the difference in overall incoming light between the masks; i.e., if mask i has an incoming light of l_i,¹ the noise level for that mask is given by

    σ_i = (1 / l_i) σ.    (16)

Thus, masks such as 4(f), 4(g) and 4(h) are subject to lower noise levels than masks such as 4(a) and 4(b). Our method produces more consistent and accurate blur scale maps than previous methods for both random texture and natural images, and across the 8 masks it has been tested with.

¹ The value of l_i represents the fraction of the lens aperture that is open: when the lens aperture is totally open, l_i = 1; when the mask completely blocks the light, l_i = 0.

5.2 Results on Real Data
We now apply the proposed blur scale estimation algorithm to coded aperture images captured by inserting the selected mask into a Canon 50 mm f/1.4 lens mounted on a Canon EOS-5D DSLR, as described in [5,15].
Fig. 8. Blur scale estimation - real texture. GT: Ground-truth blur scale map. (a-h) Estimated blur scale maps for all the eight masks we consider in the paper. For each mask, the figure reports the blur scale map estimated with both Levin et al.’s method (left) and our method (right).
Based on the analysis in Section 4.1, we choose mask 4(b) and mask 4(d). Each of the 4 holes in the first mask is 3.5 mm wide, which corresponds to the same overall open area as a conventional (circular) aperture with diameter 7.9 mm (f/6.3 on a 50 mm lens). All indoor images have been captured by setting the shutter speed to 30 ms (ISO 320–500), while outdoors the exposure has been set to 2 ms or lower (ISO 100). First, we need to collect (or synthesize) a sequence of L coded images, where L is the number of blur scale levels we want to distinguish. There are two techniques to acquire these coded images: (1) If the aim is just to estimate the depth map (or blur scale map), one can capture real coded images of a planar surface with sharp natural texture (e.g., a newspaper) at different blur scale levels. (2) If the goal is to reconstruct both the depth map and the all-in-focus image, one has to capture the PSF of the camera at each depth level, by projecting a grid of bright dots on a plane and using a long exposure; coded images are then simulated by applying the measured PSFs to sharp natural images collected from the web. In the experiments presented in this paper, we use the latter approach since we estimate both the blur scale map and the all-in-focus image. The PSFs have been captured on a plane at 40 different depths between 60 cm and 140 cm from the camera. The focal plane of the camera was set at 150 cm.
[Fig. 9 plots: estimated blur scale (ordinate) vs. true blur scale (abscissa) for Lucy-Richardson, Levin, and our method; panels (a) Mask 4(a), (b) Mask 4(b), (c) Mask 4(d).]
Fig. 9. Comparison of the estimated blur scale levels obtained from the 3 best methods using both random (top) and real (bottom) texture. Each graph reports the performance of the algorithms with (a) mask 4(a), (b) mask 4(b), and (c) mask 4(d). Both the mean and the standard deviation (in the graphs, we show three times the computed standard deviation) of the estimated blur scale are shown as error bars, with the algorithms' performance (solid lines) plotted over the ideal characteristic curve (diagonal dashed line) for 39 blur sizes. Notice how the performance changes dramatically based on the nature of the texture (top row vs. bottom row). Moreover, in the case of real images the standard deviation of the estimates obtained with our method is more uniform for mask 4(b) than for mask 4(d). In the case of mask 4(d) the performance is reasonably accurate only with small blur scales.
In the first experiment, we show the advantage of our approach over Levin et al.'s method on a scene with blur sizes similar to the ones used in the performance test. The same dataset has been captured by using mask 4(b) (see Fig. 11) and mask 4(d) (see Fig. 12). The size of the blur, especially in the background, is very large; this can be appreciated in Fig. 10(a), which shows the same scenario captured with the same camera settings, but without the mask on the lens. For a fair comparison, we do not apply any regularization or user intervention to the estimated blur scale maps. As already seen in Section 5.1 (especially in Fig. 9), Levin et al.'s method yields an accurate blur scale estimate with mask 4(d) when the size of the blur is small, but it fails with large amounts of blur. The proposed approach overcomes this limitation and yields a deblurred image that, in both cases (Fig. 11(e) and Fig. 12(e)), is closer to the ground truth (Fig. 10(b)). Notice also that our method gives an accurate reconstruction of the blur scale even without using regularization (β = 0 in eq. (9)). Some artifacts are still present in the reconstructed all-in-focus images. These are mainly due to the very large size of the blur and to the raw blur-scale map: when regularization is added to the blur-scale map (β > 0), the deblurring algorithm yields better results, as one can see in the next examples.
Table 2. Random texture. Performance (mean error) of 5 algorithms in blur scale estimation and image deblurring for the apertures in Fig. 4, assuming there is no noise (image noise level σ = 0).

Blur scale estimation
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        16.8  14.4  17.2   2.9  17.0  18.1  17.8  15.4
Regularized filtering  18.4  17.2  18.6   6.8  16.7  12.3  18.8  13.4
Wiener filtering        8.8  13.8  14.4  16.6  16.3  15.3  14.1  15.3
Levin et al. [5]       16.7  13.7  16.7   1.4  16.6  16.8  17.6  13.3
Our method              1.2   0.9   3.7   0.9   4.2  10.3   3.8   9.6

Image deblurring
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        0.22  0.22  0.21  0.22  0.22  0.22  0.22  0.21
Regularized filtering  0.30  0.32  0.27  0.32  0.25  0.42  0.23  0.25
Wiener filtering       0.23  0.29  0.29  0.33  0.31  0.32  0.27  0.30
Levin et al. [5]       0.22  0.21  0.22  0.21  0.21  0.22  0.22  0.21
Our method             0.20  0.20  0.21  0.21  0.21  0.22  0.21  0.22
Table 3. Real texture. Performance (mean error) of 5 algorithms in blur scale estimation and image deblurring for the apertures in Fig. 4, assuming there is no noise (image noise level σ = 0).

Blur scale estimation
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        17.0  16.4  18.4  15.6  17.9  18.5  18.0  18.3
Regularized filtering  18.5  16.8  18.2   8.6  16.8  11.4  17.9  15.4
Wiener filtering       17.1  16.4  18.2  14.4  17.0  18.0  17.5  17.6
Levin et al. [5]       16.3  14.8  17.9   9.9  17.0  18.2  17.6  17.0
Our method              3.3   3.3   6.8   3.3   6.1  12.6   5.9  11.7

Image deblurring
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        0.22  0.20  0.22  0.18  0.20  0.20  0.20  0.20
Regularized filtering  0.51  0.49  0.52  1.08  0.28  0.67  0.28  0.40
Wiener filtering       0.25  0.22  0.26  0.21  0.21  0.24  0.23  0.21
Levin et al. [5]       0.25  0.21  0.23  0.19  0.20  0.21  0.21  0.20
Our method             0.18  0.16  0.21  0.16  0.17  0.21  0.19  0.21
In Fig. 13 we have the same indoor scenario, but now the items are slightly closer to the focal plane of the camera, so the maximum amount of blur is reduced. Although the background is still very blurred in the coded image (Fig. 13(a)), our accurate blur-scale estimation yields a deblurred image (Fig. 13(b)) in which the text of the magazine becomes readable. Since the reconstructed blur-scale map corresponds to the depth map (relative depth) of the scene, we can use it together with the all-in-focus image to generate a 3D image.² This image, when watched with red-cyan glasses, allows one to perceive the depth information extracted with our approach. All the regularized blur-scale maps in this work are estimated from eq. (9) by setting β = 0.5; the raw maps, instead, are obtained without the regularization term (β = 0). We have tested our approach on different outdoor scenes: Fig. 15 and Fig. 14. In these scenarios we apply the subspaces we have learned within 150 cm from the camera to a very large range of depths. Several challenges are present in these scenes, such as occlusions, shadows, and lack of texture. Our method demonstrates robustness to all of them. Notice again that the raw blur-scale maps shown in Fig. 15(c) and Fig. 14(c) are already very close to the maps that include regularization (Fig. 15(d) and Fig. 14(d), respectively).
² In this work, a 3D image corresponds to an image captured with a stereo camera, where one lens has a red filter and the second lens has a cyan filter. When one watches this type of image with red-cyan glasses, each eye sees only one view: the shift between the two views gives the perception of depth.
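For readers who want to reproduce this kind of figure, a simple red-cyan anaglyph can be synthesized from the all-in-focus image and the (normalized) blur-scale/depth map, as in the NumPy sketch below; this is a generic illustration and not the authors' exact procedure, and the disparity model, parameter names, and default values are our own assumptions.

import numpy as np

def red_cyan_anaglyph(all_in_focus, depth, max_shift=8):
    # Shift each row horizontally in proportion to depth to fake two viewpoints,
    # then take the red channel from one view and the green/blue (cyan) channels from the other.
    h, w, _ = all_in_focus.shape
    disparity = (depth / depth.max() * max_shift).astype(int)   # per-pixel horizontal shift
    xs = np.arange(w)
    left = np.empty_like(all_in_focus)
    right = np.empty_like(all_in_focus)
    for row in range(h):
        left[row] = all_in_focus[row, np.clip(xs + disparity[row], 0, w - 1)]
        right[row] = all_in_focus[row, np.clip(xs - disparity[row], 0, w - 1)]
    out = right.copy()
    out[..., 0] = left[..., 0]      # red from the left view, cyan (G, B) from the right view
    return out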
Table 4. Random texture. Performance (mean error) of 5 algorithms in blur scale estimation and image deblurring for the apertures in Fig. 4, under different levels of noise.

Masks – image noise level σ = 0.001
Blur scale estimation
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.5  17.1  18.2  11.7  16.6  16.2  18.3  17.3
Regularized filtering  19.0  17.5  19.0  14.3  16.8  18.3  18.9  15.6
Wiener filtering       15.7  16.7  16.8  17.5  17.2  17.6  16.8  17.0
Levin et al. [5]       18.4  16.3  18.1  11.0  16.7  17.3  18.3  17.5
Our method              9.6   8.7  12.7  10.1  12.5  13.2  12.9  13.9
Image deblurring
Lucy-Richardson        0.39  0.36  0.27  0.28  0.35  0.29  0.26  0.27
Regularized filtering  0.88  0.96  0.61  1.03  0.93  0.61  0.61  0.91
Wiener filtering       0.35  0.37  0.36  0.39  0.38  0.38  0.35  0.38
Levin et al. [5]       0.32  0.31  0.26  0.28  0.30  0.28  0.25  0.26
Our method             0.20  0.21  0.22  0.22  0.21  0.23  0.21  0.23

Masks – image noise level σ = 0.002
Blur scale estimation
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.5  17.1  18.2  12.1  16.6  16.3  18.3  17.3
Regularized filtering  18.9  17.4  18.8  12.7  16.7  16.9  18.9  16.9
Wiener filtering       15.5  16.4  16.7  17.3  17.1  17.5  16.8  17.0
Levin et al. [5]       18.5  16.9  18.0  12.1  16.7  17.6  18.4  17.7
Our method             11.3  11.1  13.2  11.3  12.6  13.5  12.8  14.0
Image deblurring
Lucy-Richardson        0.49  0.46  0.31  0.34  0.44  0.33  0.30  0.32
Regularized filtering  0.76  0.69  0.47  0.50  0.67  0.46  0.49  0.46
Wiener filtering       0.35  0.37  0.37  0.39  0.38  0.39  0.35  0.38
Levin et al. [5]       0.39  0.38  0.29  0.34  0.37  0.31  0.28  0.29
Our method             0.22  0.22  0.23  0.23  0.22  0.23  0.23  0.24

Masks – image noise level σ = 0.005
Blur scale estimation
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.4  17.0  18.2  12.6  16.5  16.6  18.4  17.3
Regularized filtering  18.9  17.4  18.8  13.1  16.6  17.1  18.8  16.9
Wiener filtering       15.4  16.2  16.5  17.3  17.2  17.3  16.7  17.0
Levin et al. [5]       18.5  16.9  18.0  12.5  16.7  17.7  18.4  17.7
Our method             12.8  12.6  13.4  12.0  12.8  13.5  13.5  14.0
Image deblurring
Lucy-Richardson        0.66  0.62  0.41  0.47  0.61  0.40  0.40  0.43
Regularized filtering  1.17  1.04  0.69  0.75  1.03  0.59  0.73  0.68
Wiener filtering       0.35  0.37  0.37  0.39  0.38  0.39  0.35  0.38
Levin et al. [5]       0.55  0.54  0.37  0.45  0.51  0.37  0.36  0.39
Our method             0.25  0.25  0.26  0.25  0.25  0.26  0.26  0.27
Table 5. Real texture. Performance (mean error) of 5 algorithms in blur scale estimation and image deblurring for the apertures in Fig. 4, under different levels of noise.

Masks – image noise level σ = 0.001
Blur scale estimation
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.5  17.2  18.3  13.7  16.8  17.8  18.4  18.1
Regularized filtering  19.0  17.5  19.0  14.0  16.8  17.6  19.0  15.6
Wiener filtering       13.8  14.5  14.1  14.6  15.2  14.4  14.8  14.5
Levin et al. [5]       18.4  16.8  18.1  10.6  16.7  17.0  18.2  17.8
Our method              8.7   7.8  11.8   7.7  11.9  13.5  11.5  13.8
Image deblurring
Lucy-Richardson        0.38  0.35  0.26  0.24  0.32  0.23  0.24  0.25
Regularized filtering  0.96  1.05  0.66  1.39  0.94  0.68  0.64  1.02
Wiener filtering       0.21  0.23  0.22  0.22  0.23  0.21  0.21  0.23
Levin et al. [5]       0.34  0.33  0.27  0.30  0.30  0.27  0.24  0.25
Our method             0.21  0.18  0.22  0.17  0.19  0.20  0.20  0.20

Masks – image noise level σ = 0.002
Blur scale estimation
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.5  17.2  18.3  13.2  16.7  17.5  18.4  17.9
Regularized filtering  19.0  17.5  19.0  14.1  16.8  18.1  19.0  15.7
Wiener filtering       14.7  15.8  15.2  15.8  16.0  15.1  15.0  15.7
Levin et al. [5]       18.4  16.8  18.1  11.1  16.7  17.1  18.3  17.7
Our method             10.6   9.5  12.1   9.0  12.3  13.5  12.1  14.1
Image deblurring
Lucy-Richardson        0.47  0.44  0.30  0.29  0.40  0.26  0.27  0.30
Regularized filtering  1.26  1.38  0.87  1.72  1.30  0.74  0.87  1.34
Wiener filtering       0.23  0.25  0.24  0.24  0.25  0.24  0.22  0.25
Levin et al. [5]       0.41  0.40  0.30  0.37  0.37  0.30  0.28  0.29
Our method             0.24  0.19  0.23  0.17  0.19  0.20  0.21  0.20

Masks – image noise level σ = 0.005
Blur scale estimation
Method                  a     b     c     d     e     f     g     h
Lucy-Richardson        18.3  17.1  18.2  12.9  16.6  17.4  18.4  17.9
Regularized filtering  19.0  17.5  19.0  14.1  16.8  18.1  18.9  15.7
Wiener filtering       15.6  16.5  16.1  16.8  16.7  16.5  16.0  16.7
Levin et al. [5]       18.5  16.9  18.1  11.3  16.7  17.4  18.4  17.7
Our method             12.2  11.8  13.3  10.8  12.7  13.7  13.4  13.7
Image deblurring
Lucy-Richardson        0.61  0.58  0.39  0.40  0.55  0.34  0.37  0.40
Regularized filtering  1.89  2.07  1.31  2.38  2.03  0.88  1.31  2.02
Wiener filtering       0.26  0.27  0.26  0.27  0.27  0.26  0.24  0.26
Levin et al. [5]       0.56  0.55  0.38  0.49  0.51  0.37  0.35  0.39
Our method             0.26  0.22  0.24  0.19  0.21  0.22  0.22  0.25
(a) Conventional aperture
(b) Ground-truth (pinhole camera)
Fig. 10. (a) Picture taken with the conventional camera without placing the mask on the lens. (b) Image captured by simulating a pinhole camera (f/22.0), which can be used as ground-truth for the image texture.
For each dataset, a 3D image (Fig. 14(e) and Fig. 15(e)) has been generated by using just the output of our method: the deblurred images (b) and the blur-scale maps (d). The ground-truth images have been taken by simulating a pinhole camera (f/22.0).

5.3 Computational Cost
We downsample the input images 4 times from their original resolution of 12.8 megapixels (4,368 × 2,912) and use sub-pixel accuracy, in order to keep the algorithm efficient. We have seen from experiments on real data that the raw blur-scale map is already very close to the regularized map. This means that we can obtain a reasonable blur scale map very efficiently: when β = 0 the value of the blur scale at one pixel is independent of the other pixels and the calculations can be carried out in parallel. Since the algorithm takes about 5 ms to process 40 blur scale levels at each pixel, it is suitable for real-time applications. We have run the algorithm on a QuadCore 2.8 GHz machine with 16 GB of memory. The code has been written mainly in Matlab 7. The deblurring procedure, instead, takes about 100 s to process the whole image for 40 blur scale levels.
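To illustrate why the β = 0 case parallelizes so easily, the sketch below labels every pixel independently by picking the blur level whose (precomputed) subspace projector leaves the smallest residual energy on the local patch. This is only a schematic reconstruction of the per-pixel step described above: the projector matrices, the patch size, and all names are assumptions, and building the projectors themselves is not shown.

import numpy as np

def per_pixel_blur_scale(coded, projectors, patch=13):
    # `projectors`: list of (patch*patch) x (patch*patch) projection matrices, one per blur level
    h, w = coded.shape
    half = patch // 2
    padded = np.pad(coded, half, mode='reflect')
    labels = np.zeros((h, w), dtype=int)
    for row in range(h):
        # gather all patches of this row as columns -> one matrix product per blur level
        cols = np.stack([padded[row:row + patch, c:c + patch].ravel() for c in range(w)], axis=1)
        energies = np.stack([np.sum((P @ cols) ** 2, axis=0) for P in projectors])
        labels[row] = np.argmin(energies, axis=0)    # level with the smallest residual energy
    return labels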
6 Conclusions
We have presented a novel method to recover the all-in-focus image from a single blurred image captured with a coded aperture camera. The method is split into two steps: a subspace-based blur scale identification approach and an image deblurring algorithm based on conjugate gradient descent. The method is simple, general, and computationally efficient. We have compared our method to existing algorithms in the literature and showed that we achieve state-of-the-art performance in blur scale identification and image deblurring on both synthetic and real data, while retaining polynomial time complexity.
(a) Input image
(b) Raw blur-scale map
(c) Deblurred image
(d) Raw blur-scale map
(e) Deblurred image
Fig. 11. Comparison on real data - mask 4(b). (a) Input image captured by using mask 4(b). (b-c) Blur-scale map and all-in-focus image reconstructed with Levin et al.'s method [5]; (d-e) results obtained with our method.
(a) Input image
(b) Raw blur-scale map
(c) Deblurred image
(d) Raw blur-scale map
(e) Deblurred image
Fig. 12. Comparison on real data - mask 4(d). (a) Input image captured by using mask 4(d). (b-c) Blur-scale map and all-in-focus image reconstructed with Levin et al.'s method [5]; (d-e) results obtained with our method.
(a) Input
(b) All-in-focus image
(c) Blur-scale map
(d) 3D image
Fig. 13. Close-range indoor scene [exposure time: 1/30s]. (a) coded image captured with mask 4(b); (b) estimated all-in-focus image; (c) estimated blur-scale map; (d) 3D image (to be watched with red-cyan glasses).
(a) Input image
(b) Deblurred image
(c) Raw blur-size map
(d) Estimated blur-size map
(e) 3D image
(f) Ground-truth image
Fig. 14. Long-range outdoor scene [exposure time: 1/200s]. (a) coded image captured with mask 4(b); (b) estimated all-in-focus image; (c) raw blur-scale map (without regularization); (d) regularized blur-scale map; (e) 3D image (to be watched with red-cyan glasses); (f) ground-truth image.
(a) Input image
(b) Deblurred image
(c) Raw blur-size map
(d) Estimated blur-size map
(e) 3D image
(f) Ground-truth image
Fig. 15. Mid-range outdoor scene [exposure time: 1/200s]. (a) coded image captured with mask 4(b); (b) estimated all-in-focus image; (c) raw blur-scale map (without regularization); (d) regularized blur-scale map; (e) 3D image (to be watched with red-cyan glasses); (f) ground-truth image.
Appendix

Proof of Theorem 1

To prove the theorem we rewrite the least squares problem in f as

\|H_d f - g\|_2^2 + \alpha \|\Sigma f\|_2^2
  = \left\| \begin{bmatrix} H_d \\ \sqrt{\alpha}\,\Sigma \end{bmatrix} f - \begin{bmatrix} g \\ 0 \end{bmatrix} \right\|_2^2
  = \| \bar{H}_d f - \bar{g} \|_2^2    (17)

where we have defined \bar{H}_d = [H_d^T \ \sqrt{\alpha}\Sigma^T]^T and \bar{g} = [g^T \ 0^T]^T. Then, we can define the solution in f as \hat{f} = (\bar{H}_d^T \bar{H}_d)^{-1} \bar{H}_d^T \bar{g}. By substituting the solution for f back in the least squares problem, we obtain

\|H_d f - g\|_2^2 + \alpha \|\Sigma f\|_2^2 = \| \bar{H}_d^{\perp} \bar{g} \|_2^2    (18)

where \bar{H}_d^{\perp} = I - \bar{H}_d (\bar{H}_d^T \bar{H}_d)^{-1} \bar{H}_d^T.

We have shown that we can use \bar{H}_d^{\perp} rather than H_d^{\perp} and \bar{g} rather than g in the minimization problem (5) without affecting the solution. The rest of the proof then assumes that the energy in eq. (5) is based on \| \bar{H}_d^{\perp} \bar{g} \|_2^2. The step above is necessary to fully exploit the properties of \bar{H}_d^{\perp}: it is a symmetric matrix (i.e., (\bar{H}_d^{\perp})^T = \bar{H}_d^{\perp}) and is also idempotent (i.e., \bar{H}_d^{\perp} = (\bar{H}_d^{\perp})^2). By applying the above properties we can write the argument of the first term of the cost in eq. (5) as

\bar{g}^T \bar{H}_d^{\perp} \bar{g} = \bar{g}^T (\bar{H}_d^{\perp})^T \bar{H}_d^{\perp} \bar{g} = \| \bar{H}_d^{\perp} \bar{g} \|^2.    (19)

Moreover, from the definition of \bar{H}_d^{\perp} we know that

\bar{H}_d^{\perp} = I - \bar{H}_d (\bar{H}_d^T \bar{H}_d)^{-1} \bar{H}_d^T = I - \bar{H}_d \bar{H}_d^{\dagger}.    (20)

Thus, the necessary conditions for an extremum of eq. (5) become

\begin{cases}
  \left( \bar{g} - \bar{H}_d \bar{H}_d^{\dagger} \bar{g} \right)^T \left( \nabla\bar{H}_d\, \bar{H}_d^{\dagger} + \bar{H}_d\, \nabla\bar{H}_d^{\dagger} \right) \bar{g} = \nabla \cdot \dfrac{\nabla d}{\|\nabla d\|_1} \\[4pt]
  f = \bar{H}_d^{\dagger} \bar{g},
\end{cases}    (21)

where \nabla\bar{H}_d is the gradient of \bar{H}_d with respect to d, and the right hand side of the first equation is the gradient of \|\nabla d\|_1 with respect to d. Similarly, the necessary conditions for eq. (4) are

\begin{cases}
  \left( \bar{g} - \bar{H}_d f \right)^T \nabla\bar{H}_d\, f = \nabla \cdot \dfrac{\nabla d}{\|\nabla d\|_1} \\[4pt]
  \bar{H}_d^T \left( \bar{g} - \bar{H}_d f \right) = 0.
\end{cases}    (22)

It is now immediate to apply the same derivation as in [46] and demonstrate that the left hand sides of the first equations in systems (22) and (21) are identical. Since the right hand sides are also identical, this implies that the first equations have the same solutions. The second equations in (22) and (21) are instead identical by construction.
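As a sanity check of the identities in eqs. (17)–(18) (an editorial illustration, not part of the original proof), the following NumPy snippet builds H̄_d and ḡ from random stand-ins for H_d, Σ, and g, solves the stacked least-squares problem, and verifies that the attained cost equals ||H̄_d^⊥ ḡ||².

import numpy as np

rng = np.random.default_rng(0)
rows, n = 30, 12
H = rng.standard_normal((rows, n))        # stand-in for H_d
Sigma = rng.standard_normal((n, n))       # stand-in for the regularization matrix Sigma
g = rng.standard_normal(rows)
alpha = 0.3

H_bar = np.vstack([H, np.sqrt(alpha) * Sigma])         # stacked system of eq. (17)
g_bar = np.concatenate([g, np.zeros(n)])
f_hat = np.linalg.lstsq(H_bar, g_bar, rcond=None)[0]   # f_hat = (H_bar^T H_bar)^{-1} H_bar^T g_bar

P_perp = np.eye(rows + n) - H_bar @ np.linalg.pinv(H_bar)   # H_bar_perp
cost = np.linalg.norm(H @ f_hat - g)**2 + alpha * np.linalg.norm(Sigma @ f_hat)**2
print(cost, np.linalg.norm(P_perp @ g_bar)**2)              # the two values agree, as in eq. (18)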
References

1. Jones, D., Lamb, D.: Analyzing the visual echo: Passive 3-d imaging with a multiple aperture camera. Technical report, McGill University (1993)
2. Dowski, E.R., Cathey, T.W.: Extended depth of field through wave-front coding. Applied Optics 34, 1859–1866 (1995)
3. Farid, H.: Range Estimation by Optical Differentiation. PhD thesis, University of Pennsylvania (1997)
4. Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Trans. Graph. 26, 69 (2007)
5. Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional camera with a coded aperture. ACM Trans. Graph. 26, 70 (2007)
6. Bishop, T., Zanetti, S., Favaro, P.: Light field superresolution. In: ICCP (2009)
7. Cossairt, O., Nayar, S.: Spectral focal sweep: Extended depth of field from chromatic aberrations. In: ICCP (2010)
8. Liang, C.K., Lin, T.H., Wong, B.Y., Liu, C., Chen, H.: Programmable aperture photography: Multiplexed light field acquisition. ACM Trans. Graph. 27, 55:1–55:10 (2008)
9. Ng, R., Levoy, M., Brédif, M., Duval, G., Horowitz, M., Hanrahan, P.: Light field photography with a hand-held plenoptic camera. Technical Report CSTR 2005-02, Stanford University CS (2005)
10. Gottesman, S.R., Fenimore, E.E.: New family of binary arrays for coded aperture imaging. Applied Optics 28, 4344–4352 (1989)
11. Zomet, A., Nayar, S.K.: Lensless imaging with a controllable aperture. In: CVPR, vol. 1, pp. 339–346 (2006)
12. Raskar, R., Agrawal, A.K., Tumblin, J.: Coded exposure photography: Motion deblurring using fluttered shutter. ACM Trans. Graph. 25, 795–804 (2006)
13. Hiura, S., Matsuyama, T.: Depth measurement by the multi-focus camera. In: CVPR, vol. 2, pp. 953–961 (1998)
14. Zhou, C., Lin, S., Nayar, S.K.: Coded aperture pairs for depth from defocus. In: ICCV (2009)
15. Zhou, C., Nayar, S.: What are good apertures for defocus deblurring? In: IEEE ICCP (2009)
16. Levin, A., Hasinoff, S., Green, P., Durand, F., Freeman, W.T.: 4d frequency analysis of computational cameras for depth of field extension. ACM Trans. Graph. 28 (2009)
17. McLean, D.: The improvement of images obtained with annular apertures. Royal Society of London 263, 545–551 (1961)
18. Greengard, A., Schechner, Y.Y., Piestun, R.: Depth from diffracted rotation. Optics Letters 31, 181–183 (2006)
19. Dowski, E.R., Cathey, T.W.: Single-lens single-image incoherent passive-ranging systems. Applied Optics 33, 6762–6773 (1994)
20. Johnson, G.E., Dowski, E.R., Cathey, W.T.: Passive ranging through wave-front coding: Information and application. Applied Optics 39, 1700–1710 (2000)
21. Cossairt, O., Zhou, C., Nayar, S.K.: Diffusion coding photography for extended depth of field. ACM Trans. Graph. (2010)
22. Dou, Q., Favaro, P.: Off-axis aperture camera: 3d shape reconstruction and image restoration. In: CVPR (2008)
23. Georgiev, T., Zheng, K., Curless, B., Salesin, D., Nayar, S., Intawala, C.: Spatio-angular resolution tradeoffs in integral photography. In: Eurographics Workshop on Rendering, pp. 263–272 (2006)
24. Levoy, M., Ng, R., Adams, A., Footer, M., Horowitz, M.: Light field microscopy. ACM Trans. Graph. 25, 924–934 (2006)
25. Fergus, R., Singh, B., Hertzmann, A., Roweis, S., Freeman, W.: Removing camera shake from a single photograph. ACM Trans. Graph. 25, 787–794 (2006)
26. Shan, Q., Jia, J., Agarwala, A.: High-quality motion deblurring from a single image. ACM Trans. Graph. (2008)
27. Levin, A., Weiss, Y., Durand, F., Freeman, W.T.: Understanding and evaluating blind deconvolution algorithms. In: CVPR, pp. 1964–1971 (2009)
28. Cho, S., Lee, S.: Fast motion deblurring. Siggraph Asia 28 (2009)
29. Xu, L., Jia, J.: Two-phase kernel estimation for robust motion deblurring. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 157–170. Springer, Heidelberg (2010)
30. Shan, Q., Xiong, W., Jia, J.: Rotational motion deblurring of a rigid object from a single image. In: ICCV, pp. 1–8 (2007)
31. Whyte, O., Sivic, J., Zisserman, A., Ponce, J.: Non-uniform deblurring for shaken images. In: CVPR, pp. 491–498 (2010)
32. Gupta, A., Joshi, N., Zitnick, C., Cohen, M., Curless, B.: Single image deblurring using motion density functions. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6311, pp. 171–184. Springer, Heidelberg (2010)
33. Nishiyama, M., Hadid, A., Takeshima, H., Shotton, J., Kozakaya, T., Yamaguchi, O.: Facial deblur inference using subspace analysis for recognition of blurred faces. IEEE T. PAMI 33, 1–8 (2011)
34. Pouli, T., Cunningham, D.W., Reinhard, E.: Image statistics and their applications in computer graphics. Eurographics, State of the Art Report (2010)
35. Ruderman, D.L.: The statistics of natural images. Network: Computation in Neural Systems 5, 517–548 (1994)
36. Huang, J., Mumford, D.: Statistics of natural images and models. In: CVPR, vol. 1, pp. 1541–1548 (1999)
37. Huang, J., Lee, A., Mumford, D.: Statistics of range images. In: CVPR, pp. 324–331 (2000)
38. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
39. Kolmogorov, V., Zabih, R.: Multi-camera scene reconstruction via graph cuts. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2352, pp. 82–96. Springer, Heidelberg (2002)
40. Press, W., Flannery, B., Teukolsky, S., Vetterling, W.: Numerical Recipes in C. Cambridge University Press, Cambridge (1988)
41. Sun, X., Cheng, Q.: On subspace distance. In: Image Analysis and Recognition, pp. 81–89 (2006)
42. Premaratne, P., Ko, C.: Zero sheet separation of blurred images with symmetrical point spread functions. Signals, Systems, and Computers, 1297–1299 (1999)
43. Martinello, M., Bishop, T.E., Favaro, P.: A bayesian approach to shape from coded aperture. In: ICIP (2010)
44. Snyder, D., Schulz, T., O'Sullivan, J.: Deblurring subject to nonnegativity constraints. IEEE Trans. on Signal Processing 40(5), 1143–1150 (1992)
45. Bertero, M., Boccacci, P.: Introduction to inverse problems in imaging. Institute of Physics Publishing, Bristol (1998)
46. Favaro, P., Soatto, S.: A geometric approach to shape from defocus. TPAMI 27, 406–417 (2005)
Compressive Rendering of Multidimensional Scenes Pradeep Sen, Soheil Darabi, and Lei Xiao Advanced Graphics Lab, University of New Mexico, Albuquerque, NM 87113
Abstract. Recently, we proposed the idea of using compressed sensing to reconstruct the 2D images produced by a rendering system, a process we called compressive rendering. In this work, we present the natural extension of this idea to multidimensional scene signals as evaluated by a Monte Carlo rendering system. Basically, we think of a distributed ray tracing system as taking point samples of a multidimensional scene function that is sparse in a transform domain. We measure a relatively small set of point samples and then use compressed sensing algorithms to reconstruct the original multidimensional signal by looking for sparsity in a transform domain. Once we reconstruct an approximation to the original scene signal, we can integrate it down to a final 2D image which is output by the rendering system. This general form of compressive rendering allows us to produce effects such as depth-of-field, motion blur, and area light sources, and also renders animated sequences efficiently.
1 Introduction
The process of rendering an image as computed by Monte Carlo (MC) rendering systems involves the estimation of a set of integrals of a multidimensional function that describes the scene. For example, for a scene with depth-of-field and motion blur, we can think of the distributed ray tracing system as taking point samples of a 5D continuous "scene signal" f(x, y, u, v, t), where f() is the scene-dependent function, (x, y) represents the position of the sample on the image, (u, v) is its position on the aperture for the depth-of-field, and t describes the time at which the sample is calculated. The ray tracing system can compute point samples of this function by fixing the parameters (x, y, u, v, t) and evaluating the radiance of a ray with those parameters. The basic idea of Monte Carlo rendering is that by taking a large set of random point samples of this function, we can approximate the definite integral:

I(i, j) = \int_{j-\frac{1}{2}}^{j+\frac{1}{2}} \int_{i-\frac{1}{2}}^{i+\frac{1}{2}} \int_{t_0}^{t_1} \int_{-1}^{1} \int_{-1}^{1} f(x, y, u, v, t)\, du\, dv\, dt\, dx\, dy,    (1)
which gives us the value of the final image I at pixel (i, j) by integrating over the camera aperture from [−1, 1], over the time that the shutter is open [t0, t1], and over the pixel for antialiasing. In rendering, we use Monte Carlo integration
to estimate integrals like these because finding an analytical solution to these integrals is nearly impossible for real scene functions f(). Unfortunately, Monte Carlo rendering systems require a large number of multidimensional samples in order to converge to the actual value of the integral, because the variance of the estimate of the integral decreases as O(1/k) with the number of samples k. If a small number of samples is used, the resulting image is very noisy and cannot be used for high-end rendering applications.

The noise in the Monte Carlo result is caused by variance in the estimate, and there have been many approaches proposed in the past for reducing the variance in MC rendering. One common method for variance reduction is stratified sampling, wherein the integration domain is broken up into a set of equally-sized non-overlapping regions (or strata) and a single sample is placed randomly in each, which reduces the variance of the overall estimate [1]. Other techniques for variance reduction exist, but they typically require more information about f(). For example, importance sampling positions samples with a distribution p(x) that mimics f() as closely as possible. It can be shown that if p(x) is set to a normalized version of f(), then the variance of our estimator will be exactly zero [2]. However, this normalization involves knowing the integral of f(), which is obviously unknown in our case. Nevertheless, importance sampling can be useful when some information about the shape of f() is known, such as the position of light sources in a scene. In this work, however, we assume that we do not know anything about the shape of f() that we can use to position samples, which makes our approach a kind of technique often known as blind Monte Carlo. The only assumption we will make is that f() is a real-world signal that is sparse or compressible in a transform domain.

Other kinds of variance reduction techniques have been proposed that introduce biased estimators, meaning that the expected value of the estimator is not equal to the exact value of the integral. Although methods such as stratified or importance sampling are both unbiased, biased Monte Carlo algorithms are also common in computer graphics (e.g., photon mapping [3]) because they sometimes converge much faster while yielding plausible results. The proposed approach in this paper also converges much more quickly than the traditional unbiased approaches, but it results in a slightly biased result. As we shall see, this occurs because of the discretization of the function when we pose it within the framework of compressed sensing (CS). However, this bias is small while the improvement in the convergence rate is considerable.

This chapter is based on ideas presented by the first two authors in work published in the IEEE Transactions on Visualization and Computer Graphics [4] and the Eurographics Rendering Symposium (EGSR) [5]. In the first work, we introduced the idea of using compressed sensing as a way of filling in missing pixel information in order to accelerate rendering. In that approach, we first render only a fraction of the pixels in the image (which provides the speedup) and then we estimate the values of the missing pixels using compressed sensing by assuming that the final image is compressible in a transform domain.
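As a concrete illustration of the variance reduction that stratified sampling provides over plain random sampling, the following small Python/NumPy sketch (added for illustration; the integrand and all parameters are arbitrary stand-ins, not taken from this work) estimates a 1D integral with both schemes and compares the empirical standard deviation of the estimates.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(8.0 * x) ** 2 + 0.5 * x      # arbitrary test integrand on [0, 1]
k, trials = 64, 2000

# plain Monte Carlo: k uniform random samples per estimate
uniform_est = [f(rng.random(k)).mean() for _ in range(trials)]

# stratified sampling: one jittered sample in each of k equal-sized strata
jitter = rng.random((trials, k))
stratified_est = [f((np.arange(k) + j) / k).mean() for j in jitter]

print("std (uniform):   ", np.std(uniform_est))
print("std (stratified):", np.std(stratified_est))   # noticeably smaller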
(a) Original image  (b) 2D sparsification  (c) 3D sparsification  (d) 4D sparsification

Fig. 1. Showing the effect of dimensionality on the compressibility of the signal in the Fourier domain. As the dimension of our scene function f() increases, the compressibility of the data increases as well. Here we show a 4D scene with pixel antialiasing (2D) and depth-of-field (2D), which we have sparsified to 98% sparsity in the Fourier domain by zeroing out 98% of the Fourier coefficients. (a) Reference image. (b) Image generated by integrating the function down to 2D and then sparsifying it to 98% in the Fourier domain. We can see a significant amount of ringing and artifacts, which indicates that the 2D signal is not very compressible in the Fourier domain. This is the reason that we use wavelets for compression when handling 2D signals (see Secs. 5 and 6). (c) Image generated by integrating the function down to 3D (by integrating out the u parameter) and then sparsifying it to 98% in the Fourier domain. There are fewer artifacts than before, although they are still visible. (d) Image generated by sparsifying the original scene function f() to 98% in the Fourier domain. The artifacts are greatly reduced here, indicating that as the dimensionality of the signal goes up, the transform-domain sparsity also increases.
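The sparsification operation used to generate Fig. 1 can be mimicked with a few lines of NumPy. The sketch below is an illustration only: the actual comparison in Fig. 1 requires the rendered scene function, which is not available here, so an arbitrary test signal is used instead.

import numpy as np

def sparsify_fourier(signal, keep=0.02):
    # zero out all but the largest `keep` fraction of Fourier coefficients (98% sparsification)
    F = np.fft.fftn(signal)
    thresh = np.quantile(np.abs(F), 1.0 - keep)
    return np.real(np.fft.ifftn(np.where(np.abs(F) >= thresh, F, 0.0)))

rng = np.random.default_rng(4)
test = np.cumsum(rng.standard_normal((64, 64)), axis=0)      # an arbitrary smooth-ish 2D test signal
approx = sparsify_fourier(test)
print(np.linalg.norm(test - approx) / np.linalg.norm(test))  # relative error after 98% sparsification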
In the second work, we began to expand this idea to the concept of estimating an underlying multidimensional signal which we then integrate down to produce our final image. At the Dagstuhl workshop on Computational Video [6], we presented initial results on applying these ideas to animated video sequences (see Sec. 8). This work presents a more general framework for compressive rendering that ties all of these ideas together into an algorithm that can handle a general set of Monte Carlo effects by estimating a multidimensional scene function from a small set of samples. By moving to higher-dimensional data sets, we improve the quality of the reconstructions, because compressed sensing algorithms improve as the signal becomes more sparse, and as the dimension of the problem increases, the sparsity (or, technically, compressibility) of the signal also increases. The reason for this is that the amount of data in the signal goes up exponentially with the dimension, but the amount of actual information does not increase at this rate. As shown in Fig. 1, a 4D signal sparsified to 98% produces a much better quality image than a 2D signal sparsified by the same amount.

To present this work, we first begin by describing previous work in rendering as it relates to Monte Carlo rendering and transform-domain accelerations proposed in the past. Next, we present a brief introduction to the theory of compressed sensing, since it is a field still relatively new to computer graphics. In Sec. 4, we present an overview of our general approach as well as a simple 1D example to compare the reconstruction of a signal from CS with those of traditional techniques such as parametric fitting. Secs. 5 – 10 then show applications of this
framework starting with 2D signals and building up to more complex 4D scenes. Finally, we end the chapter with some discussion and conclusions. We note that since this paper is in fact a generalization of two previous papers [5, 4], we have taken the liberty to heavily draw from our own text from these papers and the associated technical report [7], often verbatim, to maintain consistency across all the publications. We also duplicate results as necessary for completeness of this text.
2 Previous Work
Our framework allows us to produce noise-free Monte Carlo rendered images with a small set of samples by filling in the missing samples of the multidimensional function using compressed sensing. Similar topics have been the subject of research in the graphics community for many years. We break up the previous work into algorithms that exploit transform-domain sparsity for rendering, algorithms that accelerate the rendering process outright, algorithms that are used to fill in missing sample information, and finally applications of compressed sensing in computer graphics.

2.1 Transform Compression in Rendering
There is a long history of research into transform-based compression to accelerate or improve rendering algorithms. We briefly survey some of the relevant work here and refer readers to more in-depth surveys (e.g., [8, 9]) for more detail. For background on wavelets, the texts by Stollnitz et al. [10] and Mallat [11] offer good starting points. In the area of image rendering, transform compression techniques have been used primarily for accelerating the computation of illumination. For example, the seminal work of Hanrahan et al. [12] uses an elegant hierarchical approach to create a multiresolution model of the radiosity in a scene. While it does not explicitly use wavelets, their approach is equivalent to using a Haar basis. This work has been extended to use different kinds of wavelets or to subdivide along shadow boundaries to further increase the efficiency of radiosity algorithms, e.g., [13, 14, 15]. Recently, interest in transform-domain techniques for illumination has been renewed through research into efficient pre-computed radiance transfer methods using bases such as spherical harmonics [16, 17] or Haar wavelets [18, 19]. Again, these approaches focus on using the sparsity of the illumination or the BRDF reflectance functions in a transform domain, not on exploiting the sparsity of the final image. In terms of using transform-domain approaches to synthesize the final image, the most successful work has been in the field of volume rendering. In this area, both the Fourier [20, 21] and wavelet domains [22] have been leveraged to reduce rendering times. However, the problem they are solving is significantly different than that of image rendering, so their approaches do not map well to
the problem addressed in this work. Finally, perhaps the most similar rendering approach is the frequency-based ray tracing work of Bolin and Meyer [23]. Like our own approach, they take a set of samples and then try to solve for the transform coefficients (in their case the Discrete Cosine Transform) that would result in those measurements. However, the key difference is that they solve for these coefficients using least-squares, which means that they can only reconstruct the frequencies of the signal that have sufficient measurements as given by the Nyquist-Shannon sampling theorem. Our approach, on the other hand, is based on the more recent work on compressed sensing, which specifies that the sampling rate is dependent on the sparsity of the signal rather than on its band-limit. This allows us to reconstruct frequencies higher than that specified by the Nyquist rate. We show an example of this in Sec. 4.1 that highlights this difference. By posing the problem of determining the value of missing samples within the framework of compressed sensing, we leverage the diverse set of tools that have been recently developed for these kinds of problems.

2.2 Accelerating Ray Tracing and Rendering

Most of the work in accelerating ray tracing has focused on novel data structures for accelerating the scene traversal [24]. These methods are orthogonal to ours since we do not try to accelerate the ray tracing process (which involves point-sampling the multidimensional function) but rather focus on generating a better image with fewer samples. However, there are algorithms to accelerate rendering that take advantage of the spatial correlation of the final image, which in the end is related to sparsity in the wavelet domain. Most common is the process of adaptive sampling [25, 26], in which a fraction of the samples are computed and new samples are computed only where the difference between measured samples is large enough, e.g., by a measure of contrast. Unlike our approach, however, adaptive sampling still computes the image in the spatial domain, which makes it impossible to apply arbitrary wavelet transforms. For example, in the 2D missing pixel case of Sec. 5 we use the CDF 9/7 wavelet transform because it has been shown to be very good at compressing imagery. In other sections, we use sparsity in the Fourier domain. It is unclear how existing adaptive methods could be modified to use bases like this.

There is also a significant body of work which attempts to reconstruct images from sparse samples by using specialized data structures. First, there are systems which try to improve the interactivity of rendering by interpolating a set of sparse, rendered samples, such as the Render Cache [27] and the Shading Cache [28]. There are also approaches that perform interpolation while explicitly observing boundary edges to prevent blurring across them. Examples include Pighin et al.'s image-plane discontinuity mesh [29], the directional coherence map proposed by Guo [30], the edge-and-point image data structure of Bala et al. [31], and the real-time silhouette map by Sen [32,33]. Our work is fundamentally different from these approaches because we never explicitly encode edges or use a data structure to improve the interpolation between samples. Rather, we take advantage of the compressibility of the final multidimensional signal in a
transform domain in order to reconstruct it and produce the final image. This allows us to faithfully reconstruct edges in the image, as can be seen by our results.

2.3 Reconstruction of Missing Data
Our approach only computes a fraction of the samples and uses compressed sensing to "guess" the values of the missing samples of the multidimensional scene function. In computer graphics and vision, many techniques have been proposed to fill in missing sample data. In the case of 2D signals such as images, techniques such as inpainting [34] and hole-filling [35] have been explored. Typically, these approaches work by taking a band of measured pixels around the unknown region and minimizing an energy functional that propagates this information smoothly into the unknown regions while at the same time preserving important features such as edges. Although we could use these algorithms to fill in the missing pixels in our 2D rendering application, the random nature of the rendered pixels makes our application fundamentally different from that of typical hole-filling, where the missing pixels have localized structure due to specific properties of the scene (such as visibility) which in our case are not available until render time. Furthermore, these methods become much less effective and more complex when trying to fill in the missing data for higher dimensional cases, especially as we get to 4D scene functions or larger. Nevertheless, we compare our algorithm to inpainting in Sec. 5 to help validate our approach.

Perhaps the most successful approaches for reconstructing images from nonuniform samples for the 2D case come from the non-uniform sampling community, where this is known as the "missing data" problem, since one is trying to reconstruct the missing samples of a discrete signal. Readers are referred to Ch. 6 of the principal text on the subject by Marvasti [36] for a complete explanation. One successful algorithm is known as ACT [37], which tries to fit trigonometric polynomials (related to the Fourier series) to the point-sampled measurements in a least-squares sense by solving the system using Toeplitz matrix inversion. This is related to the frequency-based ray tracing by Bolin and Meyer [23] described earlier. Another approach, known as the Marvasti method [38], solves the missing data problem by iteratively building up the inverse of the system formed by the non-uniform sampling pattern combined with a low-pass filter. However, both the ACT and Marvasti approaches fundamentally assume that the image is bandlimited in order to do the reconstruction, something that is not true in our rendering application. As we show later in this paper, our algorithm relaxes the bandlimited assumption and is able to recover some of the high-frequency components of the image signal. Nevertheless, since ACT and Marvasti represent state-of-the-art approaches in the non-uniform sampling community for the reconstruction of missing pixels in a non-uniformly sampled image, we will compare our approach against these algorithms in Sec. 5. Unfortunately, neither of these algorithms is suitable for higher dimensional signals.
2.4 Compressed Sensing and Computer Graphics
In this paper, we use tools developed for compressed sensing to solve the problem of reconstructing rendered images with missing pixel samples. Although compressed sensing has been applied to a wide range of problems in many fields, in computer graphics there are only a few published works that have used CS. Other than the work on compressive rendering on which this paper is based [5,4], most of the other applications of CS in graphics are in the area of light-transport acquisition [39,40,41]. The important difference between this application and our own is that it is not easy to measure arbitrary linear projections of the desired signal in rendering, while it is very simple to do so in light transport acquisition through structured illumination. In other words, since computing the weighted sum of a set of samples is linearly harder than calculating the value of a single sample, our approach for a rendering framework based on compressive sensing had to be built around random point sampling. This will become clearer as we give a brief introduction to the theory of compressed sensing in the next section.
3 Compressed Sensing Theory
In this section, we summarize some of the key theoretical results of compressed sensing in order to explain our compressive rendering framework. A summary of the notation we shall use in this paper is shown in Table 1. Readers are referred to the key papers of Candès et al. [42] and Donoho [43], as well as the extensive CS literature available through the Rice University online repository [44], for a more comprehensive review of the subject.

3.1 Theoretical Background
The theory of compressed sensing allows us to reconstruct a signal from only a few samples if it is sparse in a transform domain. To see how, suppose that we have an n-dimensional signal f ∈ R^n we are trying to estimate with k random point samples, where k ≪ n. We can write this sampling process with the linear sampling equation y = Sf, where S is a k × n sampling matrix that contains a single "1" in each row and no more than a single "1" in each column to represent each point-sampled value, and zeros for all the remaining elements. This maps well to our rendering application, where the n-pixel image we want to render (f) is going to be estimated from only k pixel samples (y). Initially, it seems that perfect estimation of f from y is impossible, given that there are (n − k) pixels which we did not observe and which could possibly have any value (ill-posedness). This is where we use the key assumption of compressed sensing: we assume that the image f is sparse in some transform domain, f̂ = Ψ⁻¹f. Mathematically, the signal f̂ is m-sparse if it has at most m non-zero coefficients (where m ≪ n), which can be written in terms of the ℓ0 norm (which effectively "counts" the number of non-zero elements): ||f̂||_0 ≤ m. This is not an unreasonable assumption for real-world signals such as images, since
Table 1. Notation used in this paper

n   size of final signal
k   number of evaluated samples
m   number of non-zero coefficients in transform domain
f   high-resolution final signal, represented by an n × 1 vector
f̂   transform of the signal, represented by an n × 1, m-sparse vector
y   k × 1 vector of samples of f computed by the ray tracer
S   k × n sampling matrix of the ray tracer, s.t. y = Sf
Ψ   n × n "synthesis" matrix, s.t. f = Ψf̂, and its associated inverse Ψ⁻¹
A   k × n "measurement" matrix, A = SΨ
this fact is exploited in transform-coding compression systems such as JPEG and MPEG. The basic idea of compressed sensing is that through this assumption, we are able to eliminate many of the images in the (n − k)-dimensional subspace which do not have sparse properties. To formulate the problem within the compressed sensing framework, we substitute our transform-domain signal f̂ into our sampling equation:

    y = Sf = SΨf̂ = Af̂,    (2)

where A = SΨ is a k × n measurement matrix that includes both the sampling and compression bases. If we could solve this linear system correctly for f̂ given y, we could then recover the desired f by taking the inverse transform. Unfortunately, solving for f̂ is difficult to do with traditional techniques such as least squares because the system is severely underdetermined (k ≪ n). However, one of the key results in compressed sensing demonstrates that if k ≥ 2m and the Restricted Isometry Condition (RIC) is met (Sec. 3.4), then we can solve for f̂ uniquely by searching for the sparsest f̂ that solves the equation. A proof of this remarkable conclusion can be found in the paper by Candès et al. [42]. Therefore, we can pose the problem of computing the transform of the final rendered image from a small set of samples as the solution of the ℓ0 optimization problem:

    min ||f̂||_0   s.t.   y = Af̂.    (3)
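To make Eqs. (2)–(3) concrete, the following Python/NumPy sketch (illustrative only: the DCT basis, the problem sizes, and the simple greedy pursuit used as a stand-in for the ROMP/SpaRSA solvers described next are all our own choices, not part of the original system) builds a point-sampling matrix S, a synthesis matrix Ψ, and recovers a sparse f̂ from k ≪ n point samples.

import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(2)
n, m, k = 256, 5, 80                          # signal length, sparsity, number of point samples
Psi = idct(np.eye(n), axis=0, norm='ortho')   # synthesis matrix: f = Psi @ f_hat (DCT basis, an assumption)

f_hat = np.zeros(n)                           # an m-sparse transform-domain signal
f_hat[rng.choice(n, m, replace=False)] = rng.standard_normal(m)
f = Psi @ f_hat

rows = rng.choice(n, k, replace=False)        # S: one "1" per row -> random point sampling
S = np.zeros((k, n)); S[np.arange(k), rows] = 1.0
y, A = S @ f, S @ Psi                         # y = S f = S Psi f_hat = A f_hat  (Eq. 2)

idx, r = [], y.copy()                         # simple greedy pursuit, one coefficient per pass
for _ in range(m):
    idx.append(int(np.argmax(np.abs(A.T @ r))))
    z, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
    r = y - A[:, idx] @ z
recovered = np.zeros(n); recovered[idx] = z
print(np.max(np.abs(recovered - f_hat)))      # typically near zero for these sizes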
Unfortunately, algorithms to solve Eq. 3 are NP-hard [45] because they involve a combinatorial search of all m-sparse vectors f̂ to find the sparsest one that meets the constraint. Fortunately, the CS research community has developed fast algorithms that find approximate solutions to this problem. In this paper we use solvers such as ROMP and SpaRSA to compute the coefficients of the signal f̂ within the context of compressed sensing. We give an overview of these algorithms in the following two sections.

3.2 Overview of ROMP Algorithm
Since the solution of the ℓ0 problem in Eq. 3 requires a brute-force combinatorial search of all the f̂ vectors with sparsity less than m, the CS research community has been developing fast, greedy algorithms that find approximate solutions
Algorithm 1. ROMP algorithm
Input: measured vector y, matrices A and A†, target sparsity m
Output: the vector f̂, which is an m-sparse solution of y = Af̂
Initialize: I = ∅ and r = y
 1: while r ≠ 0 and sparsity not met do
 2:   u ⇐ A† r                              /* multiply residual by A† to approx. larger coeffs of f̂ */
 3:   J ⇐ sort coefficients of u in non-increasing order
 4:   J0 ⇐ contiguous set of coefficients in J with maximal energy
 5:   I ⇐ I ∪ J0                            /* add new indices to overall set */
 6:   /* find vector of I coeffs that best matches measurement */
 7:   f̂_new ⇐ argmin_{z : supp(z) = I} ||y − Az||_2
 8:   r ⇐ y − A f̂_new                       /* recompute residual */
 9: end while
10: return f̂_new
to the ℓ0 problem. One example is Orthogonal Matching Pursuit (OMP) [46], which iteratively attempts to find the non-zero elements of f̂ one at a time. To do this, OMP is given the measured vector y and measurement matrix A as input, and it finds the coefficient of f̂ with the largest magnitude by projecting y onto each column of A through the inner product |⟨y, a_j⟩| (where a_j is the j-th column of A) and selecting the largest. After the largest coefficient has been identified, we assume that this is the only non-zero coefficient in f̂ and approximate its value by solving y = Af̂ using least squares. The new estimate for f̂ with a single non-zero coefficient is then used to compute the estimated signal f, which is subtracted from the original measurements to get a residual. The algorithm then iterates again, using the residual to solve for the next largest coefficient of f̂, and so on. It continues to do this until an m-sparse approximation of the transform-domain vector is found. Despite its simplicity, OMP has a weaker guarantee of exact recovery than the ℓ1 methods [47]. For this reason, Needell and Vershynin proposed a modification to OMP called Regularized Orthogonal Matching Pursuit (ROMP), which recovers multiple coefficients in each iteration, thereby accelerating the algorithm and making it more robust to meeting the RIC. Essentially, ROMP approximates the largest-magnitude coefficients of f̂ in a similar way to OMP, by projecting y onto each column of A, and sorts them in decreasing order. It then finds all of the contiguous sets of coefficients in this list whose largest coefficient is at most twice as big as the smallest member, and selects the set with the maximal energy. These indices are added to a list that is maintained by the algorithm, which keeps track of the non-zero coefficients of f̂, and the values of those coefficients are computed by solving y = Af̂ through least squares assuming that these are the only non-zero coefficients. As in OMP, the new estimate for f̂ is then used to compute the estimated signal f, which is subtracted from y to get a residual. The algorithm continues to iterate using the residual as the input and solving for the next largest set of coefficients of f̂ until an m-sparse approximation of the transform-domain vector is found, an error criterion is met, or the number of
iterations exceeds a certain limit without convergence. Although the ℓ1 problem requires N = O(m log k) samples to be solved uniquely, in practice we find that the ROMP algorithm requires around N = 5m samples to start locking in to the correct solution and N = 10m to work extremely robustly. Since we use the ROMP algorithm in both 2D signal reconstruction applications (see Secs. 5 and 6), we provide a pseudocode description for reference in Alg. 1.

3.3 Overview of SpaRSA Algorithm
Another algorithm we use in this paper is known as SpaRSA. One of the key results of recent compressed sensing theory is that the problem of Eq. 3 can be framed as an ℓ1 problem instead, where the ℓ1 norm is defined as the sum of the absolute values of the elements of the vector (||v||_1 = Σ_{i=1}^{k} |v_i|):

    min ||f̂||_1   s.t.   y = Af̂.    (4)
Candès et al. [42] showed that this equation has the same solution as Eq. 3 if A satisfies the RIC and the number of samples N = O(m log k), where m is the sparsity of the signal. This fundamental result spurred the flurry of research in compressed sensing, because it demonstrated that these problems could be solved by tractable algorithms such as linear programming. Unfortunately, it is still difficult to solve the ℓ1 problem, so researchers in applied mathematics have been working on novel algorithms to provide a fast solution. One successful avenue of research is to reformulate Eq. 4 into what is known as the ℓ2–ℓ1 problem:

    min_{f̂}  (1/2) ||y − Af̂||_2^2 + τ ||f̂||_1.    (5)

In this formulation, the first term enforces the fit of the solution to the measured values y while the second term looks for the smallest ℓ1 solution (and hence the sparsest solution). The parameter τ balances the optimization towards one constraint or the other. Recently, Wright et al. proposed a novel solution to Eq. 5 by solving a simple iterative subproblem with an algorithm they call Sparse Reconstruction by Separable Approximation (SpaRSA) [48]. In this work, we use SpaRSA to reconstruct scene signals f that are 3D and larger. Unfortunately, even a simple explanation of SpaRSA is beyond the scope of this paper. Interested readers are referred to the technical report associated with our EGSR paper [7] for more information.
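Although SpaRSA itself is not described here, the flavor of solvers for the ℓ2–ℓ1 problem in Eq. (5) can be conveyed with a minimal proximal-gradient (ISTA-style) iteration. The sketch below is a simplified stand-in added only for illustration (SpaRSA additionally adapts its step size per iteration), and all names and defaults are our own.

import numpy as np

def l2_l1_ista(A, y, tau, iters=500):
    # minimize 0.5 * ||y - A x||_2^2 + tau * ||x||_1 by gradient step + soft-thresholding
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the quadratic term's gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - (A.T @ (A @ x - y)) / L           # gradient step on the data-fit term
        x = np.sign(z) * np.maximum(np.abs(z) - tau / L, 0.0)   # soft-thresholding (prox of the l1 term)
    return x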
3.4
Restricted Isometry Condition (RIC)
It is impossible to solve y = A f̂ for an arbitrary A if k ≪ n, because the system is severely underdetermined. However, compressed sensing can be used to solve uniquely for f̂ if the matrix A meets the Restricted Isometry Condition (RIC):

(1 − ε)‖v‖₂ ≤ ‖Av‖₂ ≤ (1 + ε)‖v‖₂,   (6)
with parameters (z, ε), where ε ∈ (0, 1), for all z-sparse vectors v [47]. Effectively, the RIC states that in a valid measurement matrix A, every possible set of z columns of A forms an approximately orthogonal set. Another way to say this is that the sampling and compression bases S and Ψ that make up A must be incoherent. Examples of matrices that have been proven to meet the RIC include Gaussian matrices (with entries drawn from a normal distribution), Bernoulli matrices (binary matrices drawn from a Bernoulli distribution), and partial Fourier matrices (randomly selected Fourier basis functions) [49]. In this chapter, we can use point-sampled Fourier matrices (partial Fourier matrices) for all of the applications with a scene function that is 3D or higher, but for the 2D cases where we are reconstructing an image, the Fourier basis does not provide enough compression (as shown in Fig. 1) and cannot be used. In this case, we would like to use the wavelet transform, which does provide enough compression, but unfortunately a point-sampled wavelet basis does not meet the RIC. We discuss our modifications to the wavelet basis to improve this and allow our framework to be used for the reconstruction of 2D scene functions in Sec. 5. With this theoretical background in place, we can now give an overview of our algorithm and a simple 1D example.
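As a small numerical illustration of the near-isometry just described (our own toy experiment, not part of the authors' pipeline), one can draw a Gaussian measurement matrix and verify that it approximately preserves the norm of random sparse vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, z = 1024, 128, 8                     # signal length, samples, sparsity

# Gaussian measurement matrix, scaled so columns have roughly unit norm.
A = rng.standard_normal((k, n)) / np.sqrt(k)

ratios = []
for _ in range(1000):
    v = np.zeros(n)
    idx = rng.choice(n, z, replace=False)  # random z-sparse support
    v[idx] = rng.standard_normal(z)
    ratios.append(np.linalg.norm(A @ v) / np.linalg.norm(v))

# For a matrix satisfying the RIC, these ratios stay within [1 - eps, 1 + eps].
print(min(ratios), max(ratios))
```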
4
Algorithm Overview
The basic idea of the proposed rendering framework is quite simple. We use a distributed ray tracing system that takes a small set of point samples of the multidimensional scene function f(). In traditional Monte Carlo rendering, these samples would be added together to estimate the integral of f(), but in our case we assume that the signal f() is sparse in a transform domain and use the compressed sensing theory described in the previous section to estimate a discrete reconstruction of f(). This reconstructed version is then integrated down to form our final image. Since the CS solvers operate on discrete vectors and matrices, we first approximate the unknown function f() with a discrete vector f of size n by taking uniform samples of f(). This approximation is reasonable as long as n is large enough, since it is equivalent to discretizing f(). For example, if we assume that the signal f() is sparse in the Fourier basis composed of 2π-periodic basis functions, then the signal f() must also be 2π-periodic. Therefore, the samples that form f must cover the 2π interval of f() to ensure periodicity and therefore maintain the sparsity in the Discrete Fourier Transform domain. In this case, for example, the i-th component of f is given by fᵢ = f(2π(i − 1)/n). By sampling the function f() in this manner when discretizing it, we guarantee that f will also be sparse in the transform domain Ψ, where the columns of Ψ are discrete versions of the basis functions ψᵢ from the equations above. Note that we do not explicitly sample f() to create f (since we do not know f() a priori), but rather we assume that an n-length vector f exists which is the discrete version of the unknown f() and which we will solve for through CS. We can now take our k random measurements of f, as given by y = Sf where S is the k × n sampling matrix, by point-sampling the original function f() at the appropriate discrete locations.
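The discretization and point-sampling step can be illustrated with a short sketch (our own toy code; the test signal and sizes are arbitrary and merely stand in for the unknown scene function):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 4096, 256                          # discretization size and sample budget

# Discrete version of a 2*pi-periodic scene function f(); here a toy signal.
x = 2.0 * np.pi * np.arange(n) / n
f = 70 * np.cos(1 * x) + 20 * np.cos(33 * x) + 10 * np.cos(101 * x)

# Point-sample the function at k randomly chosen grid locations: y = S f.
sample_idx = rng.choice(n, size=k, replace=False)
y = f[sample_idx]

# The same sampling written as an explicit k x n selection matrix S.
S = np.zeros((k, n))
S[np.arange(k), sample_idx] = 1.0
assert np.allclose(S @ f, y)
```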
Fig. 2. Results of the simple example of Sec. 4.1. (a) Plot of the signal f(x) which we want to integrate over the interval shown. (b) Magnitude of the Fourier transform of f(x). Since 3 cosines of different frequencies are added together to form f(x), its Fourier transform has six different spikes because of symmetry (so the sparsity is m = 6). (c) Linear and (d) log plots of the variance as a function of the number of samples to show the convergence of the four different integration algorithms. The faint dashed lines in the “random” and “stratified” curves (visible in the PDF) show the theoretical variance, which matches the experimental results. The log plot clearly shows the “waterfall” curve characteristic of compressed sensing reconstruction, where once an adequate number of samples is taken the signal is reconstructed perfectly every time. In this case, we need around 70 samples, which is roughly 10× the number of spikes in (b). (e–h) Plots of the estimated value of the integral vs. the number of samples for one run with the different integration algorithms. The correct value of the integral (100) is shown as a gray line. We can see that compressed sensing begins to approximate the correct solution around k = 5m = 30 samples and then “snaps” to the right answer for k > 10m = 60, which is much faster than the other approaches.
Therefore, unlike traditional Monte Carlo approaches, our random samples do not occur arbitrarily along the continuous domain of f(), but rather at a set of discrete locations that represent the samples of f. Once the set of samples that forms the measurement vector y has been taken, we use compressed sensing to solve for the coefficients f̂ that correspond to the non-zero basis functions. We can then take the inverse transform f = Ψ f̂ and integrate it down to get our final image. To help explain how our algorithm works, we now look at a simple 1D example.
4.1
Example of 1D Signal Reconstruction
At first glance, it might seem that what we are proposing is merely another kind of parametric fit, somehow fitting a function to our samples to
approximate f(). Although we are fitting a function to the measured data, compressed sensing offers us a fundamentally different way to do this than the traditional methods used for parametric fitting, such as least-squares. We can see the difference with a simple 1D example. Suppose we want to compute the definite integral from −1/4 to 1/4 of the following function f(x), which is unknown to us a priori but is shown in Fig. 2(a):

f(x) = α[aπ cos(2πax)] + β[bπ cos(2πbx)] + γ[cπ cos(2πcx)].

The signal is made up of three pure frequencies, and in this experiment we set a = 1, b = 33, and c = 101 so that there is a reasonable range of frequencies represented in the signal. This particular function is constructed so that the analytic integral is easy to compute; the integral of each of the terms in square brackets over the specified interval is equal to 1. This means that in this case, the desired definite integral is

I = ∫_{−1/4}^{1/4} f(x) dx = α + β + γ.
In this experiment, we set α = 70, β = 20, and γ = 10, so that the desired integral is I = 100. Within the context of our rendering problem, we assume that we do not know f(x) in analytical form, so our goal is to compute the integral value of 100 simply from a set of random point samples of f(x). The most common way to do this in computer graphics is to use Monte Carlo integration, which takes k uniformly-distributed random samples over the entire interval [−1/4, 1/4] and uses them to estimate the integral. Although the answer fluctuates based on the position of our measurements, as we add more and more samples the estimator slowly converges to the correct answer, as can be seen in the variance curves of Fig. 2(c, d) and the results of a single run while varying the number of samples in (e). We can compute the theoretical variance of the random Monte Carlo approach analytically, which gives us the theoretical variance shown in Fig. 2(c, d) as a thin, dashed line. We can see that the calculated theoretical variance matches well with the experimental results. The slow 1/k decay of variance with random Monte Carlo is less than desirable for rendering applications, so a common variance-reduction technique is stratified sampling. Fig. 2(f) shows the result of one run with this method, and indeed we notice that the estimate of the integral gets closer to the correct solution (shown by a thin gray line) more quickly than with the random approach. The theoretical variance of the stratified approach can be computed in software by computing the variance of each of the strata for every size k. The resulting curve also matches the experimental results, even predicting a small dip in variance around k = 101. However, stratified sampling still takes considerable time to converge, so it is worthwhile to examine other techniques that might be better. Another way we might consider computing the integral of this function from the random samples is to try to fit a parametric model to the measurements and then perform the integral on the model itself. Indeed, both our approach and the parametric fit approach require us to know something about the signal (e.g.,
that it can be compactly represented with sinusoids). However, closer observation reveals that our framework is based on fundamentally different theory than typical methods for fitting parametric models, and so it yields a considerably different result. To see why, let us work through the process of actually fitting a parametric sinusoidal model to our measured samples. Typically, this involves solving a least-squares problem, which in this case means solving y = A f̂ for the coefficients of f̂, where A = SΨ is the sampling matrix multiplied by the Fourier basis. Because we are solving the problem with least squares, we need to have a “thin” matrix (or at least a square one) for A, which means that the number of unknown coefficients in f̂ can be at most k, matching the number of observations in y. Since the Fourier transform of f has two sets of complex conjugate elements in f̂, the highest frequency we can solve for uniquely in this manner is at most k/2. This traditional approach is closely related to the Nyquist-Shannon sampling theorem [50], well-known in computer graphics, which states that to correctly reconstruct a signal we must have sampled at more than twice its highest frequency. Indeed, trying to fit a parametric model in this traditional manner means that we only get correct convergence of the integral once we have more samples than twice the highest frequency c = 101, or around k = 202. As can be seen in Fig. 2(g), the estimated value of the integral bounces around with fairly high variance until this point and then locks down to I = 100 when we start having enough samples to fit the sinusoids correctly. However, this process is not scalable, since it is dependent on the highest frequency of the signal. If we set c = 1,000, we would need ten times more samples to converge correctly. Our compressed sensing approach, on the other hand, has the useful property that its behavior is independent of the highest frequency. In another approach, we might consider solving the parametric fit problem using a “fat” A matrix, so that the number of unknown elements in f̂, which is n, can be much bigger than k. Perhaps this would allow us to solve for higher-frequency sinusoids even though we do not have enough samples. In this case the problem is under-determined and there could be many solutions with the same square error. Traditionally, these kinds of problems are solved with a least-norm algorithm, which finds the least-squares solution that also has the least norm in the ℓ₂ sense (where the ℓ₂ norm is the square root of the sum of the squares of the components). In our case the least-norm solution is given by f̂_ln = A^T(AA^T)⁻¹ y. Unfortunately, this works even worse than the least-squares fit for our example. It turns out that there are many “junk” signals that have a lower ℓ₂ norm than the true answer (because they contain lots of small values in their frequency coefficients instead of a few large ones), yet they still match the measured values in the least-squares sense. Although these traditional methods for fitting parametric models are commonly used in computer graphics, they do not work for this simple example because they are dependent on the frequency content of our signal. The theory of compressed sensing, on the other hand, offers us new possibilities since it
states that the number of samples is dependent on the sparsity of the signal, not on the particular frequency content it may have. In this example, we can apply CS to solve y = A f̂ and then integrate the signal as discussed in the previous section. The results shown in Fig. 2(h) show that we “snap” to the answer after about 70 samples, which makes sense since we have observed empirically that ROMP typically requires 10 times more samples than the sparsity m, which is 6 in this case. At this point, the estimated value of the integral is 100.0030, where the 0.003 error is caused by the discretization of f as compared to f(x). This is the bias of our estimator, but it is quite small, especially considering the value of the integral we are computing. Finally, we note that CS is able to reconstruct the discrete signal f perfectly and consistently with k = 75, even though we have less than one sample per period of the highest frequency (c = 101) in the signal, well below the Nyquist limit. Before we finish this discussion we should mention some of the implementation details used to acquire the results of Fig. 2. While the random Monte Carlo and stratified experiments sample the analytic version of f(x) directly to compute the integral, the parametric and CS approaches need to solve linear systems of equations and therefore use the discrete version f. For the CS experiments we set the size of f and f̂ to be n = 2¹² + 1, which means that the highest frequency that we can solve for is 2048 Hz. This is not a limiting factor, however, since we can increase this value (e.g., our later experiments for antialiasing in Sec. 6 use n = 2²⁰). When measuring the variance curves for random Monte Carlo, stratified sampling, and the parametric fit, we ran 250 trials for each k to reduce their noisiness. The compressed sensing reconstruction was more consistent, and we only used 75 trials for each k to compute its variance. Note that the variance of the CS approach is down to the 10⁻²⁴ range when it has more than 70 samples, which after 75 independent trials means that the value of the integral is rock solid and stable at this point. This simple example might help motivate our approach in theory, but we need to validate our framework by using it to solve a real-world problem in computer graphics. We spend the rest of the chapter discussing the application of this framework to a set of different problems in rendering.
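For completeness, the random and stratified estimators used in this comparison can be written in a few lines (again our own toy code, using the constants of the example above):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, c = 1, 33, 101
alpha, beta, gamma = 70.0, 20.0, 10.0
f = lambda x: (alpha * a * np.pi * np.cos(2 * np.pi * a * x)
               + beta * b * np.pi * np.cos(2 * np.pi * b * x)
               + gamma * c * np.pi * np.cos(2 * np.pi * c * x))

k = 256                                   # number of point samples
lo, hi = -0.25, 0.25                      # integration interval, length 0.5

# Plain Monte Carlo: average of f at uniform random positions times the length.
xs = rng.uniform(lo, hi, k)
mc_estimate = (hi - lo) * np.mean(f(xs))

# Stratified sampling: one uniform sample inside each of k equal strata.
edges = np.linspace(lo, hi, k + 1)
xs_strat = rng.uniform(edges[:-1], edges[1:])
strat_estimate = (hi - lo) * np.mean(f(xs_strat))

print(mc_estimate, strat_estimate)        # both approach the true value of 100
```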
5
Application to 2D Signals – Image Reconstruction
In our first example, we begin by looking at the problem of image reconstruction from a subset of pixels. The basic idea is to accelerate the rendering process by simply computing a subset of pixels and then reconstructing the missing pixels using the ones we measured. We plan to use compressed sensing to do this by looking for the pixel values that would create the sparsest signal possible in a transform domain. In this case, since we are working with 2D images, the suitable compression basis would be wavelets. Unfortunately, although wavelets are very good at compressing image data, they are incompatible with the point-sampling basis of our rendering system because they are not incoherent with point samples as required
by the Restricted Isometry Condition (RIC) of compressed sensing. To see why, we note that the coherence between a general sampling basis Ω and a compression basis Ψ can be found by taking the maximum inner product between any two basis elements of the two:

μ(Ω, Ψ) = √n · max_{1≤j,k≤n} |⟨ω_j, ψ_k⟩|.   (7)
Because the matrices are orthonormal, the resulting coherence lies in the range μ(Ω, Ψ) ∈ [1, √n] [51], with a fully incoherent pair having a coherence of 1. This is the case for the point-sampled Fourier transform, which is ideal for compressed sensing but unfortunately is not suitable for our application because of its lack of compressibility for 2D images, as shown in Fig. 1. If we use a wavelet as the compression basis (e.g., a 64² × 64² Daubechies-8 (DB-8) wavelet matrix for n = 64²), the coherence with a point-sampled basis is 32, which is only half the maximum coherence possible (√n = 64). This large coherence makes wavelets unsuitable to be used as-is in the compressed sensing framework. In order to reduce the coherence yet still exploit the wavelet transform, we propose a modification to Eq. 2. Specifically, we assume that there exists a blurred image f_b which can be sharpened to form the original image: f = Φ⁻¹ f_b, where Φ⁻¹ is a sharpening filter. We can now write the sampling process as y = Sf = SΦ⁻¹ f_b. Since the blurred image f_b is also sparse in the wavelet domain, we can incorporate the wavelet compression basis in the same way as before and get y = SΦ⁻¹Ψ f̂_b. We can now solve for the sparsest f̂_b:

min ‖f̂_b‖₀   s.t.   y = A f̂_b,   (8)
where A = SΦ⁻¹Ψ, using greedy algorithms such as OMP or ROMP. Once f̂_b has been found, we can compute our final image by taking the inverse wavelet transform and sharpening the result: f = Φ⁻¹Ψ f̂_b. In this work, our filter Φ is a Gaussian filter, and since we can represent the filtering process as a multiplication in the frequency domain, we write Φ = F^H G F, where F is the Fourier transform matrix and G is a diagonal matrix with the values of a Gaussian function along its diagonal. Substituting this into Eq. 8, we get:

min ‖f̂_b‖₀   s.t.   y = S F^H G⁻¹ F Ψ f̂_b.   (9)
We observe that G⁻¹ is also a diagonal matrix and would naively have the values 1/G_{i,i} along its diagonal. However, we must be careful when inverting the Gaussian function because it is prone to noise amplification. To avoid this problem, we use a linear Wiener filter to invert the Gaussian [52], which means that the diagonal elements of our inverse matrix G⁻¹ have the form G⁻¹_{i,i} = G_{i,i}/(G_{i,i}² + λ). Since the greedy algorithms (such as ROMP) we use to solve Eq. 3 require a “backward” matrix A† that “undoes” the effect of A (i.e., A†Av ≈ v), where A = SΦ⁻¹Ψ, we use a backward matrix of the form A† = Ψ⁻¹ΦS^T = Ψ⁻¹F^H G F S^T.
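The Wiener-regularized inversion of the Gaussian can be written down directly. The sketch below is our own 1D simplification (the actual filter operates on the 2D image spectrum), and it assumes a standard Gaussian shape exp(−½ f²/σ²) with parameter values taken from Table 2:

```python
import numpy as np

n = 64 * 64                 # length of the (flattened) signal
inv_sigma2 = 0.000065       # 1/sigma^2 of the frequency-domain Gaussian (Table 2, 13%)
lam = 0.109                 # Wiener regularization parameter lambda (Table 2, 13%)

# Frequency axis of the FFT and the diagonal of the Gaussian filter G.
freqs = np.fft.fftfreq(n) * n
G = np.exp(-0.5 * inv_sigma2 * freqs ** 2)

# Naive inversion 1/G would explode where G is tiny; the Wiener-style inverse
# G_inv[i] = G[i] / (G[i]^2 + lambda) stays bounded and limits noise amplification.
G_inv = G / (G ** 2 + lam)
```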
Fig. 3. Coherence vs. variance of the Gaussian matrix G. Since G is in the frequency domain, a larger variance means a smaller spatial filter. As the variance grows, the coherence converges to 32, the coherence of the point-sampled, 64² × 64² DB-8 matrix. The coherence should be as small as possible, which suggests a smaller variance for our Gaussian filter in the frequency domain. However, this results in a blurrier image f_b which is harder to reconstruct accurately. The optimal values for the variance were determined empirically and are shown in Table 2 for different sampling rates.
Note that for real image sizes, the matrix A will be too large to store in memory. For example, to render a 1024 × 1024 image with a 50% sampling rate, our measurement matrix A will have k × n = 5.5 × 10¹¹ elements. Therefore, our implementation must use a functional representation for A that can compute the required multiplications such as A f̂_b on the fly as needed. The addition of the sharpening filter means that our measurement matrix is composed of two parts: the point samples S and a “blurred wavelet” matrix Φ⁻¹Ψ which acts as the compression basis. This new compression basis can be thought of as either blurring the image and then taking the wavelet transform, or applying a “filtered wavelet” transform to the original image. To see how this filter reduces coherence, we plot the result of Eq. 7 as a function of the variance σ² of the Gaussian function of G in Fig. 3 for our 64² × 64² example. Note that the Gaussian G is in the frequency domain, so as the variance gets larger the filter turns into a delta function in the spatial domain and the coherence approaches 32, the value of the unfiltered coherence. As we reduce the variance of G, the filter gets wider in the spatial domain and the coherence is reduced by almost a factor of 2. Although it would seem that the variance of G should be as small as possible (lowering the coherence), this increases the amount of blur in f_b and hence the noise in our final result due to the inversion of the filter. We determined the optimal variances empirically on a single test scene and used the same values for all our experiments (see Table 2). In the end, the reduction of the coherence by a factor of 2 through the application of the blur filter was enough to yield good results with compressed sensing.
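Eq. 7 is straightforward to evaluate numerically. The following self-contained sketch (our own; it uses a small Haar wavelet rather than the DB-8 basis of the chapter) computes the coherence between the point-sampling basis and an orthonormal wavelet basis:

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar wavelet matrix for n a power of two."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                      # scaling (average) functions
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])     # detail (difference) functions
    return np.vstack([top, bottom]) / np.sqrt(2.0)

n = 64
Psi = haar_matrix(n).T                    # columns are wavelet basis vectors
Omega = np.eye(n)                         # point-sampling basis

# Mutual coherence of Eq. 7: sqrt(n) * max_{j,k} |<omega_j, psi_k>|.
mu = np.sqrt(n) * np.max(np.abs(Omega.T @ Psi))
print(mu)   # well above 1, i.e. point samples and wavelets are far from incoherent
```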
Table 2. Parameters for the Gaussian (1/σ²) and Wiener filters (λ). We iterated over the parameters of the Gaussian filter to find the ones that yielded the best reconstruction for the ROBOTS scene at the given sampling rates (%) for 1024 × 1024 reconstruction.

  %     1/σ²        λ      |   %     1/σ²        λ
  6%    0.000130    0.089  |  53%    0.000015    0.259
 13%    0.000065    0.109  |  60%    0.000014    0.289
 25%    0.000043    0.209  |  72%    0.000013    0.289
 33%    0.000030    0.209  |  81%    0.000011    0.299
 43%    0.0000225   0.234  |  91%    0.000008    0.399
To test our algorithm, we integrated our compressed sensing framework into both an academic ray tracing system (PBRT [24]) and a high-end, open-source ray tracer (LuxRender [53]). The integration of both was straightforward, since we only had to control the pixels being rendered (to compute only a fraction of the pixels) and then add a reconstruction module that performs the ROMP algorithm. In order to select the random pixels to measure, we used the boundary sampling method of the Poisson-disk implementation from Dunbar and Humphreys [54] to space out the samples in image space. For LuxRender, for example, these positions were provided to the ray tracing system through the PixelSampler class. The “low discrepancy” sampler was used to ensure that samples were only generated in the selected pixels. After the ray tracer evaluated the samples, the measurements were recorded into a data structure that was fed into the ROMP solver. The rest of the ray tracing code was left untouched. The ROMP solver was based on the code by Needell and Vershynin [47] available on their website, but re-written in C++ for higher performance. We leverage the Intel Math Kernel Library 10.1 (MKL) [55] to accelerate the linear algebra computations and to perform the Fast Fourier Transform for our Gaussian filter. In addition, we use the Stanford LSQR solver [56] to solve the least-squares step at the end of ROMP. The advantage of LSQR is that it is functional-based, so we do not need to represent the entire A matrix in memory, which, as mentioned earlier, can get quite large. To describe the implementation of the functional version of the measurement matrix A, we first recall that A = S F^H G⁻¹ F Ψ from Eq. 9. The inverse wavelet transform Ψ was computed using the lifting algorithm [11], and the MKL library was used to compute the Fourier and inverse Fourier transforms of the signal. To apply the filter, we simply weight the coefficients by the Gaussian function described in the algorithm. After applying the inverse Fourier transform to the filtered signal, we then simply take the samples from the desired positions. This gives us a way to simulate the effect of the matrix A in our ROMP algorithm without explicitly specifying the entire matrix. In addition, we found empirically that ROMP behaved better when the maximum number of coefficients added in each iteration was bounded. For the experiments in this chapter, we used a bound of 2k/i, where k is the number of pixels measured and i is the maximum number of ROMP iterations, which we set to 30.
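As an illustration of such a functional representation, the following sketch applies A = S F^H G⁻¹ F Ψ to a coefficient vector without ever forming the matrix. It is our own simplified version: `inv_wavelet` stands in for the lifting-based inverse wavelet transform, and all names are hypothetical.

```python
import numpy as np

def apply_A(fb_hat, inv_wavelet, G_inv, sample_idx, shape):
    """Apply A = S F^H G^{-1} F Psi to wavelet coefficients without storing A.

    fb_hat      : flat vector of wavelet coefficients of the blurred image
    inv_wavelet : callable implementing Psi (inverse wavelet transform to an image)
    G_inv       : 2D array, regularized inverse Gaussian in the frequency domain
    sample_idx  : (row, col) index arrays of the pixels that were actually rendered
    shape       : (height, width) of the image
    """
    fb = inv_wavelet(fb_hat).reshape(shape)     # Psi: coefficients -> blurred image
    spectrum = np.fft.fft2(fb)                  # F
    sharpened = np.fft.ifft2(G_inv * spectrum)  # F^H G^{-1} F: undo the blur
    return sharpened.real[sample_idx]           # S: keep only the rendered pixels
```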
Table 3. Timing results in minutes of our algorithm. Pre-process includes loading the models and creating the acceleration data structures. Full render is the time to sufficiently sample every pixel to generate the ground-truth image. CS Recon is the time it took our reconstruction algorithm to solve for a 1024 × 1024 image with 75% of the samples. The last column shows the percentage of rays that could be traced instead of using our approach; because our CS reconstruction is fast, this number is fairly small. We ignore post-processing effects because these are on the order of seconds and are negligible. We also do not include the cost of the interpolation algorithm since it took around 10 seconds to triangulate and interpolate the samples.

  Scene    Pre-process   Full Render   CS Recon      %
  Robots        0.25          611         11.9     1.9%
  Watch         0.28          903         13.5     1.4%
  Sponza       47             634         12.0     1.9%
After the renderer finishes computing the samples, ROMP operates on the input vector y. We set the target sparsity to one fifth of the number of samples k, which has been observed to work well in the CS literature [57]. ROMP uses the Gaussian filter of Eq. 9 as part of the reconstruction process. The parameters of the Gaussian filter were set through iterative experiments on a single scene, but once set they were used for all the scenes in this chapter. Table 2 shows the actual values used in our experiments. If a sampling rate is used that is not in the table, the nearest entries are interpolated. The compression basis used in this work is the Cohen-Daubechies-Feauveau (CDF) 9-7 biorthogonal wavelet, which is particularly well-suited for image compression and is the wavelet used in JPEG2000 [58]. Since we are dealing with color images, the image signal must be reconstructed in all three channels: R, G, and B. To accelerate reconstruction, we transform the colors to YUV space, use the compressive rendering framework only for the Y channel, and use the Delaunay interpolation (described below) for the other two. The error introduced by doing this is not noticeable, as we are much more sensitive to the Y channel in an image than to the other two. To compare our results, we test our approach against a variety of other algorithms that might be used to fill in the missing pixel data in the renderer. For example, we compare against the popular inpainting method of Bertalmio et al. [34], using the implementation by Alper and Mavinkurve [59]. To compare against approaches from the non-uniform sampling community, we implemented the Marvasti algorithm [38] and used the MATLAB version of the ACT algorithm provided in the Nonuniform Sampling textbook [36]. Finally, rendering systems in practice typically use interpolation methods to estimate values between computed samples. Unfortunately, many of the convolution-based methods that work so well for uniform sampling simply do not work when dealing with non-uniform sample reconstruction (see, e.g., the discussion by Mitchell [26]). For this work we implemented the piece-wise cubic multi-stage filter described by Mitchell [26], with the modification that we put back the original samples at every stage to improve performance. Finally, we also implemented the most common interpolation algorithm used in practice, which uses Delaunay
Fig. 4. Run-time complexity of our ROMP reconstruction algorithm, where n is the total size of our image (width × height). The curve of n log n is shown for comparison. We tested our algorithm on images of size 32×32 all the way through 2048×2048. Even at the larger sizes, the performance remained true to the expected behavior. Note that the complexity of the reconstruction algorithm is independent of scene complexity.
triangulation to mesh the samples and then evaluates the color of the missing pixels in between by interpolating each triangle of the mesh, e.g., as described by Painter and Sloan [60]. This simple algorithm provides a piece-wise linear reconstruction of the image, which turned out to be one of the better reconstruction techniques. The different algorithms were all tested on a Dell Precision T3400 with a quad-core, 3.0 GHz Intel Core2 Extreme QX6850 CPU with 4 GB RAM capable of running 4 threads. The multi-threading is used by LuxRender during pixel sampling and by the Intel MKL library when solving the ROMP algorithm during reconstruction. Since most of the reconstruction algorithms have border artifacts, we render a larger frame and crop out a margin around the edges. For example, the 1000 × 1000 images were rendered at 1024 × 1024 with a 12-pixel border.
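A minimal version of this Delaunay-based baseline can be written with SciPy, whose griddata routine triangulates the scattered sample positions and interpolates linearly inside each triangle (our own sketch, not the implementation used for the measurements in this chapter):

```python
import numpy as np
from scipy.interpolate import griddata

def delaunay_fill(sample_rows, sample_cols, sample_values, height, width):
    """Piece-wise linear reconstruction of missing pixels from scattered samples.

    griddata triangulates the sample positions (Delaunay) and linearly
    interpolates inside each triangle of the mesh.
    """
    points = np.column_stack([sample_rows, sample_cols])
    grid_r, grid_c = np.mgrid[0:height, 0:width]
    image = griddata(points, sample_values, (grid_r, grid_c), method="linear")
    # Pixels outside the convex hull of the samples are NaN; fill with nearest.
    nearest = griddata(points, sample_values, (grid_r, grid_c), method="nearest")
    return np.where(np.isnan(image), nearest, image)
```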
5.1
Timing Performance
In order for the proposed framework to be useful, the CS reconstruction of step 2 has to be fast and take less time than the alternative of simply brute-force rendering more pixels. Table 3 shows the timing parameters of various scenes rendered with LuxRender. We see that the CS step takes approximately 10 minutes to run for a 1024 × 1024 image with 75% of the samples with our unoptimized C++ implementation. Since the full-frame rendering times are on the order of 6 to 15 hours, the CS reconstruction constitutes less than 2% of the total rendering time, which means that in the time it takes to run our reconstruction algorithm only 2% of extra pixels could be computed. On the other hand, the inpainting
Table 4. MSE performance of the various algorithms (×10⁻⁴) for different scenes. All scenes were sampled with 60% of the pixel samples.

  Scene    Interp   Cubic    ACT    Marvasti   Inpaint    CS
  Robots    2.00     4.94    1.98     2.11       4.24    1.72
  Watch     6.42    11.00    6.15     6.68      15.00    5.34
  Sponza    2.38     6.77    2.38     2.65       4.70    2.11
implementation we tested took approximately one hour to compute the missing pixels, making our ROMP reconstruction reasonably efficient by comparison. We also examine the run-time complexity of the ROMP reconstruction as a function of image size to see how the processing times scale, for images from 32 × 32 to 2048 × 2048 (see Fig. 4). We can see that it behaves as O(n log n), as predicted by the model. For the image sizes that we are dealing with (≤ 10⁷ pixels) this is certainly acceptable, given the improvement in image quality we get with our technique. Finally, we note that our algorithm runs in image space, so it is completely independent of scene complexity. On the other hand, rendering algorithms scale as O(n), but the constants involved depend on the scene complexity and have a significant impact on the rendering time. Over the past few decades, feature-film rendering times have remained fairly constant, as advances in hardware and algorithms are offset by increased scene complexity. Since our algorithm is independent of the scene complexity, it will continue to be useful in the foreseeable future.
5.2
Image Quality
Standard measures for image quality are typically ℓ₂ distance measures. In this work, we use the mean squared error (MSE), assuming that the pixels in the image have a range of 0 to 1, and compare the reconstructed images from all the approaches to the ground-truth original. Table 4 shows the MSE for the various algorithms we tested: the first two are interpolation algorithms, followed by the two algorithms from the non-uniform sampling community, then the result of inpainting, and finally the result of the CS-based reconstruction proposed in this work. Our algorithm has the lowest MSE, something that we observed in all our experiments. A few additional points are worth mentioning. First of all, we noticed that the inpainting algorithm performed fairly poorly in our experiments. The reason for this is that in our application the holes are randomly positioned, while these techniques require bands of known pixels around the hole (i.e., spatial locality). Unfortunately, this is not easy to do in a rendering system since we cannot cluster the samples a priori without knowledge of the resulting image. Also disappointing was Mitchell's multistage cubic filter, which tended to overblur the image when we set the kernel large enough to bridge the larger holes in the image. Although the algorithms from the non-uniform sampling community (ACT and Marvasti) perform better, they are on par with the Delaunay interpolation used in rendering, which works remarkably well.
Fig. 5. Log error curves as a function of the number of samples for two test scenes using our technique and the two best competing reconstruction algorithms. Our CS reconstruction beats both Delaunay interpolation (D. int) and ACT, requiring 5% to 10% fewer samples to achieve a given level of quality.
To see how our algorithm would work at different sampling rates, we compare it against the two best competing methods (Delaunay interpolation and ACT) for two of our scenes in Fig. 5. We observe that to achieve a given image quality, Delaunay interpolation and ACT require about 5% to 10% more samples than our approach. When the rendering time is 10 hours, this adds up to an hour of savings to achieve comparable quality. Furthermore, since our algorithm is completely independent of scene complexity, the benefit of our approach over interpolation becomes more significant as the rendering time increases. However, MSE is not the best indicator of visual quality, which is after all the most important criterion in high-end rendering. To compare the visual quality of our results, we refer the readers to Fig. 12 at the end of the chapter. We observe that compressive rendering performs much better than interpolation in regions with sharp edges or those that are slightly blurred, a good property for a rendering system. To see this, we direct readers to the second inset of the robots scene in Fig. 12. Although the fine grooves in the robot's arm cannot be reconstructed faithfully by any of the other algorithms, compressed sensing is able to do this by selecting the values for the missing pixel locations that yield a sparse wavelet representation. Another good example can be found on the third row of the watch scene. Here there is a pixel missing between two parts of the letter “E,” which only our algorithm is able to reconstruct correctly. All other techniques simply interpolate between the samples on either side of the missing pixel and fill in this sample incorrectly. Since clean, straight lines are sparser than the jumbled noise estimated by the other approaches, however, they are selected by our technique. Although the ACT algorithm performs reasonably well overall, it suffers from ringing when the number of missing samples is high and there is a sharp edge (see, e.g., the last inset of the watch scene) because of the fitting of trigonometric polynomials to the point samples.
Fig. 6. Illustration of our antialiasing algorithm. (a) Original continuous signal f(x) to be antialiased over the 4 × 4 pixel grid shown. (b) In our approach, we first take k samples of the signal aligned on an underlying grid of fixed resolution n. This is equivalent to taking k random samples of the discrete signal f. (c) The measured samples form our vector of measurements y, with the unknown parts of f shown in green. Using ROMP, we solve y = A f̂ for f̂, where A = SΨ; S is the sampling matrix corresponding to the samples taken, and Ψ is the blurred-wavelet basis described in Sec. 5. (d) Our approximation to f, computed by applying the synthesis basis to f̂ (i.e., f = Ψ f̂). We integrate this approximation over each pixel to get our antialiased result.
6
Application to 2D Signals – Antialiasing
In this section, we present another example of 2D scene reconstruction by applying our framework to the problem of box-filtered antialiasing. The basic idea is simple (see the overview in Fig. 6). We first take a few random point samples of the scene function f() per pixel. Unlike the previous section, we are no longer dealing with pixels of the image but rather with samples on an underlying grid of higher resolution than the image, which matches the size of the unknown discrete function f and is aligned with its samples. We then use ROMP to approximate a solution to Eq. 3, which can then be used to calculate f. Once we have f, we can integrate it over each pixel to perform our antialiasing. The observation is that if f is sparse in the transform domain, we will need only a small set of samples to evaluate this integral accurately. Fig. 7 shows a visual comparison of our approach against stratified and random supersampling, which are also used for antialiasing images. This is very similar to our previous approach, except that now we have introduced the notion of applying integration to the reconstructed function to produce the final image.
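The final integration step of this antialiasing scheme is just a box average over each pixel's block of the reconstructed fine grid; a small sketch (our own, with hypothetical names) is:

```python
import numpy as np

def box_filter_downsample(f_highres, spp):
    """Average each spp x spp block of the reconstructed grid into one pixel.

    f_highres : reconstructed signal on the fine grid, shape (H*spp, W*spp)
    spp       : linear supersampling factor (samples per pixel side)
    """
    H = f_highres.shape[0] // spp
    W = f_highres.shape[1] // spp
    blocks = f_highres[:H * spp, :W * spp].reshape(H, spp, W, spp)
    return blocks.mean(axis=(1, 3))        # box-filtered antialiased image
```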
7
Application to 3D Signals – Motion Blur
We now describe the application of our framework to the rendering of motion blur, which involves the reconstruction of a 3D scene function. Motion blur occurs in dynamic scenes when the projected image changes as it is integrated over the time the camera aperture is open. Traditionally, Monte Carlo rendering systems emulate motion blur by randomly sampling rays over time and accumulating them together to estimate the integral [61]. Conceptually, our approach is very similar to that of our antialiasing algorithm.
Fig. 7. Visual comparison for the Garden scene. Each row has a different number of samples per pixel (from top to bottom: 1, 4).
We first take a set of samples y of the scene, except that now the measurements are also spaced out in time to sample the discrete spatio-temporal volume f, which represents a set of video frames over the time the aperture was open. We then use compressed sensing to reconstruct f̂, the representation of the volume in the 3D Fourier transform domain Ψ. After applying the inverse transform to recover an approximation of the original f, we can then integrate it over time to achieve our desired result. An example of the computed motion blur is shown in Fig. 8.
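For the spatio-temporal case, the measurement operator can again be applied functionally, and the motion-blurred image is obtained by integrating the reconstructed volume over time. The sketch below is our own simplified illustration (a plain inverse 3D FFT followed by point sampling and a box integral over the shutter interval):

```python
import numpy as np

def A_fourier(f_hat_3d, sample_idx):
    """Forward operator A = S F^{-1}: 3D Fourier coefficients -> point samples.

    f_hat_3d   : complex array of shape (T, H, W), the spatio-temporal spectrum
    sample_idx : tuple of (t, y, x) index arrays of the rendered samples
    """
    volume = np.fft.ifftn(f_hat_3d).real      # back to the spatio-temporal volume
    return volume[sample_idx]

def motion_blurred_image(volume):
    """Box-integrate the reconstructed volume over time (the open shutter)."""
    return volume.mean(axis=0)
```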
8
Application to 3D Signals – Video
An obvious extension of the motion blur application of the previous section is to view the individual frames of the reconstructed spatio-temporal volume directly, resulting in an algorithm for rendering animated sequences. This is a significant advantage over the more complex adaptive approaches that have been proposed for rendering (e.g., MDAS [62] or AWR [63]), which are difficult to extend to animated scenes because of their complexity. This is the reason that these previous approaches have dealt exclusively with the rendering of static imagery. To generate an animated sequence, these approaches render a set of static frames by evaluating each frame independently and do not take into account their temporal coherence. Our approach, on the other hand, uses compressed sensing to
Fig. 8. Visual comparison of motion blur results for the Train scene. The reference image was rendered with 70 temporal samples per pixel, while the other two were rendered with a single random sample per pixel in time. Our result was reconstructed assuming a spatio-temporal volume of 24 frames. Images were rendered at a resolution of 1000 × 1000.
evaluate a sparse version of the signal in x, y, and t in the 3D Fourier domain, so it fully computes the entire spatio-temporal volume, which we can view as frames in the video sequence. To demonstrate this, we show individual frames in Fig. 9 from the dynamic train scene of Fig. 8. For comparison, we implemented an optimized linear interpolation using a 3-D Gaussian kernel, which is the best that convolution methods can do when the sampling rate is so low. We also compare against our earlier compressive rendering work [5], which reconstructs the set of static images individually using sparsity in the wavelet domain. The second column of Fig. 9 shows the missing samples in green, which results in an image that is almost entirely green when we have a 1% sampling rate (only 1/100 pixels in the spatio-temporal volume are calculated). It is remarkable that our algorithm can reconstruct a reasonable image even at this extremely low sampling density, while the other two approaches fail completely. This suggests that our technique could be useful for pre-visualization, since we get a reasonable image with 100× fewer samples. The reconstruction time is less than a minute per frame using the unoptimized C++ SpaRSA implementation, while the rendering time is 8 minutes per frame on an Intel Xeon 2.93 GHz CPU-based computer with 16 GB RAM. This means that the ground-truth reference 128-frame video would take over 17 hours to compute. On the other hand, using our reconstruction with only 1% of the samples, we get a video whose frames look like the one shown in Fig. 9 (right image) in less than 2 hours.
9
Application to 4D Signals – Depth of Field
We now show the application of our framework to a 4D scene function to demonstrate the rendering of depth-of-field. Monte Carlo rendering systems compute depth-of-field by estimating the integral of the radiance of the incoming rays over the aperture of the lens through a set of random points on the virtual lens [61].
Fig. 9. Reconstructing frames in an animated sequence. We can use our approach to render individual frames in an animated sequence and leverage the coherence both spatially and temporally in the Fourier domain. The three rows represent different frames in the train sequence with the sampling rate varied for each (1% for the top row, 10% for the middle, and 25% for the bottom). The first column shows the reference frame of the fully-rendered sequence, the second column shows the positions of the samples (shown in white), the third column shows the samples available for this particular frame (unknown samples shown in green), the fourth column shows a reconstruction by convolving the samples with an optimized 3-D Gaussian kernel with the variance adjusted to the sampling rate (the best possible linear filter), the fifth column shows the results of reconstructing each frame separately using the 2D algorithm from Sec. 5, and the last column reconstructs the entire 3-D volume using a 3D Fourier transform. These images are rendered at 512 × 512 with 128 frames in the video sequence.
This means that we must choose two additional random parameters for each sample, which tell us the position at which the ray passes through the virtual lens. Therefore, in this application we parameterize each sample by its image-space coordinate (x, y) and these two additional parameters (u, v). As mentioned in the previous sections, since our compressed sensing reconstruction works on discrete positions, we uniformly choose the positions on the virtual lens to lie on a grid. The proposed framework is general and therefore easy to map to this new problem. We take our sample measurements y by sampling this new 4D space, and then reconstruct the whole space f using compressed sensing by assuming sparsity in the 4D Fourier domain. For this example, we use the SpaRSA solver to compute the sparse transform-domain signal f̂. The final image is calculated in the end by integrating the reconstructed 4D signal over all u and v for each pixel. Fig. 10 shows an example of the output of the algorithm for depth-of-field.
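Once the 4D signal has been reconstructed, the per-pixel integral over the lens coordinates is a simple average; a minimal sketch (our own, with an assumed (x, y, u, v) memory layout) is:

```python
import numpy as np

def integrate_lens(f_4d):
    """Collapse the reconstructed 4D signal (x, y, u, v) to the final image.

    f_4d : array of shape (H, W, U, V); each pixel stores a U x V grid of
           lens positions whose radiance is averaged over the aperture.
    """
    return f_4d.mean(axis=(2, 3))
```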
10
Application to 4D Signals – Area Light Source
We can also apply the proposed framework to reconstruct a 4D scene with an area light source. This extension is fairly similar to that of the depth-of-field effect, with the only difference being that here the two random variables represent points on the area light source. In this way, we parameterize each sample by its image-space coordinate (x, y) along with the position on the area light source (p, q). Again, we sample this 4D space and reconstruct it with SpaRSA, assuming sparsity in the 4D Fourier domain. Finally, we integrate over the area light coordinates p and q for each pixel in the reconstructed space to determine the final pixel color. Fig. 11 shows an example of the result of the algorithm for an area light source.
Fig. 10. Visual comparison of depth-of-field results for the Dragon scene. The reference image was rendered with 256 samples per pixel. Our result was generated by reconstructing a signal of size 340 × 256 × 16 × 16 and integrating the 16 × 16 samples over (u, v) for each of the 340 × 256 pixels.
Fig. 11. Visual comparison of area light source results for the Buddha scene. The reference image was rendered with 256 samples per pixel. Our result was generated by reconstructing a signal of size 300×400×16×16 and integrating the 16×16 samples over (p, q) for each of the 300 × 400 pixels.
Fig. 12. Results of scenes with the reconstruction algorithms, each with a different % of computed samples. From top to bottom: robots (60%), watch (72%), sponza (87%). The large image shows the ground truth (GT) rendering to show context. The smaller columns show the ground truth (GT) of the inset region and the ray-traced pixels (unknown pixels shown in green), followed by the results of Delaunay interpolation (D. int), ACT, inpainting, and compressed sensing (CS). It can be seen that our algorithm produces higher-quality images than the others.
11
Discussion
The framework proposed in this chapter is fairly general and, as shown in the last few sections, it can reconstruct a wide range of Monte Carlo effects. However, there are some issues that currently affect its practical use for production rendering. One of the current limiting factors for the performance of the system is the speed of the reconstruction by the solver. In this work, we used C++ implementations of ROMP [47] and SpaRSA [48], but these were still relatively slow (requiring tens of minutes for some of the reconstructions), which decreases the performance of the overall system. However, the applied mathematics community is constantly developing new CS solvers, and there are already solvers that appear to be much faster than ROMP or SpaRSA that we are just starting to experiment with. There is also the possibility of implementing the CS solver on the GPU, which would give us a further speed-up for our algorithm. Another issue is the memory usage of the algorithm. Currently, the solvers need to store the entire signal f (or its transform f̂) in memory while calculating its components. As the dimension of our scene function grows, the size of f grows exponentially. For example, if we want to do compressive rendering for a 6D scene with depth-of-field and an area light source, say with an image resolution of 1024 × 1024 and sample grids of 16 × 16 for the lens and the area light source, we would have to store an f with n = 2³⁶ entries, which would require 256 GB of memory. Therefore, our current approach suffers from the “curse of dimensionality” that can plague other approaches for multidimensional signal integration. We are currently working on modifying the solvers to ease the memory requirements of the implementation. Nevertheless, this chapter presents a novel way to look at Monte Carlo rendering by treating it as the reconstruction of a multidimensional function that we assume to be sparse in a transform domain, using the tools of compressed sensing. This work might encourage other researchers to explore new ways to solve the rendering problem.
12
Conclusion
In this chapter, we have presented a general framework for compressive rendering that shows how we can use a distributed ray tracing system to take a small set of point samples of a multidimensional function f (), which we then can approximately reconstruct using compressed sensing algorithms such as ROMP and SpaRSA by assuming sparsity in a transform domain. After reconstruction, we can then integrate the signal down to produce the final rendered image. This algorithm works for a general set of Monte Carlo effects, and we demonstrate results with motion-blur, depth-of-field, and area light sources. Acknowledgments. The authors would like to thank Dr. Yasamin Mostofi for fruitful discussions regarding this work. Maziar Yaesoubi, Nima Khademi Kalantari, and Vahid Noormofidi also helped to acquire some of the results presented. The Dragon (Figs. 1 and 10), Buddha (Fig. 11), Sponza (Fig. 12)
and Garden (Fig. 7) scenes are from the distribution of the PBRT raytracer by Pharr and Humphreys [24]. The Robot (Fig. 12) and Train (Fig. 6, 8 and 9) scenes are from J. Peter Lloyd, and the Watch (Fig. 12) scene is from Luca Cugia. This work was funded by the NSF CAREER Award #0845396 “A Framework for Sparse Signal Reconstruction for Computer Graphics.”
References

1. Veach, E.: Robust Monte Carlo methods for light transport simulation. PhD thesis, Stanford University, Stanford, CA, USA (1998); Adviser: Leonidas J. Guibas
2. Dutré, P., Bala, K., Bekaert, P., Shirley, P.: Advanced Global Illumination. AK Peters Ltd., Wellesley (2006)
3. Jensen, H.W.: Realistic image synthesis using photon mapping. A. K. Peters, Ltd., Natick (2001)
4. Sen, P., Darabi, S.: Compressive rendering: A rendering application of compressed sensing. IEEE Transactions on Visualization and Computer Graphics 17, 487–499 (2011)
5. Sen, P., Darabi, S.: Compressive estimation for signal integration in rendering. In: Computer Graphics Forum (Proceedings of Eurographics Symposium on Rendering (EGSR) 2010), vol. 29, pp. 1355–1363 (2010)
6. Sen, P., Darabi, S.: Exploiting the sparsity of video sequences to efficiently capture them. In: Magnor, M., Cremers, D., Zelnik-Manor, L. (eds.) Dagstuhl Seminar on Computational Video (2010)
7. Sen, P., Darabi, S.: Details and implementation for compressive estimation for signal integration in rendering. Technical Report EECE-TR-10-0003, University of New Mexico (2010)
8. Schröder, P.: Wavelets in computer graphics. Proceedings of the IEEE 84, 615–625 (1996)
9. Schröder, P., Sweldens, W.: Wavelets in computer graphics. In: SIGGRAPH 1996 Course Notes (1996)
10. Stollnitz, E.J., Derose, T.D., Salesin, D.H.: Wavelets for computer graphics: theory and applications. Morgan Kaufmann Publishers Inc., San Francisco (1996)
11. Mallat, S.: A Wavelet Tour of Signal Processing, 2nd edn. Academic Press, London (1999)
12. Hanrahan, P., Salzman, D., Aupperle, L.: A rapid hierarchical radiosity algorithm. SIGGRAPH Comput. Graph. 25, 197–206 (1991)
13. Gortler, S.J., Schröder, P., Cohen, M.F., Hanrahan, P.: Wavelet radiosity. In: SIGGRAPH 1993, pp. 221–230 (1993)
14. Lischinski, D., Tampieri, F., Greenberg, D.P.: Combining hierarchical radiosity and discontinuity meshing. In: SIGGRAPH, pp. 199–208 (1993)
15. Schröder, P., Gortler, S.J., Cohen, M.F., Hanrahan, P.: Wavelet projections for radiosity. Computer Graphics Forum 13 (1994)
16. Ramamoorthi, R., Hanrahan, P.: An efficient representation for irradiance environment maps. In: SIGGRAPH (2001)
17. Sloan, P.P., Kautz, J., Snyder, J.: Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. In: SIGGRAPH, pp. 527–536 (2002)
18. Ng, R., Ramamoorthi, R., Hanrahan, P.: All-frequency shadows using non-linear wavelet lighting approximation. ACM Trans. Graph. 22, 376–381 (2003)
19. Ng, R., Ramamoorthi, R., Hanrahan, P.: Triple product wavelet integrals for all-frequency relighting. In: SIGGRAPH, pp. 477–487 (2004)
20. Malzbender, T.: Fourier volume rendering. ACM Trans. Graph. 12, 233–250 (1993)
21. Totsuka, T., Levoy, M.: Frequency domain volume rendering. In: SIGGRAPH, pp. 271–278 (1993)
22. Gross, M.H., Lippert, L., Dittrich, R., Häring, S.: Two methods for wavelet-based volume rendering. Computers and Graphics 21, 237–252 (1997)
23. Bolin, M., Meyer, G.: A frequency based ray tracer. In: SIGGRAPH 1995: Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pp. 409–418 (1995)
24. Pharr, M., Humphreys, G.: Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann Publishers Inc., San Francisco (2004)
25. Whitted, T.: An improved illumination model for shaded display. Communications of the ACM 23, 343–349 (1980)
26. Mitchell, D.P.: Generating antialiased images at low sampling densities. In: SIGGRAPH, pp. 65–72 (1987)
27. Walter, B., Drettakis, G., Parker, S.: Interactive rendering using the render cache. In: Lischinski, D., Larson, G. (eds.) Proceedings of the 10th Eurographics Workshop on Rendering, vol. 10, pp. 235–246. Springer-Verlag/Wien, New York, NY (1999)
28. Tole, P., Pellacini, F., Walter, B., Greenberg, D.: Interactive global illumination in dynamic scenes. ACM Trans. Graph. 21, 537–546 (2002)
29. Pighin, F., Lischinski, D., Salesin, D.: Progressive previewing of ray-traced images using image plane discontinuity meshing. In: Proceedings of the Eurographics Workshop on Rendering 1997, pp. 115–125. Springer, London (1997)
30. Guo, B.: Progressive radiance evaluation using directional coherence maps. In: SIGGRAPH 1998: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 255–266. ACM, New York (1998)
31. Bala, K., Walter, B., Greenberg, D.P.: Combining edges and points for interactive high-quality rendering. ACM Trans. Graph. 22, 631–640 (2003)
32. Sen, P., Cammarano, M., Hanrahan, P.: Shadow silhouette maps. ACM Transactions on Graphics 22, 521–526 (2003)
33. Sen, P.: Silhouette maps for improved texture magnification. In: HWWS 2004: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pp. 65–73. ACM, New York (2004)
34. Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: SIGGRAPH 2000: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 417–424 (2000)
35. Masnou, S., Morel, J.M.: Level lines based disocclusion. In: Proceedings of ICIP, pp. 259–263 (1998)
36. Marvasti, F.: Nonuniform Sampling: Theory and Practice. Kluwer Academic Publishers, Dordrecht (2001)
37. Feichtinger, H., Gröchenig, K., Strohmer, T.: Efficient numerical methods in nonuniform sampling theory. Numer. Math. 69, 423–440 (1995)
38. Marvasti, F., Liu, C., Adams, G.: Analysis and recovery of multidimensional signals from irregular samples using nonlinear and iterative techniques. Signal Process. 36, 13–30 (1994)
39. Gu, J., Nayar, S., Grinspun, E., Belhumeur, P., Ramamoorthi, R.: Compressive structured light for recovering inhomogeneous participating media. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS, vol. 5305, pp. 845–858. Springer, Heidelberg (2008)
40. Peers, P., Mahajan, D., Lamond, B., Ghosh, A., Matusik, W., Ramamoorthi, R., Debevec, P.: Compressive light transport sensing. ACM Trans. Graph. 28, 1–18 (2009)
41. Sen, P., Darabi, S.: Compressive Dual Photography. Computer Graphics Forum 28, 609–618 (2009)
42. Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory 52, 489–509 (2006)
43. Donoho, D.L.: Compressed sensing. IEEE Trans. on Information Theory 52, 1289–1306 (2006)
44. Rice University Compressive Sensing Resources website (2009), http://www.dsp.ece.rice.edu/cs/
45. Candès, E.J., Rudelson, M., Tao, T., Vershynin, R.: Error correction via linear programming. In: IEEE Symposium on Foundations of Computer Science, pp. 295–308 (2005)
46. Tropp, J.A., Gilbert, A.C.: Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. on Information Theory 53, 4655–4666 (2007)
47. Needell, D., Vershynin, R.: Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit (2007) (preprint)
48. Wright, S., Nowak, R., Figueiredo, M.: Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing 57, 2479–2493 (2009)
49. Candès, E.J., Tao, T.: Near optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. on Information Theory 52, 5406–5425 (2006)
50. Shannon, C.E.: Communication in the presence of noise. Proc. Institute of Radio Engineers 37, 10–21 (1949)
51. Donoho, D.L., Huo, X.: Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47, 2845–2862 (2001)
52. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (2001)
53. LuxRender (2009), http://www.luxrender.net/
54. Dunbar, D., Humphreys, G.: A spatial data structure for fast Poisson-disk sample generation. ACM Trans. Graph. 25, 503–508 (2006)
55. Intel Math Kernel Library (2009), http://www.intel.com/
56. Stanford Systems Optimization Laboratory software website (2009), http://www.stanford.edu/group/SOL/software/lsqr.html
57. Tsaig, Y., Donoho, D.L.: Extensions of compressed sensing. Signal Process. 86, 549–571 (2006)
58. Taubman, D.S., Marcellin, M.W.: JPEG 2000: Image Compression Fundamentals, Standards and Practice. Springer, Heidelberg (2001)
59. Alper, E., Mavinkurve, S.: Image inpainting implementation (2002), http://www.eecs.harvard.edu/~sanjay/inpainting/
60. Painter, J., Sloan, K.: Antialiased ray tracing by adaptive progressive refinement. SIGGRAPH Comput. Graph. 23, 281–288 (1989)
61. Cook, R.L., Porter, T., Carpenter, L.: Distributed ray tracing. SIGGRAPH Comput. Graph. 18, 137–145 (1984)
62. Hachisuka, T., Jarosz, W., Weistroffer, R.P., Dale, K., Humphreys, G., Zwicker, M., Jensen, H.W.: Multidimensional adaptive sampling and reconstruction for ray tracing. ACM Trans. Graph. 27, 1–10 (2008)
63. Overbeck, R.S., Donner, C., Ramamoorthi, R.: Adaptive wavelet rendering. ACM Trans. Graph. 28, 1–12 (2009)
Efficient Rendering of Light Field Images Daniel Jung and Reinhard Koch Computer Science Department, Christian-Albrechts-University Kiel Hermann-Rodewald-Str. 3, 24118 Kiel, Germany
Abstract. Recently a new display type has emerged that is able to display 50,000 views, offering a full-parallax autostereoscopic view of static scenes. With advances in manufacturing technology, multi-view displays come with more and more views of dynamic content, closing the gap to this high-quality full parallax display. The established method of content generation for synthetic stereo images is to render both views. To ensure a high quality these images are often ray traced. With the increasing number of views, rendering of all views is not feasible for multi-view displays. Therefore, methods are required that can efficiently render the large number of different views required by those displays. In the following a complete solution is presented that describes how all views for a full parallax display can be rendered from a small set of input images and their associated depth images with an image-based rendering algorithm. An acceleration of the rendering of two orders of magnitude is achieved by different parallelization techniques and the use of efficient data structures. Moreover, the problem of finding the best next view for an image-based rendering algorithm is addressed and a solution is presented that ranks possible viewpoints based on their suitability for an image-based rendering algorithm. Keywords: Image-based rendering, depth-compensated interpolation, best-next-view selection, viewpoint planning, full parallax display, multi-view display.
1 Introduction
Three-dimensional displays add an important depth cue to realistic perception and are becoming ever more popular. Hence there is an increasing demand to produce and display 3D content for the general public. So far, large-scale 3D presentation is mostly confined to 3D cinemas using glasses-based stereoscopy. Glasses-free autostereoscopic displays exist that project multiple views (typically fewer than 10), but they still have poor depth resolution and horizontal parallax only. For outdoor 3D advertisement, high-resolution large-scale displays of several square meters are called for that emit a full light field with unrestricted 3D perception to a passing audience without glasses. In this case, one needs to
produce not 10 but several thousand slightly different views that capture all details of the 3D scene. This can be achieved by a light field display that emits all possible light rays of the scene in an integral image. Each pixel of such an integral image is indeed a display element in itself, holding all possible light rays for this position in space. Such a display may easily need several Gigapixel per square meter, and each display element consists of a lens system with an associated light ray image. Currently, no digital version of such a display is available, although experiments with multi-projector-based systems show its feasibility [1], [2]. There is, however, a display capable of displaying 12 Gigapixel/m² for a static image by coupling a 2D lens array with a photographic film and back-lighting. The display emits a full 2D parallax light field with 50,000 light rays per lens, with the 230,000 lenses arranged in a 2D array of 1 m² with an inter-lens distance of 2 mm. Figure 1 depicts a prototype of the display (left) and a schematic of one of the display elements with a lens system that projects its lens image (right). Each lens system projects a circular light ray image with an opening angle of 40 degrees. The high-density photographic film contains the colors for the light rays, and an observer in front of the display perceives a light field with both eyes, indistinguishable from the real 3D scene.
Fig. 1. The light field display with 230,000 display elements (prototype by the company Realeyes GmbH, left) and a sketch of one display element projecting its lens image consisting of 50,000 rays (right)
As this display is targeted towards 3D outdoor advertisement, the light field to be coded into the film will be produced by the advertisement company as a 3D scene model and rendered by a high-end ray tracer for highest quality. However, ray tracing such an enormous amount of data is a challenge for the renderer. Each lens image itself holds only 50,000 pixels, but the render system needs to produce 230,000 images, one for each lens, which might well take a few months on a current multi-core desktop computer. It is possible to speed up
the rendering by using large render farms, but at the expense of very high costs. In [3] we evaluated the render performance for such a display and came to the conclusion that full ray tracing is not economically feasible, but pointed out that the image content is highly redundant as the images are very similar. This leads to the concept of ray tracing only a small subset of images and interpolating the in-between images using depth-compensated view interpolation. In this case, a speedup of up to two orders of magnitude is possible while preserving most of the image quality. In the following a complete solution is presented that describes how all views for a full parallax display can be rendered from a small set of input images and their associated depth images with an image-based rendering algorithm. It is described how the viewpoints of the input images are chosen and how the presented algorithm is accelerated by parallelization on the central processing unit and on the graphics hardware. Afterwards the results of the viewpoint selection algorithm are presented and the rendering algorithm is evaluated, comparing the different parallelization techniques. This work extends [3] and [4] by giving a more detailed insight into the image-based rendering algorithm and the viewpoint selection as well as the acceleration techniques used.
2 Prior Work
The plenoptic function can be used to describe all light that exists within a scene. It has been introduced by Adelson and Bergen [5] and describes the direction (θ, φ) and the wavelength (λ) of all light rays passing through every position (Vx, Vy, Vz) = V in the scene in dependence of the time t:

P = P(θ, φ, λ, t, Vx, Vy, Vz) = P(θ, φ, λ, t, V).    (1)
For static scenes and under constant illumination the plenoptic function can be reduced to the five-dimensional function

P5D = P5D(θ, φ, Vx, Vy, Vz) = P5D(θ, φ, V),    (2)
which characterizes a full panoramic image at the viewpoint V. McMillan and Bishop [6] linked image-based rendering to the plenoptic function and came to the conclusion that a unit sphere centered around the viewpoint would be the most natural representation. However, they used a cylindrical projection for their representation because it simplified their correspondence search and could easily be unrolled onto a planar map. The light field of a scene can be reduced to a 4D function under the limitation that the radiance of a light ray does not change while passing through the scene. Ashdown [7] used this representation to simplify the calculation of illuminance and pointed out its suitability for ray tracing. With this representation it is not possible to describe transparent surfaces or occlusions, but the direct illuminance can be predicted without knowledge about the geometry of the scene. Levoy and Hanrahan [8] introduced the two-plane parameterization of the 4D function and the idea of rendering new viewpoints from this representation.
Their parameterization is well suited for the fast rendering of viewpoints and does not depend on the geometry of the scene. A drawback of their algorithm is that very dense sampling is needed to avoid ghosting due to an unmodeled scene geometry. Gortler et al. [9] also used the two plane parameterization for rendering of novel viewpoints but introduced a depth correction by a 3D model to reduce rendering artefacts at object boundaries. For representation of the model they chose an octree, as described by Szeliski [10]. Another approach to incorporate geometric support has been made by Lischinski and Rappoport [11]. They introduced the Layered Light Field (LLF) that incorporates parallel layered depth images (LDI), surface normals, diffuse shading, visibility of light sources and the material properties at the surface point. That way they were able to render images of synthetic scenes that correctly handled view-dependent shading and reflections. A detailed summary of image-based rendering techniques and image-based view synthesis can be found in Koch and Evers-Senne [12]. One important issue is the rendering speed for efficient view interpolation. Halle and Kropp [13] used the technically mature rendering of OpenGL for image rendering for full parallax displays. This way they took advantage of the parallelization on the graphics hardware with a minimum effort and full compatibility with the graphics toolkits that are based on OpenGL. Dietrich et al. [14], [15] achieved a remarkable acceleration in real time ray tracing by sample caching and reuse of shading computations exploiting frame-to-frame coherence. Similar to texture MIP map levels they cached computation results on different levels of the hierarchical space partition for reuse. An important consideration for image-based interpolation is the selection of proper reference images as key frames for the interpolation. This leads to the problem of view selection and view planning. Best-next-view problems find applications in many real world scanning scenarios ranging from large scale urban model acquisition [16] to automated object reconstruction [17]. One of the first applications of viewpoint planning was to calculate the optimal position for feature detectability. Tarabanis et al. [18] give an overview over viewpoint planning algorithms and introduce a classification for view planning algorithms dividing them into algorithms that follow the generate-and-test paradigm and the synthesis paradigm. The generate-and-test approach is characterized by a discretized viewing space that reduces the number of possible viewpoints to be evaluated. In contrast the synthesis approach puts constraints in form of an analytic function on the best next view, which ensure that the derived viewpoints satisfy these constraints. Werner et al. [19] used dynamic programming to select the optimal set of input images from a set of reference views for one degree of freedom. In addition to the visibility criterion Massios and Fisher [20] introduced a best-next-view selection algorithm that utilizes a quality criterion as well. A common quality criterion for best-next-view selection algorithms is the correlation between the captured surface normal and the viewing angle (Wong et al. [21]). Automatic camera placement has many applications in synthetic [22] and real world scenarios [23]. A comprehensive review of view planning algorithms for automated object reconstruction algorithms can be found in [24].
3 System Overview
Although most of the effects inherently available in ray tracers can be simulated on the graphics pipeline of GPUs, e.g., reflections by environment maps, advertisement companies produce their content almost exclusively with modeling tools using ray tracing. The reason might be that they do not want to limit their design possibilities in advance. Another reason might be that many effects that are readily available on ray tracers are much harder to achieve when rendering on the graphics pipeline of a GPU. Therefore, rendering on the graphics pipeline of a GPU would require more technical training of the artists and would also increase the time to design a 3D model. In this contribution we present an approach for time-efficient rendering of dense high-resolution light field images from a modeled 3D scene. As discussed before, full ray-tracing of all lens images for the light field display is not feasible as it would take several months to render. Instead, we propose to ray trace only a small set of reference lens images, and exploit depth-compensated view interpolation for all other. Hence, we need to define which of the display elements carry most information and are to be selected as reference views. These images are ray traced with full resolution. For all other lens images, we merely render a subset of all depth images from the 3D model, which are obtained with marginal costs. The input of our interpolation algorithm is a subset of all depth images and few selected reference lens images, which constitutes a convenient generic interface to all possible 3D modeling tools. Hence the proposed algorithm is independent of the modeling tool used for content creation, and can even be used with real world scenes as long as depth and color can be obtained by existing range and color cameras, although the presented algorithm is not intended for that purpose. In order to minimize the number of input images a very exact geometry is required making the algorithm sensitive to errors in the geometry that are introduced by range sensors and camera calibration. Image-based rendering algorithms that are robust against errors in the geometry or don’t use geometry at all, e.g., see [8], compensate the missing geometry by taking a high number of light field samples, which contradicts our goal to minimize the number of input images. Figure 2 sketches the proposed system. Depth images are obtained from the 3D model directly and a visual-geometric octree representation (the Light Field Nodes LFNs) of the visible scene surfaces is produced. The best-next-view selection is driven by evaluating the visible surfaces of the scene and a ranking of the most important reference lens images is obtained. These images are fully ray traced and view-dependent surface color is added as a quad tree for each LFN. Finally, all remaining lens images are rendered directly from the LFNs by a simple ray cast and interpolation of the view dependent color information. The proposed image-based rendering algorithm can be divided into four stages. In the first stage the geometric model of the scene is built using depth images of the scene. After the depth images are added to the scene the geometric model is refined. In the second stage all viewpoints of a discrete viewpoint area are evaluated for their suitability as reference images for the proposed image-based
Fig. 2. Overview of the proposed algorithm
rendering algorithm, incorporating criteria like the size of the display and the distance to the display's zero plane. In the third stage the color information of the light field is assembled from the previously chosen set of input images. The final stage is the rendering of all views for the full parallax display. In the following the different stages and the image-based rendering algorithm are described in detail.
3.1 Building the Geometric Model
The first stage of the algorithm is to build a geometric model of the scene. The model is built from a set of depth images of the scene that can easily be rendered by any modeling tool. This avoids the problem of exporting the geometric model from different modeling tools and provides a convenient and generic interface. For every depth image and for every pixel of each depth image a ray cast is done in which the depth value defines the length of the ray. The result is a scene description in the form of a 3D point cloud. Figure 3 (left) depicts the 3D point representation where V is the position of the point relative to the model's coordinate system, according to the five-dimensional plenoptic function P5D (eq. 2). The sphere that is drawn around the 3D point in figure 3 (left) illustrates the storage capacity that is needed to save the different rays that may pass through the 3D point. The 3D points are stored in an octree. An octree is a hierarchical data structure that partitions 3D space and is able to handle sparse data efficiently. The root node of the octree defines a cube that encloses the complete scene. When a 3D point is inserted into the octree it gets a size assigned that defines at which hierarchical level of the octree the point will be inserted. Figure 3 (right) shows a 3D point that is placed at the center of its bounding volume. Starting at the root node, it is checked recursively in which sub-volume of the current node the point lies, and the point is passed down to that child until the volume of the next hierarchical level is smaller than the volume assigned to the 3D point. If the volume where the 3D point is inserted is empty, a new node is added; otherwise, the point is merged with the existing 3D point. The volume assigned to the 3D point depends on the distance of the point to the display's zero plane. The display's zero plane is a plane in the virtual scene,
Fig. 3. Representation (left) and bounding volume (right) of a 3D point
defined by the location of the multi-view display. The camera centers of all views are located on that plane and an observer of the display will focus on that plane of the virtual scene when focusing on the display. The size of the cube limits the geometric detail that can be described by the geometric representation. Near the display’s zero plane the size of the cubes is chosen to be very small because near the camera centers the viewing rays allow a very high spatial resolution. Far away from the display’s zero plane the viewing rays diverge, limiting the spatial resolution and allowing for a much coarser geometric model.
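As an illustration, the insertion scheme described above can be sketched as follows. This is a minimal sketch assuming a simple pointer-based octree; the type and member names are placeholders and do not reflect the authors' implementation.

#include <array>
#include <memory>

struct Vec3 { double x, y, z; };

// One octree node; a leaf stores a merged 3D point (a future Light Field Node).
struct OctreeNode {
    Vec3 center;                                   // center of the node's cube
    double halfSize;                               // half of the cube's edge length
    bool hasPoint = false;                         // leaf payload present?
    std::array<std::unique_ptr<OctreeNode>, 8> child;
};

// Insert a point whose assigned volume has edge length 'pointSize': descend
// until the next level's cube would be smaller than that size, then either
// create a new leaf or merge with the existing point.
void Insert(OctreeNode& node, const Vec3& p, double pointSize) {
    if (node.halfSize <= pointSize) {              // next level would be too small
        if (node.hasPoint) { /* merge ray/color data with the existing point */ }
        else               { node.hasPoint = true; /* store the new point */ }
        return;
    }
    // Select the sub-volume (octant) that contains the point.
    int octant = (p.x > node.center.x) | ((p.y > node.center.y) << 1) |
                 ((p.z > node.center.z) << 2);
    if (!node.child[octant]) {
        double h = 0.5 * node.halfSize;
        Vec3 c{ node.center.x + (p.x > node.center.x ?  h : -h),
                node.center.y + (p.y > node.center.y ?  h : -h),
                node.center.z + (p.z > node.center.z ?  h : -h) };
        node.child[octant].reset(new OctreeNode{c, h});
    }
    Insert(*node.child[octant], p, pointSize);
}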
Fig. 4. Overlapping field of view (left) and 3D point near the display’s zero plane (right)
The situation that 3D points are merged occurs often because of the overlapping field of view of the input images (fig. 4, left). This will also occur near the display’s zero plane as the different rays are close together near the camera center (fig. 4, right). After the depth images are added to the model the visible surface of the model is extracted. Due to the limited numerical precision of the depth images a surface of the scene may be thicker than one layer of volumetric 3D points in the octree representation (fig. 5). These additional 3D points lie directly under the surface and are occluded in all views. A 3D scene contains a huge amount
of 3D points, typically between 250,000 and 550,000. Therefore, it is beneficial to remove the occluded points, which will save memory and computation time due to the reduced complexity of the model.
Fig. 5. Quantization leading to occluded sub-surface points
In order to remove the occluded points the model is intersected with all possible viewing rays of the display. When a 3D point is intersected by a viewing ray it is marked as visible. After all viewing rays have been intersected with the model, all non-marked 3D points are removed from the geometric model of the scene. This ensures that no visible parts of the model are removed without resorting to a priori knowledge of the model's complete geometry.
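The culling pass can be illustrated as a simple mark-and-sweep over the point set; the first-hit callback stands in for the octree ray cast of the presented system, and all names below are illustrative only.

#include <algorithm>
#include <functional>
#include <vector>

struct Ray { double ox, oy, oz, dx, dy, dz; };
struct ScenePoint { bool visible = false; /* position, color data, ... */ };

// Mark every point that is hit first by at least one display viewing ray,
// then erase all points that were never marked (the occluded sub-surface points).
void RemoveOccludedPoints(std::vector<ScenePoint>& points,
                          const std::vector<Ray>& displayRays,
                          const std::function<ScenePoint*(const Ray&)>& firstHit)
{
    for (const Ray& r : displayRays)
        if (ScenePoint* p = firstHit(r))   // first intersection along the ray
            p->visible = true;
    points.erase(std::remove_if(points.begin(), points.end(),
                                [](const ScenePoint& p) { return !p.visible; }),
                 points.end());
}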
3.2 Determine the Best Viewpoints
After the model of the scene is complete a set of color input images has to be selected in order to obtain the color information of the light field. Therefore the next step is the computation of the best possible input viewpoints. In order to determine the viewpoint-ranking the possible viewpoints are restricted to the positions of the display elements, following the generate-and-test paradigm as introduced by Tarabanis et al. [18]. For the presented algorithm a best-next-view algorithm is utilized that we introduced in [4]. In this algorithm the importance of a viewpoint is weighted by the number of visible 3D points and its viewing distance to each 3D point. The viewing distance was chosen as a criterion because 3D points near the display’s zero plane contain more view dependent color information that is relevant for the display than points far away from the display’s zero plane and thus need to be sampled more often.
Fig. 6. Opening angle in relation to the distance of the display and its diagonal size
Figure 6 shows the relation between the distance d to the display’s zero plane, the diagonal size s of the display and the opening angle α for a point P located
on a line that runs orthogonally through the center of the display. The relation between the opening angle and the distance to the display's zero plane is described by

α = 2 · arctan( s / (2 · d) ).    (3)

It can be seen that the opening angle becomes smaller as the distance to the display's zero plane increases. Therefore, points far away from the display's zero plane can safely be regarded as ambient and need to be sampled less often than points near the display's zero plane, which may contain view-dependent color information as they are sampled under a large opening angle from the display. To account for this, each visible 3D point is weighted by its distance to the display's zero plane. The algorithm to calculate the best next view creates a list in which all possible viewpoints are ordered by their measured importance. This allows for an interactive selection of the number of images that are used as input for the image-based rendering algorithm. The weighting function

w(V, P, σ, f, a) := 0                                      if P is not visible from V,
w(V, P, σ, f, a) := f · e^(−‖V − P‖² / (2σ²)) + a          otherwise,    (4)
M
w(Vn , Pm , σ, f, a)
(5)
m=1
with the single 3D point Pm and m∈ {1, ..., M }. After all possible viewpoints have been evaluated a list is created, ordered by the importance of the viewpoint with the most important viewpoint on top. The depth images of the selected viewpoints are used to create a geometric model for evaluation before the color images of these viewpoints are rendered. This allows for an evaluation of the model because an insufficient number of input images may lead to visible holes in the model, caused by non-overlapping viewpoints (fig. 7, left) or occlusions (fig. 7 right). Therefore the interpolated view will expose artifacts that are caused by this incomplete geometry as shown by the dashed cameras in the figures. The effects of occlusions or self occlusion can be very complex even for a simple scene with one object, e.g., a concave object. Hence an evaluation of the geometry by examination of the geometry and the possibility to add additional viewpoints is necessary to ensure that the number of input images is sufficient. 3.3
Assembling of the Light Field
After the viewpoints have been selected the color information is added to the geometric model. Each 3D point of the geometric model has a data structure
Efficient Rendering of Light Field Images
193
Fig. 7. Hole in the geometry caused by sub-sampling (left) and by an occlusion (right)
attached that holds the view dependent color information. In the following this data structure is called a Light Field Node (LFN). There are two requirements the data structure has to fulfill that are linked to the properties of the objects in the scene. For objects that exhibit lambertian reflectance or belong to the background of the scene where the viewing angle doesn’t change significantly for the different views it is sufficient to save one color value per LFN, whereas LFNs near the display’s zero plane may need to store more than one color information per LFN, due to specular reflections. Therefore the data structure of the LFN has to handle sparse data efficiently. Another requirement is to encode the direction of the saved viewing ray and the possibility to efficiently find the angular closest viewing ray in the data structure for interpolation. The view dependent color information is saved in an image of the same properties and orientation as the camera that renders the light field for the display. In this case it is a spherical image and the direction of the viewing rays is encoded in the spherical coordinates of its position in the image. Figure 8 shows a LFN and how the ray direction is encoded in its data structure. The LFN can be imagined as a unit sphere with its center at the position of the 3D point. Therefore a viewing ray is completely described by the position of the LFN and the two angles (Φ and Θ) of the spherical coordinates. All input images for the display are rendered with a camera of a constant orientation and its viewpoint restricted to a plane. Under this condition it is advantageous to orient the LFN the same way as the camera that rendered the input images. This setup is shown in figure 9. The benefit is that it is not required to project the center of the camera to the spherical image of the LFN to get its image index and correctly encode the viewing ray’s direction, but the image index of the LFN is the same as the image index of the input image. The interpolation of all views will also benefit from this setup as the orientation of the interpolated views is the same as the orientation of the input views and all viewpoints are located on the display’s zero plane.
194
D. Jung and R. Koch
Fig. 8. Light field node and encoding of the ray direction in the image’s index
Fig. 9. Relation of the image index of the reference image and the image index of the LFN
Another benefit of this configuration is that the opening angle of the LFN, in which viewing rays can be saved, is the same as the opening angle of the display elements. Therefore the LFN can only save viewing rays that are in the field of view of the display elements. The second requirement is to handle the sparse data of the view dependent color information, saved by a LFN. If every LFN would save a full spherical image the memory requirement would limit the presented algorithm to about 5,000 LFNs. The scenes processed typically have 250,000 to 550,000 LFNs, which is about two orders of magnitude more. Therefore a quad tree [25] was chosen, which has several properties that are beneficial for the proposed algorithm. First, a quad tree can save sparse data efficiently allowing to save as many viewing rays as necessary per LFN, allocating memory only for saved viewing rays. Another advantage is the efficient look-up of angular closest viewing rays, which is important for both, the decision if a viewing ray has new information and should be saved in the data structure and also for the interpolation of novel views for the display. In the following it is described how the color information is saved in the LFN.
Efficient Rendering of Light Field Images
1 2 3 4 5 6 (a)
1
2
3
4
5
195
6
(b) (c)
LFN
Fig. 10. Similarity based selection of samples, saved by a LFN (one dimensional draft)
The color information is added to the geometric model by intersecting the model with the viewing rays. When a viewing ray intersects a LFN it is decided whether the viewing ray should be saved or discarded. Figure 10.a shows a draft of successive viewing rays that intersect with the LFN, reduced by one dimension for the sake of clarity. The rays are labeled in their initial intersection order with the LFN and their brightness shall represent their color. Figure 10 shows which viewing rays are saved after all input images have been processed. When the first viewing ray is processed, the LFN has no color information saved. Therefore the viewing ray is saved, denoted by a white circle in the figure. When the second ray is intersected with the LFN it has the same color as the angular closest ray and the viewing ray is discarded. Due to this reason all following viewing rays are discarded until the last viewing ray is processed, which has a different color than the first viewing ray and is therefore saved by the LFN. This approach allows to reduce the number of saved rays, which is important for lambertian surfaces of the scene or when the surface color does not change within the viewing angle. A drawback is that the interpolation between the angular closest viewing rays during view interpolation will introduce errors for the reconstructed viewing rays between the saved samples, as depicted in figure 10.b. This problem is solved by adding the input images to the geometric model a second time. This time the viewing rays will be intersected with the LFN in reversed order. The viewing ray number six will not be saved because it was saved in the first run. When viewing ray number five is processed the angular closest ray is viewing ray number six. Therefore the color dissimilarity is detected and the viewing ray is saved by the LFN. The following viewing rays are not saved because of their color similarity to the angular closest rays. Figure 10.c shows the saved color values after the second run. Practice has shown that it suffices to add the input images three times in alternating reversed order as there are very few viewing rays saved after the third run. 3.4
Rendering
After the color information has been added to the model all views are rendered for the display. This is done by ray intersection of all viewing rays with the geometric model. At the first intersection of a viewing ray with a LFN the
196
D. Jung and R. Koch
intersection test is canceled and the color value of the viewing ray is interpolated by that LFN. The color is interpolated from angular closest views, as described earlier in section 3.3. The intersection of viewing rays with the geometric model is a central part of the presented algorithm that is used by all of the different stages. Therefore it is worthwhile to pay particular attention to the parallelization and implementation of the ray cast. The data structures have been chosen because of their hierarchical partition of space that allow for efficient ray intersection and look-up of the angular closest viewing rays. Another benefit is the inherent ability to handle sparse data sets. In the following the different approaches to accelerate the ray cast are described and compared with regards to the used data structures.
4
Efficient Ray Cast Implementation
The ray cast has been implemented using different approaches. The reference implementation is implemented single threaded on the central processing unit utilizing only one of the available kernels. Listing 1.1 shows the pseudo code for rendering of an image via ray cast. Every viewing ray of the display element is cast and its intersection with the model is calculated. The first LFN that is intersected by the viewing ray is then used to, e.g., interpolate the color from the angular closest color samples of that LFN. Listing 1.1. Pseudo code for ray casting of an image
for (every row y) {
  for (every column x) {
    ray = MakeRay(y, x)
    LFN = ocTree->Intersect(ray)
    outPixel[y][x] = LFN->InterpolateColor(ray)
  }
}

Ray casters are particularly suited for parallelization, as all viewing rays are processed independently, although there are implementations which exploit image and object space coherence [26].
4.1 Multi-threaded Implementation on the Central Processing Unit
The parallelization on the central processing unit (CPU) has been implemented using Open Multi-Processing (OpenMP), which offers an easy way to utilize multi-core processors. For rendering of the interpolated images the parallelization is straight forward because access on the data structure is read only. Listing 1.2 shows how OpenMP can be used to parallelize the outer rendering loop to take advantage of the multi-core processor.
Listing 1.2. Parallelization of the outer rendering loop with OpenMP
#pragma omp parallel for schedule(dynamic)
for (every row y) {
  ...
}

The other parts of the algorithm that utilize the ray cast modify the data structures and therefore need a locking mechanism when executed in parallel. The operations that need to be made thread safe are the insertion of new points and the modification of existing nodes. The locking mechanism is implemented by an OpenMP lock and each node of the octree is given a lock to guarantee exclusive access. An OpenMP lock can only be held by one thread at a time; if more than one thread attempts to acquire the lock, all but the thread that holds the lock are paused. When the lock is released by the thread that holds it, the next thread can acquire it.
Fig. 11. Insertion (left) and modification (right) of an octree’s node
The procedure of inserting a new node into the octree is depicted in figure 11 (left). The node that should be inserted is dyed white and connected with its parent node by a dashed line. New nodes are either inserted as leafs to hold a LFN or as inner nodes for hierarchical space partitioning. To guarantee that the thread that inserts a node has exclusive access to the data structure the parent node is locked. Then, the new node is created and locked by the thread, which holds the lock of the parent node. The new node is attached to its parent node and both locks, which are held by the thread, are released. Figure 11 (right) shows a situation where the node or its content should be modified by a thread. The node that should be modified is dyed white. Then, the node is locked, its content modified and afterward the lock is released by the thread. The data structure of the LFN is attached to the octree as content and is thread safe because access to the content of the octree is guaranteed to be exclusive by its locking mechanism. Accelerating the rendering with OpenMP on the central processing unit is straight forward and takes full advantage of the hierarchical data structures but is limited by the number of available CPUs. In order to further accelerate the algorithm an approach has been implemented to accelerate the intersection of the viewing rays of the display with the geometric model on the graphics hardware.
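The per-node locking can be sketched with OpenMP locks. The node layout is simplified and all names are illustrative, not the authors' code.

#include <omp.h>

struct Node {
    omp_lock_t lock;                // one OpenMP lock per octree node
    Node*      child[8] = {};
    /* LFN payload ... */
    Node()  { omp_init_lock(&lock); }
    ~Node() { omp_destroy_lock(&lock); }
};

// Attach a freshly created child under 'parent' in octant 'i'.  The parent
// is locked first, the new node is created and locked by the same thread,
// linked in, and then both locks are released (cf. fig. 11, left).
Node* InsertChild(Node* parent, int i) {
    omp_set_lock(&parent->lock);
    // (a full implementation would re-check child[i] here)
    Node* node = new Node();
    omp_set_lock(&node->lock);
    parent->child[i] = node;
    omp_unset_lock(&node->lock);
    omp_unset_lock(&parent->lock);
    return node;
}

// Modify a node's content under its own lock (cf. fig. 11, right).
template <class Fn>
void ModifyNode(Node* node, Fn&& update) {
    omp_set_lock(&node->lock);
    update(*node);                  // e.g. merge a new color sample
    omp_unset_lock(&node->lock);
}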
4.2 Implementation on the Graphics Hardware
Halle and Kropp [13] introduced an algorithm for efficient rendering of images for full parallax displays that takes advantage of conventional graphics hardware and OpenGL. In order to achieve this goal the rendering process had to be adapted to fulfill the special requirements of image rendering for full parallax displays. We extend their work by combining their efficient rendering on conventional graphics hardware for full parallax displays with our image-based rendering algorithm. In the following the rendering algorithm will be discussed in detail and it is described how it is used to accelerate our algorithm.
Fig. 12. Viewing frustum of a display element of a full parallax display (sketch in 2D)
The viewing frustum for one display element of a full parallax display is depicted in figure 12. The image for a display element is rendered by placing a camera into its position. Then a ray is cast for every viewing ray of the display element, starting in front of the display at a defined distance to the display’s zero plane. At the position of the display element, the camera center, the foreground geometry is point reflected and, therefore, appears upside down in the image of the display element with left and right inverted in the resulting image. The ray ends at the specified maximum distance behind the display’s zero plane. If we assume that a viewing ray doesn’t change along its path the intersection of the viewing ray with the model is canceled at the first intersection point, marked by the white cross in figure 12. In order to render content for the full parallax display with the OpenGL library the algorithm introduced by Halle and Kropp divides the viewing frustum of the display element into two parts. The first part is behind the display plane and can be rendered directly with OpenGL (see fig. 13, right). In the OpenGL camera model the near clipping plane can not be set to the camera’s center. To avoid this limitation the camera center is slightly shifted backwards along the optical axis in order to place the near clipping plane on the display’s zero plane. According to Halle and Kropp this imposes the drawback of small image distortions and a potentially coarse depth resolution due to the small distance of the near clipping plane to the camera center. They concluded that these problems have very little impact on the final image and the coarse
Fig. 13. Viewing frustum in front of (left) and behind (right) the display’s zero plane
depth resolution could be avoided by separating the viewing frustum into smaller pieces. The second part is the rendering of those parts of the scene, which are in front of the display’s zero plane. Therefore the camera is rotated to face the opposite direction and shifted backwards along its optical axis such that the near clipping plane lies at the display’s zero plane. Figure 13 (left) shows the frustum of one display element that lies in front of the display. In order to achieve the same rendering behavior as in figure 12 the intersection test has to choose the intersection that has the greatest distance to the camera center, marked by a white cross in figure 13 (left), as opposed to the closest intersection, marked by a black cross. In OpenGL this can be achieved by setting the function that compares the depth values to prefer greater values and by clearing the depth buffer to zero (see listing 1.3). Listing 1.3. Adjustment of the depth order for rendering of pseudoscopic images with OpenGL
glDepthFunc(GL_GREATER);
glClearDepth(0.0);

The surfaces that are visible for the camera lie on the back side of the objects. Hence it is important to turn back face culling off or to set OpenGL to cull front faces of objects. Halle and Kropp did several adjustments to correct the lighting for rendering of the pseudoscopic image, which will not be discussed here as the presented rendering algorithm does not depend on it. For the remainder of this section the part of the scene that lies in front of the display's zero plane is called front geometry and the part that lies behind the display's zero plane is called back geometry. The scene is divided by the display's zero plane into the front and back geometry that are rendered separately. The front geometry will be rendered to produce a pseudoscopic image, whereas the back geometry will be rendered orthoscopic. Every LFN of the back geometry is now assigned a unique identification number that is encoded as a 24-bit color. This mapping between the LFN and a unique color will be used to render the
geometric model and identify the LFN that is hit by the viewing ray. The color black is assigned to the background of the scene and is used to identify viewing rays, which have no intersection with the geometric model. Each LFN belongs either to the front geometry or the back geometry. Therefore, the mapping of the front geometry of the scene to a color is independent of the back geometry’s mapping and color values can be reused. For rendering each LFN is represented by a quad in the associated color. The front and back geometry are rendered separately to different textures with the previously described modifications of the rendering. In a last rendering pass, both images are combined. This is done by texturing the rendered images on a viewport aligned quad. By rendering with an orthographic projection it is ensured that the pixels of both textures map exactly on the viewport of the final image. The pseudoscopic front geometry image is side-corrected via separate texture coordinates that are mapped onto the quad. Both images are combined by a fragment shader. Those parts of the front geometry’s image that have the black background color are filled with the values of the back geometry’s image. The result is a pseudo-colored image of the scene where the intersections of the viewing rays with the geometric model are obtained and encoded in the mapping of the color to the LFNs. The final image is generated on the CPU in parallel by identifying the intersected LFNs and by interpolating the color value from the angular closest color information that is saved by the LFN. So far, rendering is limited by the camera model of OpenGL. Some displays may require a special lens model for content generation, which is needed to sample the artificial scene with the viewing rays of the display. In the following it is described how a vertex shader can be used to simulate a fisheye lens that is used to project the different viewing rays of a display element of the full parallax display. The idea is to use a vertex shader to move the vertices to a position where the image rendered with OpenGL equals an image taken with a fisheye camera. This modification has to be done in dependence of the viewpoint because this lens effect is achieved by modifying the geometric model of the scene. First, the modelview matrix is set to the display element’s position and the projection matrix is set to identity. Hence the vertex position can be obtained by the optimized GLSL function ftransform().1 Afterwards the vertex position (x, y, z) is transformed into spherical coordinates by θ = arctan
( √(x² + y²) / z )    and    φ = arctan( y / x ).    (6)
The complete source code for the vertex shader can be found in the appendix (listing 1.4).
The field of view (fov) of the fisheye camera is accounted for and the new vertex position (x′, y′, z′, w′)ᵀ is assigned by

(x′, y′, z′, w′)ᵀ = ( (θ / fov) · cos(φ),  (θ / fov) · sin(φ),  (z + zNear) / (zFar − zNear),  1 )ᵀ .    (7)
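Equations (6) and (7) can be written out directly. The following C++ sketch mirrors what the GLSL vertex shader computes, with fov in radians and zNear/zFar as the clipping distances; it is an illustrative sketch, not the listing 1.4 referred to above.

#include <cmath>

struct Vec4 { double x, y, z, w; };

// Remap an eye-space vertex position (x, y, z) to the fisheye image
// position of eqs. (6) and (7).
Vec4 FisheyeRemap(double x, double y, double z,
                  double fov, double zNear, double zFar) {
    const double theta = std::atan2(std::sqrt(x * x + y * y), z);  // eq. (6)
    const double phi   = std::atan2(y, x);
    Vec4 out;
    out.x = (theta / fov) * std::cos(phi);                         // eq. (7)
    out.y = (theta / fov) * std::sin(phi);
    out.z = (z + zNear) / (zFar - zNear);
    out.w = 1.0;
    return out;
}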
5 Results
In the following the proposed algorithm is evaluated and the results are shown. First, the results obtained by the best-next-view selection algorithm are discussed. This is followed by an evaluation of the interpolated views against ground truth, and finally the runtimes of the different acceleration techniques are compared.
5.1 Best Next View Selection
Figure 14 (left) shows an overview of the synthetic scene that was used for evaluation. For the best-next-view selection algorithm it is sufficient to use a reduced data set where every fourth row and column of the full data set is used. Figure 14 (left) was created by a simulation of this reduced display where the viewpoint was centered about 60 centimeters in front of the display. In dependence of this viewpoint the view dependent color information was computed for each display element, which yielded the final image. The reduced data set consists of 14,400 lens images with a resolution of 512x512 pixel. Figure 15 (left) shows the contribution of each input image to the geometry of the scene where only newly inserted 3D points are counted. In figure 15 (right) the overall completeness of the model is shown by integration over the contribution of the images. In figure 15 (left) can be seen that the first viewpoint adds about nine percent of the geometry to the model. Around the
Fig. 14. Simulated display overview with the viewpoint centered about 60 centimeters in front of the display (left) and position of the top ranking 393 viewpoints (right). The darker a viewpoint is colored the more important it is ranked.
100th input image the contribution of new 3D points to the model’s geometry is only marginal. However, in figure 15 (right) it can be seen that after the first 100 images are added, the obtained geometry is incomplete and covers only 70 percent of the scene. This is due to the many marginal contributions of single images that add up to about 30 percent of the complete geometry. For the complete geometry the first 4,762 input images are required, which is about 33 percent of the input image set. However, even with only 1,000 reference images, which is about 7 percent of the reduced image set, no visible artifacts can be seen and a high quality reconstruction can be achieved.
Fig. 15. Contribution of the single input images to the geometry of the model (left) and overall completeness of the model in dependence of the number of input images (right). The axes of abscissae are in logarithmic scale.
The final result of the best-next-view algorithm is pictured in figure 14 (right). The perspective is the same as in the left figure. Therefore, the selected viewpoints can be directly related to the overview. The importance of a viewpoint is represented by its brightness with the most important viewpoint set to black. Viewpoints with a minor contribution to the scene’s geometry have been colored white for the sake of clarity. The two viewpoints that are rated most important are in the upper left corner of the display followed by the lower right corner of the display. Both viewpoints cover the background, which has a larger surface than the foreground. Therefore, the majority of the 3D points that are required to describe the geometry of the scene belong to the background, which makes those viewpoints most important for the geometric completeness. The second rated viewpoint lies in the opposite display corner of the most important viewpoint where most of the background regions could be sampled, which were occluded in the first viewpoint. In general it can be seen that the foreground on the mask is sampled more often than the background of the scene. This is because the weighting function is set to prefer foreground objects as they are more relevant for view dependent color information. Another reason is that near the display’s zero plane sampling points must be closer for an overlapping field of view or holes in foreground objects may occur.
Fig. 16. Average difference (AD) and peak signal to noise ratio (PSNR) of the simulated overview (reduced data set) from the reconstructed images compared to the simulated overview from the ground truth images. The axis of abscissae is in logarithmic scale.
The average difference (AD) and the peak signal to noise ratio (PSNR) have been measured by comparing a simulated overview of the display from the reconstructed images against a simulated overview from the ground truth images on the reduced data set. First, the mean squared error (MSE) is calculated by
MSE = (1 / (X · Y)) · Σ_{x=1}^{X} Σ_{y=1}^{Y} [ ( (i^R_{x,y} − i′^R_{x,y}) + (i^G_{x,y} − i′^G_{x,y}) + (i^B_{x,y} − i′^B_{x,y}) ) / 3 ]²    (8)
taking the sum of the squared differences over all X · Y pixels between the ground truth image i and the interpolated image i′ and averaging the difference over the color channels (RGB). Afterwards the PSNR is calculated by

PSNR = 10 · log( 255² / MSE ).    (9)
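For completeness, eqs. (8) and (9) as straightforward code over two RGB images stored as interleaved 8-bit buffers; the memory layout is assumed for illustration, and the logarithm is taken as log10, the usual convention for PSNR.

#include <cmath>
#include <cstdint>
#include <vector>

// MSE of eq. (8): per pixel, average the per-channel differences between
// ground truth and interpolation, square the result, and average over all
// X*Y pixels.
double MeanSquaredError(const std::vector<std::uint8_t>& truth,
                        const std::vector<std::uint8_t>& interp,
                        int X, int Y) {
    double sum = 0.0;
    for (int p = 0; p < X * Y; ++p) {
        double channelDiff = 0.0;
        for (int c = 0; c < 3; ++c)   // R, G, B
            channelDiff += static_cast<double>(truth[3 * p + c]) - interp[3 * p + c];
        const double d = channelDiff / 3.0;
        sum += d * d;
    }
    return sum / (X * Y);
}

// PSNR of eq. (9).
double PeakSignalToNoiseRatio(double mse) {
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}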
The results are depicted in figure 16. The evaluation has been performed on the top ranking 50, 100, 1,000, 1,500 and 2,000 images from the best-next-view algorithm. As expected, the average difference is drastically reduced when the number of input images is increased. In the reconstructed images the object boundaries are one to two pixel off due to aliasing, which results in a relatively low peak signal to noise ratio. Figure 17 and 18 show close up views of the foreground area comparing the simulated display overview from reconstructed images against a simulated overview from the ground truth images using 104 input images (fig. 17) and 144 input images (fig. 18). In each case the number of input images was the same and all images of the reduced data set were interpolated with the same IBR
algorithm. The left result in each case was obtained by using regularly sampled viewpoints for the interpolation and the right result was obtained by using the same number of top-ranking viewpoints from the best-next-view algorithm.
Fig. 17. Close up of the foreground area of a simulated display overview (reduced data set) from reconstructed images using 104 input images. Comparison of regular sampled viewpoints (l) against the same number of top ranking viewpoints from the best-next-view algorithm (r).
In Figures 17 and 18 it can be seen that with regular sampling the reconstruction with 104 and 144 input images lacks parts of the geometry near the display's zero plane due to non-overlapping fields of view and occlusions, leading to severe artifacts in the reconstruction. When the top-ranking viewpoints were used for image reconstruction the geometry of the foreground is much more complete. Although there are still holes in the geometry of the foreground object, the foreground geometry is almost complete using 144 input images, resulting in fewer artifacts compared to the reconstruction from regularly sampled viewpoints. Good image interpolation results can be achieved with less than 100 percent of the geometry and without manual user interaction by the proposed solution. For premium high-quality rendering, additional reference views can be progressively added from the ranking obtained by the best-next-view selection algorithm. The result can be interactively evaluated on the geometric model and further reference views can be added until a satisfying result is obtained. In the following, interpolation results are shown and the interpolation is evaluated against ground truth images.
5.2 Interpolation Results
The interpolation algorithm was evaluated on foreground and background parts of the scene. This evaluation was performed on the full data set with about 0.5 percent of the viewpoints selected as reference views. The reference images, the interpolated images and the difference images between the original and the interpolated images are shown in figures 19-21. The numbers in the lower right
Fig. 18. Close up of the foreground area of a simulated display overview (reduced data set) from reconstructed images using 144 input images. Comparison of regular sampled viewpoints (l) against the same number of top ranking viewpoints from the best-next-view algorithm (r).
of the different views are used to identify the discrete viewpoints of the display in relation to the nearest reference views; e.g., between the reference viewpoints 00 and 89 the 88 viewpoints in between were not used as input images for the interpolation. The nearest neighboring reference images of the interpolation for the background are shown in figure 19 and the nearest neighboring reference images of the foreground are depicted in figure 20. For the foreground area the reference viewpoints are close together to avoid subsampling of the geometry. Another reason is that more light field samples are needed in the foreground area to capture the specular reflections of the golden ornamentation of the mask. Figure 21 (left and right) shows the results of the interpolation. The top row shows the ground truth input images as rendered by the modeling tool. In the center of the figures are the interpolated views and at the bottom are the difference images of the ground truth images and the interpolated ones. The difference images show that most interpolation errors are located at object boundaries. This is because at object boundaries the depth images either belong to the foreground or the background and no anti-aliasing can be applied. In contrast, the color images are rendered with anti-aliasing, which leads to a color transfer from the foreground objects to the background and vice versa. Another reason lies in the quantization of the octree in 3D space, which can lead to color blending during angular view interpolation on textured surfaces.
5.3 Runtime
The proposed algorithm has been evaluated on an Intel Core i7-950 CPU with a Nvidia GeForce GTX 460 graphics card. The different implementations and parallelization approaches have been compared for the initialization and the rendering phase. The initialization phase consists of loading the geometric model and assembling of the light field. During the rendering phase all images of the display are interpolated. The results are listed in Table 1.
Fig. 19. Reference images of the background area with sparse sampling. The depth images are scaled by 255.
Fig. 20. Reference images of the foreground area with dense sampling. The depth images are scaled by 8355.
Fig. 21. Ground truth (upper row), interpolated (central row) and difference image (lower row) of the background (left) and the foreground (right)
Table 1. Comparison of the runtime for the different implementations and the average single image rendering time. The example scene has about 380,000 LFNs and 1,000 out of 14,400 input images were used for the interpolation.

                      Initialization [s]  Acceleration  Rendering [s]  Acceleration  Acc. parallel.
Ray tracing
  CPU (8 Threads)     -                   -             84.006         1.00          -
Interpolation
  CPU (1 Thread)      8,610.00            1.00          1.722          48.78         1.00
  CPU (8 Threads)     2,041.00            4.22          0.403          208.45        4.27
  GPU                 750.00              11.48         0.210          400.03        8.20
Ray tracing of an image for the display takes on average 84 seconds when rendered with the modeling tool. When only the image rendering times are compared, single threaded interpolation already achieves an acceleration factor of 48.78 compared to ray tracing of an image. The parallelization on the central processing unit using OpenMP further accelerates the rendering by a factor of 4.27. The implementation on the graphics hardware achieved an acceleration factor of 8.20 compared to rendering with a single thread only. This relatively small acceleration factor is because the graphics hardware takes no advantage of the hierarchical data structure of the octree.
Fig. 22. Rendering time in dependence of the number of LFNs on the graphics hardware
Figure 22 shows the mean rendering time for an image as a function of the number of LFNs. The results were obtained on an Intel Core i7-920 CPU with an Nvidia GeForce GTX 285 graphics card. The straight line in Figure 22 was fitted to the data points by a least-squares method. The rendering time correlates linearly with the number of LFNs that represent the scene, because all vertices representing the LFNs are processed by the graphics hardware and visibility is determined by comparing depth values. Hence, the graphics hardware does not benefit from the hierarchical data structure, and an early abort of viewing rays is not possible.

The acceleration figures for the reduced data set given in Table 1 do not take into account the rendering time of the input images or the viewpoint selection. Table 2 shows the effective acceleration that can be achieved for a full-sized display with 230,400 display elements. The upper part of the table lists the rendering time when all images for the display are ray traced on a single workstation. The center part lists the individual stages required for the interpolation. First, the depth images are ray traced with the modeling tool; for the best-next-view selection algorithm it is sufficient to sample only every fourth row and column of the data set. As long as no parts of the scene lie inside or extremely close to the display's zero plane, this reduced data set contains enough geometric detail for proper viewpoint selection. The viewpoints are then ordered by the best-next-view selection algorithm, and the chosen viewpoints are ray traced with the modeling tool. Finally, the interpolation is initialized and all images of the display are interpolated; for this step the algorithm implemented on the graphics hardware was chosen. The lower part of Table 2 sums all steps, including the rendering time of the input color and depth images. The effective acceleration factor is mainly determined by the ratio of interpolated images to the input images required for the interpolation; a small worked example after the table makes this explicit. For a full-sized display, the proposed algorithm achieves an effective acceleration factor of almost 100.

Table 2. Effective acceleration of the rendering compared to ray tracing (RT) of all color images for the full data set

Rendering method                          Number of images   Time [h]   Acceleration
Ray tracing             RT (8 Threads)    230,400 color      5,376.38           1.00
Initialization          RT (8 Threads)     14,400 depth          4.00              -
                        GPU (BNV†)         14,400 depth          5.30              -
                        RT (8 Threads)      1,000 color         23.34              -
View interpolation      GPU (rendering)   230,400 color         22.19         242.29
Effective acceleration  GPU (summation)   230,400 color         54.83          98.06

† BNV: Best Next View selection
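The effective acceleration in the last row of Table 2 is simply the ratio of the time needed to ray trace all display images to the summed time of all interpolation stages. The following short C++ calculation is only illustrative; it reproduces the factor of 98.06 from the stage times listed in Table 2.

#include <cstdio>

int main()
{
    // Stage times in hours, taken from Table 2.
    const double rayTracingAll     = 5376.38;  // RT of all 230,400 color images
    const double depthImages       = 4.00;     // RT of 14,400 depth images
    const double bestNextView      = 5.30;     // best-next-view selection on the GPU
    const double inputColorImages  = 23.34;    // RT of the 1,000 selected color images
    const double viewInterpolation = 22.19;    // GPU interpolation of all 230,400 images

    const double pipelineTotal = depthImages + bestNextView
                               + inputColorImages + viewInterpolation;   // 54.83 h

    std::printf("pipeline total:         %.2f h\n", pipelineTotal);
    std::printf("effective acceleration: %.2f\n", rayTracingAll / pipelineTotal);  // 98.06
    return 0;
}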
6 Conclusion
Full parallax displays require a very large number of images. Ray tracing all of these images on a single workstation would take several months; the rendering could be accelerated with a render farm, but only at considerable cost. A complete image-rendering solution for full parallax displays has been presented that depends solely on depth images and a few ray-traced input images, making it independent of the modeling tool. The problem of finding the best viewpoints for the input images has been addressed, and a solution has been integrated into the rendering system that allows for a considerable reduction of input images while ensuring completeness of the geometric model. The rendering benefits from efficient data structures that are well suited to processing very large amounts of sparse data. The presented algorithm has been accelerated by parallelization on both the central processing unit and the graphics hardware. This reduced the rendering time for all images from several months to a few days on a single workstation, an effective acceleration of two orders of magnitude.

The evaluation of the geometric model and the final decision on how many input images to use for the interpolation are currently aided by user interaction. Future work will address the integration of a criterion that measures the impact of missing geometry on the view interpolation. This could further reduce the number of input images required for the interpolation and thus lead to an even higher effective acceleration of the rendering.

Acknowledgments. This project was funded by the Federal Ministry of Education and Research under the support code 16 IN 0655. The sole responsibility for the content of this publication lies with the authors. Funded by the BMWi due to a directive of the German Parliament.
Appendix

Listing 1.4. GLSL Vertex shader for simulating a fisheye lens
uniform float zNear;
uniform float zFar;

void main()
{
    vec4  pos;
    float rxy;
    float p;
    float t;

    // Pass the vertex color through unchanged.
    gl_FrontColor = gl_Color;

    // Apply the fixed-function vertex transformation.
    pos = ftransform();

    // Polar coordinates of the vertex around the optical axis.
    rxy = sqrt(pos.x * pos.x + pos.y * pos.y);
    p   = atan(pos.y, pos.x);

    // Angle to the optical axis, normalized so that 20 degrees maps to 1.
    t  = atan(rxy / pos.z);
    t /= 0.3490658504;   // 20.0 deg * PI / 180

    // Equidistant fisheye mapping of the normalized angle.
    pos.x = t * cos(p);
    pos.y = t * sin(p);

    // Linear depth mapping between the near and far clipping distances.
    pos.z = (pos.z + zNear) / (zFar - zNear);
    pos.w = 1.;

    gl_Position = pos;
}
Author Index
Ballan, Luca 77
Brostow, Gabriel J. 77
Cremers, Daniel 104
Darabi, Soheil 152
Eisemann, Martin 1, 25
Favaro, Paolo 124
Heydt, Matthias 52
Jung, Daniel 184
Klose, Felix 1
Koch, Reinhard 184
Leal-Taixé, Laura 52
Magnor, Marcus 1, 25
Martinello, Manuel 124
Oswald, Martin R. 104
Pollefeys, Marc 77
Puwein, Jens 77
Rosenhahn, Axel 52
Rosenhahn, Bodo 52
Rother, Carsten 104
Sellent, Anita 25
Sen, Pradeep 152
Taneja, Aparna 77
Töppe, Eno 104
Xiao, Lei 152