Selected Readings in Vision and Graphics Volume 60 Daniel Eugen Roth
Real-Time Multi-Object Tracking Diss. ETH No. 18721
Hartung-Gorre
Diss. ETH No. 18721
Real-Time Multi-Object Tracking A dissertation submitted to the SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH
for the degree of Doctor of Sciences ETH
presented by DANIEL EUGEN ROTH MSc ETH in Electrical Engineering and Information Technology born 28th January 1978 citizen of Zollikon (ZH) and Hemberg (SG)
accepted on the recommendation of Prof. Dr. Luc Van Gool, ETH Zurich and K.U. Leuven, examiner Prof. Dr. Thomas B. Moeslund, Aalborg University, co-examiner
2009
Abstract New video cameras are installed in growing numbers in private and public places, producing a huge amount of image data. The need to process and analyze the data automatically in real-time is critical for applications such as visual surveillance or live sports analysis. Of particular interest is the tracking of moving objects such as pedestrians and cars. This work presents two visual object tracking methods, multiple prototype systems and an event-based performance evaluation metric. The first object tracker uses a flexible 2D environment modeling to track arbitrary objects. The second method detects multiple object classes using a camera calibration, therefore operates in 2.5D and performs more robustly in crowded situations. Both proposed monocular object trackers use a modular tracking framework based on Bayesian per-pixel classification. It segments an image into foreground and background objects based on observations of object appearances and motions. Both systems adapt to changing lighting conditions, handle occlusions, and work in real-time. Multiple prototype systems are presented for privacy applications in video surveillance and cognitive multi-resolution behavior analysis. Furthermore, a performance evaluation method is presented and applied to different state-of-the-art trackers based on the successful detection of semantic high level events. The high level events are extracted automatically from the different trackers and their varying types of low level tracking results. The general new event metric is used to compare our tracking method and the other tracking methods against ground truth of multiple public datasets.
Zusammenfassung (Summary) The growing number of cameras installed in public and private places leads to a steadily growing flood of images that cannot be analyzed, or only to a limited extent. The automatic evaluation and analysis of these videos in real-time is therefore a prerequisite for applications in video surveillance or for the live analysis of sports. Of particular interest is the tracking of moving objects such as people or cars. This dissertation describes two methods and several prototype systems for the automatic detection and tracking of objects. In addition, an evaluation and comparison method for such systems is presented, which assesses how well the systems recognize individual events. The first of the two presented object tracking methods works purely in two dimensions (2D) with arbitrary object sizes and shapes. The second method distinguishes several defined object classes. It also estimates spatial depth in a two-and-a-half-dimensional (2.5D) interpretation of the image using a single calibrated camera. This makes the tracking less error-prone when many overlapping objects share a confined space. In both tracking methods the pixels are segmented into foreground and background. Differences in the motion and appearance of the objects are interpreted as Bayesian probabilities and classified. Both tracking methods work in real-time, adapt continuously to the lighting conditions and detect occlusions between objects. Several prototype systems resulted from this work. On the one hand, systems for cognitive video surveillance at different resolution levels are presented. On the other hand, a prototype is used to investigate how privacy can be restored despite video surveillance. In addition to the systems, this work describes a suitable quantitative evaluation and comparison method. Concrete individual events with semantic meaning are filtered from the results of the object tracking methods and their detection rate is measured. Comparisons between several modern object tracking methods and human-annotated results were carried out on several well-known video sequences.
Acknowledgements This thesis would not have been possible without the invaluable support from many sides. First and foremost I thank my supervisor Prof. Luc Van Gool for giving me the opportunity to explore this fascinating field and the interesting research projects. My thanks also go to Dr. Esther Koller-Meier for guiding me with her advice and for her scientific and personal support. And I am very grateful to Prof. Thomas B. Moeslund for the co-examination of this thesis. I would like to thank all the project partners for their fruitful collaborations and for helping me build the various prototype systems.
• Blue-C II project: Torsten Spindler, Indra Geys and Petr Doubek.
• HERMES project: University of Oxford: Eric Sommerlade, Ben Benfold, Nicola Bellotto, Ian Reid. Aalborg Universitet: Preben Fihl. Universitat Autònoma de Barcelona: Andrew D. Bagdanov, Carles Fernandez, Dani Rowe, Juan Jose Villanueva. Universität Karlsruhe: Hanno Harland, Nico Pirlo, Hans-Hellmut Nagel.
Finally, I thank all my colleagues at the Computer Vision Laboratory of ETH Zurich for their support, countless table soccer matches and the informal atmosphere at the working place. In particular Alexander Neubeck, Andreas Ess, Andreas Griesser, Angela Yao, Axel Krauth, Bastian Leibe, Bryn Lloyd, Michael Breitenstein, Michael Van den Bergh, Philipp Zehnder, Robert McGregor, Roland Kehl, Simon Hagler, Stefan Saur, Thibaut Weise, Tobias Jaggli. This work was funded by the ETH Zurich project blue-c-II, the Swiss SNF NCCR project IM2 and the EU project HERMES (FP6 IST-027110). Last but not least, I want to thank my family and friends, especially my mom and dad for their love and support.
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Challenges of Visual Object Tracking
  1.2 Contributions
  1.3 General Visual Surveillance Systems
  1.4 Outline of the Thesis
2 2D Real-Time Tracking
  2.1 Introduction
    2.1.1 Related Work
  2.2 Bayesian Per-Pixel Classification
  2.3 Appearance Models
    2.3.1 Color Modeling with Mixtures of Gaussians
    2.3.2 Background Model
    2.3.3 Finding New Objects
    2.3.4 Foreground Models
  2.4 Motion Model
  2.5 Object Localization
    2.5.1 Connected Components
    2.5.2 Grouping Blobs to Objects and Adding New Objects
    2.5.3 Occlusion Handling
  2.6 2D Tracking Results and Discussion
    2.6.1 PETS 2001 Test Dataset 3
    2.6.2 Occlusion Handling Limitations
    2.6.3 Illumination Changes
    2.6.4 Computational Effort
  2.7 Application: Privacy in Video Surveilled Areas
  2.8 Summary and Conclusion
3 Extended 2.5D Real-Time Tracking
  3.1 Introduction and Motivation
  3.2 Previous Work
  3.3 2.5D Multi-Object Tracking
    3.3.1 Tracking Algorithm
    3.3.2 Iterative Object Placement and Segmentation Refinement
    3.3.3 New Object Detection
    3.3.4 Tracking Models
    3.3.5 Ground Plane Assumption
  3.4 Extended 2.5D Tracking Results and Discussion
    3.4.1 Central Sequence
    3.4.2 HERMES Outdoor Sequence
  3.5 Application: HERMES Demonstrator
  3.6 Summary and Conclusion
4 Event-based Tracking Evaluation
  4.1 Introduction
  4.2 Previous Work
    4.2.1 Tracking Evaluation Programs
  4.3 Event-Based Tracking Metric
    4.3.1 Event Concept
    4.3.2 Event Types
    4.3.3 Event Generation
    4.3.4 Evaluation Metric
    4.3.5 Evaluation Pipeline
  4.4 Experiments
    4.4.1 CAVIAR Data Set
    4.4.2 PETS 2001 Data Set
    4.4.3 HERMES Data Set
    4.4.4 Event Description
    4.4.5 Tracker 1
    4.4.6 Tracker 2a and 2b
    4.4.7 Tracker 3
  4.5 Case Study Results
    4.5.1 CAVIAR
    4.5.2 PETS2001
    4.5.3 HERMES
    4.5.4 Metric Discussion
  4.6
5 Conclusion
  5.1 2D and 2.5D Tracking Methods
  5.2 Discussion of the Evaluation Metric
  5.3 Outlook
A Datasets
Bibliography
List of Publications
Curriculum Vitae
List of Figures
1.1 General framework of visual surveillance
2.1 Tracking framework
2.2 Example for a Gaussian mixture with three RGB Gaussians with different mean, variance and weight
2.3 Constructing Gaussian mixture for a whole slice
2.4 Sliced object model
2.5 Object position priors, from the original image in Figure 2.4(a)
2.6 Two examples of the 3x3 filter masks for the noise filtering. The filter on the left removes single pixel noise. The right filter mask is one example of a filter removing noise in the case of seven uniform and two differing pixel values
2.7 The connected component algorithm finds closed regions as shown in a) and b). The grouping of these regions to final objects is shown in steps c) and d)
2.8 Eight cases of occlusion are differentiated. The hatched rectangle represents the object in front. The bounding box of the object behind is represented by solid (= valid) and dashed (= invalid) lines
2.9 PETS 2001 test dataset 3, part 1
2.10 PETS 2001 test dataset 3, part 2
2.11 PETS 2001 test dataset 3, part 3
2.12 Tracking during partial and complete occlusion
2.13 Limitation: The background model only adapts where it is visible
2.14 Computational time and the number of objects including the background model of the PETS 2001 Test Dataset 3
2.15 Multi-person tracking in a conference room
2.16 Large change in the background can lead to wrong objects
3.1 Tracking framework: The maximum probability of the individual appearance models results in an initial segmentation using Bayesian per-pixel classification. The white pixels in the segmentation refer to the generic 'new object model' described in Section 3.3.3
3.2 The segmentation is refined while iteratively searching for the exact object position from close to distant objects
3.3 Ground plane calibration
3.4 Segmentation from the Central square sequence. The different sizes of the bounding boxes visualize the different object classes. Unique object IDs are shown in the top left corner. Black pixels = background, white pixels = unassigned pixels M, colored pixels = individual objects
3.5 Tracking results from the Central square sequence. Objects are visualized by their bounding box and unique ID in the top left corner
3.6 Tracking results from the HERMES outdoor sequence
3.7 Computational effort: The blue curve above shows the computation time in milliseconds per frame. Below in red, the number of tracked objects is given
3.8 HERMES distributed multi-camera system sketch. It shows the supervisor computer with SQL database on top, the static camera tracker on the bottom left, and the active camera view to the bottom right
3.9 HERMES indoor demonstrator in Oxford. Static camera view and tracking results
3.10 New CVC Demonstrator, static camera view
4.1 Evaluation scheme
4.2 Example of a distance matrix. It shows the distances between every ground truth event (column) versus every tracker event (row)
4.3 Event matching of same type
4.4 CAVIAR OneLeaveShopReenter1cor sequence with hand-labeled bounding boxes and shop area
4.5 PETS2001 DS1 sequence
4.6 HERMES sequence
4.7 Hierarchical multiple-target tracking architecture
4.8 Multi-camera tracker by Duizer and Hansen
4.9 Frames/Event plot for the PETS sequence. Stars are ground truth events, squares from tracker 1 and diamonds show events from tracker 2a
4.10 Frames/Event plot for the HERMES sequence. Stars equal ground truth, squares equal tracker 1, diamonds equal tracker 2b and pentagrams equal tracker 3
4.11 The presented tracking method on the HERMES sequence
4.12 Segmentation on the HERMES sequence
4.13 HERMES sequence: three trackers. Only tracker 1 in the left image detects the small bags (nr. 3 and nr. 23). Tracker 2b in the center and tracker 3 omit the small objects to achieve a higher robustness for the other objects
List of Tables
2.1 Tracking framework algorithm
2.2 Computational effort of the different parts of the algorithm
3.1 2.5D tracking algorithm
4.1 Evaluation metric
4.2 Detected events of the CAVIAR sequence
4.3 Event-based evaluation of the CAVIAR sequence
4.4 Detected events for the PETS sequence
4.5 Event-based evaluation of the PETS sequence
4.6 Object-based evaluation of tracker 1 (PETS)
4.7 Object-based evaluation of tracker 2a (PETS)
4.8 Detected events for the HERMES outdoor sequence
4.9 Event-based evaluation of the HERMES outdoor sequence
4.10 Object-based evaluation of tracker 1 (HERMES)
4.11 Object-based evaluation of tracker 2b (HERMES)
4.12 Object-based evaluation of tracker 3 (HERMES)
A.1 HERMES, Central and Rush Hour dataset comparison
A.2 PETS 2001, CAVIAR and PETS 2006 dataset comparison
A.3 PETS 2007, BEHAVE and Terrascope dataset comparison
1 Introduction Tracking moving objects in video data is part of a broad domain of computer vision that has received a great deal of attention from researchers over the last twenty years. This gave rise to a body of literature, of which surveys can be found in the works of Moeslund et al. [53; 54], Valera and Velastin [74] and Hu et al. [32]. Computer-vision-based tracking has established its place in many real world applications; among these are: • Visual Surveillance of public spaces, roads and buildings, where the aim is to track people or traffic and detect unusual behaviors or dangerous situations. • Analysis of Sports to extract positional data of athletes during training or a sports match for further analysis of the performance of athletes and whole teams. • Video Editing where tracking can be used to add graphic content to moving objects for an enhanced visualization. • Tracking of Laboratory Animals such as rodents with the aim to automatically study interactions and behaviors. • Human-Computer Interfaces in ambient intelligence, where electronic environments react to the presence, action and gestures of people. • Cognitive Systems which use tracking over a longer time period in order to learn about dynamic properties. The growing number of surveillance cameras over the last decade as well as the numerous challenges posed by unconstrained image data make visual surveillance the most general application in the field. The aim of this work is to
develop an intelligent visual detection, recognition and tracking method for a visual surveillance system. It should track multiple objects in image sequences in order to answer questions such as: How many objects are in the scene? Where are they in the image? When did a certain person enter the room? Are there pedestrians on the crosswalk? This thesis presents methods and prototype systems to answer such questions in real-time. Given the challenges of the problem, the research will be restricted to real-time tracking methods with static cameras, primarily of people and secondarily of other objects such as vehicles. Furthermore, the presented methods are compared and evaluated with novel measurements and tools. Finally, the methods are designed to run live in prototype systems. Multiple such systems for visual surveillance of different scope are built and presented. One system implements an application for privacy in video surveillance, another is a cognitive multi-resolution surveillance system.
1.1 Challenges of Visual Object Tracking
Multi-object tracking poses various interconnected challenges: object detection, classification, environment modeling, tracking and occlusion handling. The approach in this thesis addresses each of these challenges with separate and specialized models or modules. The development of a modular framework and the modification of these modules directly affect the computational speed as well as the tracking performance and robustness. The framework is chosen so that results of the visual sensing first lead to a segmentation, regardless of the sensing method. The tracking between frames is formulated as a Bayesian filtering approach which uses the segmentation. The contributions of this thesis are the general framework, the special modifications to each module and their integration in order to achieve a good balance between robustness and speed.
1.2 Contributions
This thesis deals with specialized modules for real-time multi-object tracking and explores probabilistic models concerning the appearance and motion of objects. The work proposes a tracking framework, multiple real-time prototype systems and an evaluation metric to measure improvements. The contributions of the thesis are:
• Two real-time tracking methods sharing a similar tracking framework for Bayesian per-pixel classification, which combines the appearance and motion of objects in a probabilistic manner.
• Multiple real-time prototype systems for privacy in video surveillance and cognitive multi-resolution behavior analysis.
• Evaluation method to compare and measure tracking performance based on a novel event metric.
1.3 General Visual Surveillance Systems The first part of this thesis deals with the design and implementation of a visual surveillance system. Therefore, general modules of a visual surveillance system including references to prior art are sketched first. Inspired by several general frameworks in the literature, e.g. in Hu et al. [32], Moeslund et al. [54] or Valera and Velastin [74], a similar structure as the one in Hu et al. [32] is adopted. Modifications to this general structure are made due to the focus on monocular tracking by discarding any multi-camera fusion. Furthermore, a more general task of object detection and segmentation is considered instead of simpler motion segmentation. The general structure of a visual surveillance system is shown in Figure 1.1. It combines modules for Visual Observation, Object Classification, Object Tracking, Occlusion Handling and Environment Modeling. Only the Visual Observation module has direct access to the full images of the surveillance camera, although image regions or preprocessed image features are passed on to different modules in the framework. The tight connections between the modules are essential for any object tracking system to fulfill the needs of a visual surveillance system. While this framework is used throughout the thesis, first, recent developments and general strategies of all these modules are reviewed.
Figure 1.1: General framework of visual surveillance. 1. Visual Observation. The visual observation module works directly on the image data and aims at detecting regions corresponding to the objects under surveillance such as humans. Detecting the regions which are likely to contain the tracked object (e.g. by means of background segmentation methods) provides a focus of attention for further analysis in other modules such as tracking, object classification or occlusion handling. The purpose of this module is often twofold in the sense that only these regions will be considered in the later process and that intermediate computational results (e.g. color likelihoods) will be required by other modules. In the following several approaches for the visual observation module are outlined. • Background Subtraction A popular method to segment objects in an image is background subtraction. It detects regions by taking the difference between the current image and a previously learnt empty background image. In its simplest versions it literally subtracts color or grey scale values pixel-by-pixel and compares it to a threshold. More complex methods are adaptive, incorporating illumination changes as
they occur during the day. A comparison of state-of-the-art methods can be found in [24] by Hall et al. They compare W4 by Haritaoglu et al. [26], single [76] and multiple Gaussian models [69] as well as LOTS [8] regarding performance and processing time. For the test, they use sequences from the CAVIAR corpus, showing a static indoor shopping mall illuminated by sunlight. In this setting the fast and simpler methods such as single Gaussian and LOTS outperform more complex multi-modal methods such as multiple Gaussian mixtures. However, in complex outdoor settings the background is more dynamic due to wind, waving trees or water. Several methods have been proposed to deal with such multimodal distributed background pixels. For example, Wallflower by Toyama et al. [71] employs a linear Wiener filter to learn and predict background changes. Alternatively, codebook vectors are used by Tracey [40] to model foreground and background. The codebook model by Kim et al. [38] quantizes and compresses background samples at each pixel into codebooks. A comparison of these methods can be found in the work of Wang et al. [75] where they introduce a statistical background modeling method.
• Temporal Differencing This method is similar to background subtraction. But instead of computing the pixel-wise difference between the current image and an empty background, two or three consecutive frames of an image sequence are subtracted to extract moving regions. Therefore, temporal differencing does not rely on an accurate maintenance of an empty background image. On the contrary, it is well suited to dynamic environments. However, due to its focus on temporal differences in a short time frame the method rarely segments all the relevant pixels, and holes inside moving objects are a common error. An example of this method is presented in Lipton et al. [46], where connected component analysis is used to cluster the segmented pixels to actual motion regions.
• Optical Flow Characteristics of optical flow vectors can be used for motion segmentation of moving objects. While the computation of accurate flow vectors is costly, it allows the detection of independently moving objects, even in the presence of camera motion. More details about optical flow can be found in Barron's work [2]. Flow-based
tracking methods were presented by Meyer et al. for gait analysis [50; 51] and Nagel and Haag [55] for outdoor car tracking.
• Feature Detection Recently, tracking-by-detection approaches have received attention for applications that do not require real-time operation. Local invariant features such as SURF [4] or SIFT [47] are extracted from the image and further processed in additional modules such as object classification in order to detect and later track similar regions over time. Their main advantages are the robustness of local features, the data reduction for further processing and their independence of a background model, which allows such techniques to be used with a moving camera.
2. Object Classification Modules for object classification are important to identify regions of interest given by the visual observation module. In visual surveillance this module classifies the moving objects into humans, vehicles and other moving objects. It is a classic pattern recognition problem. Depending on the preprocessing results of the observation layer, different techniques can be considered. However, three main categories of methods are distinguished. They are used individually or in combination.
• Feature-detector-based Classification Detected local features can be used to train object class detectors such as ISM by Leibe [43] for general feature detectors. Bo Wu and Ram Nevatia use specific human body part detectors for the object detection and tracking [77]. Today, these approaches are generally not suitable for real-time operation. GPU accelerated implementations will benefit from a huge performance boost in the near future, which will allow some detector-based tracking methods to become real-time using high-end graphics hardware.
• Segmentation-based Classification Object classification based on image blobs is used by VSAM, Lipton et al. [46] and Kuno et al. Different descriptors are used for classification of the object's shape such as points, boxes, silhouettes and blobs.
• Motion-based Classification Periodic motion patterns are used as a useful cue to distinguish non-rigid articulated human motion from motion of other rigid objects, such as vehicles. In [46], residual flow is used as the periodicity measurement.
3. Object Tracking The object tracking module brings the observations of the same target in different frames into correspondence and thus obtains object motion. Information from all modules is combined to initiate, propagate or terminate object trajectories. Prominent mathematical methods are applied, such as the Kalman filter [35], the Condensation algorithm [33] [34] or dynamic Bayesian networks.
• Region-based Tracking Region-based tracking algorithms track objects according to image regions identified by a previous segmentation of the moving objects. These methods assume that the foreground regions or blobs contain the object of interest. Wren et al. [76] explore the use of small blob features to track a single person indoors. Different body parts such as the head, torso and all four limbs are identified by means of Gaussian distributions of their pixel values. The log-likelihood is used to assign the pixels to the corresponding part, including background pixels. The human is tracked by combining the movements of each small blob. The initialization of a newly entering person starts from a clean background. First, it identifies a large blob for the whole person and then continues to find individual body parts. Skin color priors are used to identify the head and hand blobs. Mean shift tracking is a statistical method to find local maxima in probability distributions where the correlation between the image and a shifted target region forms the probability distribution. Its maximum is assumed to locate the target position. Collins et al. [12] presented a method to generalize the traditional 2D mean-shift algorithm by incorporating scale into a difference of Gaussian mean-shift kernel.
• Feature-Based Tracking Feature-based tracking methods combine successive object detections for tracking.
General local features are extracted, clustered for object classification and then matched between images. Current methods can be classified into either causal or non-causal approaches. Non-causal approaches construct trajectories by finding the best association according to some global optimization criterion after all observations of a video are computed. Examples are the methods from Leibe et al. [44], Berclaz et al. [7], Wu and Nevatia [79]. Causal methods, in contrast, do not look into the future and only consider information from past frames. Examples for causal feature-based tracking are Okuma et al. [57], Breitenstein et al. [9].
4. Environment Modeling Tracking is done in a fixed environment which defines object and background models, coordinate systems and target movements. For static single cameras, environment modeling focuses on the automatic update and recovery of the background image from a dynamic scene. The most prominent background subtraction techniques are described in the visual observation module. The coordinate system for single camera systems is mostly 2D, avoiding a complex camera calibration. Multi-camera setups, in contrast, usually use more complex 3D models of targets and the background in real world coordinates. 3D coordinates given by a multi-camera calibration are used in various aspects of the tracking process, such as volumetric target models and movement restrictions to a common ground plane [19]. Many different volumetric 3D models are used for humans, such as elliptical cylinders, cones [52], spheres, etc. These models require more parameters than 2D models and lead to more expensive computation during the matching process. Vehicles are mainly modeled as 3D wire-frames. Research groups at the University of Karlsruhe [59] [39] [23], University of Reading [70] and the National Laboratory of Pattern Recognition [80] made important contributions to 3D model-based vehicle localization and tracking. Multi-camera methods are not directly applicable to monocular tracking.
However, some ideas and subsets of the algorithms can be adapted to single camera tracking in order to improve the tracking, especially under occlusion.
5. Occlusion Handling Occlusion handling is the general problem of tracking a temporarily non-visible object by inferring its position from other sources of information besides the image. Closely related is therefore the environment modeling and the ability to detect an occlusion. Multiple cameras with different viewpoints onto the scene cope with occlusions by data fusion [52] [37] or "best" view selection [73]. However, this work uses only a single camera and therefore focuses on methods to estimate depth ordering and 3D ground positions from environment modeling and object motion only.
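As a rough illustration of the two simplest strategies listed under Visual Observation, the following Python sketch contrasts background subtraction with temporal differencing on a single grey-scale image row. The function names, the threshold value and the toy pixel values are assumptions made for this illustration only; they are not taken from this thesis or from any of the cited systems, which use more elaborate, adaptive models.

```python
from typing import List

Image = List[List[int]]   # grey-scale image as rows of intensities (0..255)
Mask = List[List[bool]]   # True marks a pixel detected as foreground / moving


def background_subtraction(frame: Image, background: Image, threshold: int) -> Mask:
    """Mark pixels whose absolute difference to the learnt empty
    background image exceeds the threshold."""
    return [
        [abs(pixel - bg_pixel) > threshold for pixel, bg_pixel in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]


def temporal_difference(current: Image, previous: Image, threshold: int) -> Mask:
    """Same pixel-wise comparison, but against the previous frame
    instead of an empty background image."""
    return background_subtraction(current, previous, threshold)


# Toy data (hypothetical): a bright object of width three moves one pixel right.
background = [[10, 10, 10, 10, 10, 10]]
previous = [[10, 200, 200, 200, 10, 10]]
current = [[10, 10, 200, 200, 200, 10]]

print(background_subtraction(current, background, threshold=30))
# [[False, False, True, True, True, False]]
print(temporal_difference(current, previous, threshold=30))
# [[False, True, False, False, True, False]]
```

In this toy example background subtraction recovers all three object pixels, whereas temporal differencing only marks the trailing and leading edges of the uniformly colored object and leaves a hole in its interior, which is the typical error of the method noted in the text.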
1.4 Outline of the Thesis This thesis is structured into the following four chapters. Chapter 2 outlines the 2D real-time tracking method. Chapter 3 extends the previous 2D approach into a 2.5D tracking approach extending the environment modeling. Chapter 4 introduces the novel event-based tracking evaluation metric. Finally, Chapter 5 summarizes this thesis, discusses the achieved results and provides an outlook for future research in the field of real-time tracking and tracking evaluation. The Appendix lists popular tracking datasets and their specific challenges and properties.
2 2D Real-Time Tracking
2.1 Introduction
This chapter introduces the first of two methods for the detection and tracking of people in non-controlled environments. The focus throughout the thesis is on monocular tracking with a static camera. Furthermore, the proposed method adapts to changing lighting conditions, handles occlusions and newly appearing objects, and works in real-time. It uses a Bayesian approach, assigning pixels to objects by exploiting learned expectations about both motion and appearance of objects. In comparison to the general framework of visual surveillance in Figure 1.1 this 2D method lacks a module for Environment Modeling.
2.1.1 Related Work
Human tracking has a rich history as shown in the introduction in Chapter 1, and therefore we only describe the work most closely related to the presented method in this chapter. Mittal and Davis [52] developed a multi-camera system which also uses Bayesian classification. It calculates 3D positions of humans from segmented blobs and then updates the segmentation using the 3D positions. The approach owes its robustness to the use of multiple cameras and more sophisticated calculations, which cause a problem for a real-time implementation. In comparison, we developed a modular real-time tracker which solves the tracking problem with a single view only. Furthermore, the original tracker from [52] needs a fixed calibrated multi-camera setup and does not scale well due to the pairwise stereo calculations and iterative segmentation.
Capellades et al. [10] implemented an appearance-based tracker which uses color correlograms and histograms for modeling foreground regions. A correlogram is a co-occurrence matrix, thereby including joint probabilities of colors at specific relative positions. In our case, too, taking into account how colors are distributed over the tracked objects is useful, but we prefer a sliced color model instead, as will be explained. Senior et al. [65] use an appearance model based on pixel RGB colors combined with shape probabilities.

The outline of this chapter is as follows. Section 2.2 presents the overall strategy underlying our tracker, which is based on both appearance and motion models. The appearance models are explained in Section 2.3 and the motion models in Section 2.4. Section 2.5 describes the object localization and how occlusions are handled. Results are discussed in Section 2.6, an application is presented in Section 2.7, and Section 2.8 concludes this chapter.
2.2 Bayesian Per-Pixel Classification
The proposed method performs a per-pixel classification to assign the pixels to the different objects that have been identified, including the background. The probability of belonging to one of the objects is determined on the basis of two components. On the one hand, the appearance of the different objects is learned and updated, and yields indications of how compatible observed pixel colors are with these models. On the other hand, a motion model makes predictions of where to expect the different objects, based on their previous positions. Combined, these two factors yield a probability that, given its specific color and position, a pixel belongs to one of the objects. The approach is akin to other Bayesian filtering approaches, but has been slimmed down to strike a good balance between robustness and speed.

As previously mentioned, each object, including the background, is addressed by a pair of appearance and motion models. Figure 2.1 sketches this tracking framework. It incorporates different characteristics of objects such as their appearance and their motion via separate and specialized models, each updated by observing them over time. This method has several advantages for object tracking over simple foreground/background segmentation, especially when an object stands still for a while: such an object is neither mistaken for background nor does it fade into the background over time.
Figure 2.1: Tracking framework

Formally, the classification is described by (2.1) and (2.2). These equations describe how the probabilities to occupy a specific pixel are calculated and compared for different objects. These probabilities are calculated as the product of a prior probability $P_{\mathrm{prior}}(\mathrm{object})$ to find the object there, resulting from its motion model and hence the object's motion history, and the conditional probability $P(\mathrm{pixel} \mid \mathrm{object})$ that, if the object covers the pixel, its specific color would be observed.

\[ P_{\mathrm{posterior}}(\mathrm{object} \mid \mathrm{pixel}) \propto P(\mathrm{pixel} \mid \mathrm{object})\, P_{\mathrm{prior}}(\mathrm{object}) \tag{2.1} \]

\[ \mathrm{segmentation} = \arg\max_{\mathrm{object}} P_{\mathrm{posterior}}(\mathrm{object} \mid \mathrm{pixel}) \tag{2.2} \]
The tracker executes the steps in Table 2.1 for every frame. First, the prior probabilities of all objects are computed. Then they are used in the second step to segment the image with the Bayesian per-pixel classification. Each pixel is assigned to the object with the highest probability at this position. In a third step, the object positions are found by applying a connected components analysis to the segmentation image, which groups pixels assigned to the same object. The fourth step handles several special situations detected by the connected components algorithm. Missing objects are deleted if they are not occluded. Objects are split if they have multiple new object positions. New objects are initialized from the regions claimed by the generic new-object model (described in Section 2.3.3). The fifth step handles objects which are partially or completely occluded and infers their spatial extent according to the motion model. Occlusion handling is the subject of Section 2.5.3. Finally, all object models are updated. Namely, the appearance models update the color models from the pixels assigned to them in the segmentation image, and the motion models are updated according to the newly found object positions.

1. compute posterior probability of all models
2. segment image (Bayesian per-pixel classification)
3. find connected components and group them to objects
4. add, delete, split objects
5. handle occlusion
6. update all models
Table 2.1: Tracking framework algorithm
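The classification step of (2.1) and (2.2) can be illustrated with a minimal sketch. It assumes that each model already provides a likelihood image and a prior image; the dictionary layout, the model names and the toy probabilities are hypothetical and only serve the illustration.

```python
from typing import Dict, List

Image = List[List[float]]  # row-major grid of per-pixel probabilities


def classify_pixels(likelihood: Dict[str, Image],
                    prior: Dict[str, Image]) -> List[List[str]]:
    """Assign every pixel to the model with the largest unnormalised posterior.

    The posterior of (2.1) is proportional to P(pixel | object) * P_prior(object);
    the per-pixel maximisation corresponds to (2.2).
    """
    names = list(likelihood)
    height = len(likelihood[names[0]])
    width = len(likelihood[names[0]][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            best = max(names, key=lambda n: likelihood[n][y][x] * prior[n][y][x])
            row.append(best)
        labels.append(row)
    return labels


# Toy 1x3 image (hypothetical values): background, one tracked object,
# and the generic new-object model with its uniform, low probability.
likelihood = {
    "background": [[0.90, 0.10, 0.01]],
    "object_1":   [[0.20, 0.80, 0.02]],
    "new_object": [[0.05, 0.05, 0.05]],
}
prior = {
    "background": [[1.0, 1.0, 1.0]],  # uniform prior
    "object_1":   [[0.1, 1.0, 1.0]],  # high only near the predicted position
    "new_object": [[1.0, 1.0, 1.0]],  # uniform prior
}
segmentation = classify_pixels(likelihood, prior)
```

In this toy example the third pixel matches neither the background nor the tracked object well, so the uniform new-object model wins the comparison there.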
2.3 Appearance Models
In this section we introduce the color-based appearance models responsible for estimating the $P(\mathrm{pixel} \mid \mathrm{object})$ probabilities in (2.1). Different models are used for the background B, for newly detected objects $\mathcal{N}$, and for the objects $O_i$ that are already being tracked. The differences between the models reflect the different expectations of how these objects change their appearance over time. Each model provides a probability image, where pixels which match the model will have a high probability and the others a low probability. The appearance models are updated after pixels have been assigned to the different objects, based on (2.2).
2.3.1 Color Modeling with Mixtures of Gaussians
The appearance models B of the background and $O_i$ of all tracked objects are based on Gaussian mixtures in the RGB color space. Methods employing time adaptive per-pixel mixtures of Gaussians (TAPPMOGs) have become a popular choice for modeling scene backgrounds at the pixel level, and were proposed by [69]. Other color spaces than simple RGB could be considered, e.g. as described by Collins and Liu [13].

For the presented tracker the color mixture models are separated among our specialized appearance models B and $O_i$. The first and dominant Gaussian is part of a pixel-wise background model, described in the next Section 2.3.2. The remaining Gaussians of the color mixture are assumed to model foreground objects. These colors are separated from B and are part of an $O_i$, as described in Section 2.3.4 about foreground models.

For both appearance models, the probability of observing the current pixel value $X_t = [R_t\ G_t\ B_t]$ at time $t$, given the mixture model built from previous observations, is

\[ P(X_t \mid X_1, \ldots, X_{t-1}) = \sum_{k=1}^{K} w_{t-1,k}\, \eta(X_t, \vec{\mu}_{t-1,k}, \Sigma_{t-1,k}) \tag{2.3} \]

where $w_{t-1,k}$ is the weight of the $k$th Gaussian at time $t-1$, $\vec{\mu}_{t-1,k}$ is the vector of the RGB mean values, $\Sigma_{t-1,k}$ is the covariance matrix of the $k$th Gaussian and $\eta$ is the Gaussian probability density function

\[ \eta(X_t, \vec{\mu}_{t-1,k}, \Sigma_{t-1,k}) = \frac{1}{(2\pi)^{3/2}\, |\Sigma_{t-1,k}|^{1/2}}\; e^{-\frac{1}{2}(X_t - \vec{\mu}_{t-1,k})^{T}\, \Sigma_{t-1,k}^{-1}\, (X_t - \vec{\mu}_{t-1,k})}. \]

For the covariance it is assumed that the red (R), green (G) and blue (B) components are independent. While not true for real image data, such an approximation reduces the computational effort. A diagonal covariance matrix

\[ \Sigma = \begin{pmatrix} \sigma_R^2 & 0 & 0 \\ 0 & \sigma_G^2 & 0 \\ 0 & 0 & \sigma_B^2 \end{pmatrix} \]

avoids a costly matrix inversion. An example of a Gaussian mixture with K = 3 is shown in Figure 2.2.
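As an illustration of (2.3) under the diagonal covariance assumption, the following sketch evaluates a small mixture for one RGB value. With a diagonal covariance the density factorizes into three one-dimensional Gaussians, so no matrix inversion is needed. The component layout and all numeric values are assumptions made for this example only.

```python
import math
from typing import Sequence, Tuple

# One mixture component: (weight, mean RGB, per-channel variances).
Component = Tuple[float, Sequence[float], Sequence[float]]


def diag_gaussian_pdf(x: Sequence[float], mean: Sequence[float],
                      var: Sequence[float]) -> float:
    """Gaussian density for a diagonal covariance: product of 1-D Gaussians."""
    density = 1.0
    for xc, mc, vc in zip(x, mean, var):
        density *= math.exp(-0.5 * (xc - mc) ** 2 / vc) / math.sqrt(2.0 * math.pi * vc)
    return density


def mixture_likelihood(x: Sequence[float],
                       components: Sequence[Component]) -> float:
    """Weighted sum over the K components, as in (2.3)."""
    return sum(w * diag_gaussian_pdf(x, mean, var) for w, mean, var in components)


# Hypothetical mixture with K = 3 (weights sum to one).
mixture = [
    (0.6, (200.0, 30.0, 30.0), (25.0, 25.0, 25.0)),
    (0.3, (30.0, 30.0, 200.0), (25.0, 25.0, 25.0)),
    (0.1, (128.0, 128.0, 128.0), (100.0, 100.0, 100.0)),
]
p_reddish = mixture_likelihood((198.0, 32.0, 29.0), mixture)
p_greenish = mixture_likelihood((30.0, 200.0, 30.0), mixture)
```

A color close to one of the component means receives a far higher likelihood than a color that none of the components explains.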
2.3.2 Background Model
The algorithm by [69] was originally designed to combine foreground and background Gaussians together into one model, where the foreground Gaussians have lower weights. Due to the separation of foreground and background pixels into different models in the presented approach, there is no more need to model foreground colors within B. During experiments, a single Gaussian has shown to be sufficient. The background colors are modeled with one Gaussian per pixel with evolving $\vec{\mu}_t$, assuming the camera is static. Such a simplification is beneficial for a faster frame-by-frame computation and it simplifies the initialization and training at the beginning with a single empty background image.

Going one step further, the background model uses the same fixed diagonal covariance matrix everywhere, which again speeds up the computations without sacrificing too much accuracy. A fixed color variance is sufficient to handle global image noise and possible compression artifacts. More complex background models are discussed in the related work section. For tracking purposes, the single Gaussian model showed satisfactory results given the limited processing time available in real-time applications.

The covariance matrix is set beforehand and the mean vectors are initialized based on the values in an empty background image. Thereafter, the pixels $X_t$ segmented as background are used for updating the corresponding color model, where $\alpha$ is the learning rate:

\[ \vec{\mu}_{t,k} = (1 - \alpha)\, \vec{\mu}_{t-1,k} + \alpha X_t. \]

The background is updated only in visible parts.
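The running-average update of the background means, restricted to visible pixels, might look as follows. This is a sketch under assumptions: the label value 0 for background and the default learning rate are hypothetical, and the images are plain nested lists rather than an optimized image type.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]


def update_background(mean: List[List[Color]],
                      frame: List[List[Color]],
                      labels: List[List[int]],
                      alpha: float = 0.05,
                      background_label: int = 0) -> List[List[Color]]:
    """Apply mu_t = (1 - alpha) * mu_{t-1} + alpha * X_t to background pixels.

    Pixels assigned to any other model are occluded and keep their old mean.
    """
    for y, row in enumerate(frame):
        for x, pixel in enumerate(row):
            if labels[y][x] != background_label:
                continue  # occluded by a foreground object
            mean[y][x] = tuple((1.0 - alpha) * m + alpha * c
                               for m, c in zip(mean[y][x], pixel))
    return mean


# Toy example: the second pixel is covered by object 1 and is not updated.
bg_mean = [[(100.0, 100.0, 100.0), (100.0, 100.0, 100.0)]]
new_frame = [[(200.0, 200.0, 200.0), (200.0, 200.0, 200.0)]]
seg_labels = [[0, 1]]
bg_mean = update_background(bg_mean, new_frame, seg_labels, alpha=0.1)
```

The occluded pixel keeping its old mean is exactly the behaviour that later leads to outdated background colors after illumination changes.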
2.3.3 Finding New Objects
The tracker detects newly appearing objects as part of the segmentation process. Their creation is based on a generic 'new object model' $\mathcal{N}$. This appearance model has a uniform, low probability $p_{\mathcal{N}}$. Thus, when the probabilities of the background and all other objects drop below $p_{\mathcal{N}}$, the pixel is assigned to $\mathcal{N}$. Typically, this is due to the following reasons:
• A new object appears in front of the background and the background probability drops.
• The pixel is on the "edge" of an object and its value is a mixture of the background and foreground color.
• The foreground model does not contain all colors of the object.

A new foreground model $O_i$ is initialized as soon as a region of connected pixels has a minimal size. Some rules have been established to avoid erroneous initializations:
• Objects entering the image from the sides are not initialized until they are fully visible, i.e. until they are no longer connected with the image border. This prevents objects from being split into multiple parts.
• Pixels assigned to $\mathcal{N}$ which are connected with a foreground object $O_i$ are not used for the new object detection. Instead, it is assumed that $O_i$ is not properly segmented. If the number of these pixels exceeds 20% of $O_i$, they are added to $O_i$. The threshold is due to the assumption that a smaller number of pixels is a result of pixels on the "edge" of the object.

$\mathcal{N}$ also has a mechanism for cleaning itself up by reducing the probability even below $p_{\mathcal{N}}$ for those pixels which are assigned to $\mathcal{N}$ for a longer period of time. This mechanism is essential to keep the new object model clean from noise blobs and it increases the accuracy of the background model. $\mathcal{N}$ is initialized at start-up with $p_{\mathcal{N}}$.

Figure 2.2: Example for a Gaussian mixture with three RGB Gaussians with different mean, variance and weight. (Horizontal axis: pixel intensities [0, 255].)
2.3.4 Foreground Models
[...] as visualized by Figure 2.3. The number K of Gaussians per mixture is the same for all slices and typically ranges from 2 to 5. More Gaussians are able to model objects with more diverse colors, but at the cost of slowing down the algorithm.

Figure 2.3: Constructing a Gaussian mixture for a whole slice (panels: Histogram, Gaussian Mixture)

The approach of slicing a foreground object was previously proposed by Mittal and Davis [52] for 3D multi-camera tracking. Figure 2.4 shows an example of a sliced object model. Dividing an object into horizontal regions is an effective way of modeling approximately cylindrical objects like humans. It assumes that the color characteristics within a given slice do not change with object rotation around the vertical axis. That being said, the sliced object model does adapt to such occasional changes, but horizontal slicing tends to at least limit them and therefore keeps the model more discriminative. As long as the object does not rotate around other axes and the number of Gaussians K is big enough to represent all colors at a particular height, the sliced color model can adapt.

Updating the color model for one slice is done by replacing the old mixture with a new one of equal K. This is a fast solution for updating the mixture without a costly iterative EM algorithm. The classification prevents the model from changing too fast, as only colors similar to the previous model are segmented. The adaptation to new colors is provided by the 20% rule of $\mathcal{N}$, as described in Section 2.3.3.
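The horizontal slicing can be sketched as a simple binning of an object's pixels by their height inside the bounding box. The sketch only covers this binning step; how the per-slice mixture is rebuilt from the collected colors is not shown. The function names, the pixel tuple layout and the default of seven slices (taken from the example in Figure 2.4) are assumptions.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]
Pixel = Tuple[int, int, Color]  # x, y, (R, G, B)


def slice_index(y: int, box_top: int, box_height: int, num_slices: int = 7) -> int:
    """Map an image row inside the bounding box to a horizontal slice index."""
    index = (y - box_top) * num_slices // box_height  # integer arithmetic only
    return min(num_slices - 1, max(0, index))


def colors_by_slice(pixels: List[Pixel], box_top: int, box_height: int,
                    num_slices: int = 7) -> List[List[Color]]:
    """Collect the colors of an object's pixels per horizontal slice."""
    slices: List[List[Color]] = [[] for _ in range(num_slices)]
    for _x, y, color in pixels:
        slices[slice_index(y, box_top, box_height, num_slices)].append(color)
    return slices


# Two pixels of a 70-row object: one at the top, one at the bottom.
binned = colors_by_slice([(5, 0, (255, 0, 0)), (5, 69, (0, 0, 255))],
                         box_top=0, box_height=70)
```

Each list in `binned` would then be the input for constructing that slice's new mixture with the fixed number K of Gaussians.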
Figure 2.4: Sliced object model. (a) original image, (b) 7 slices, (c) probability image

2.4 Motion Model
This section describes the motion model which is responsible for the $P_{\mathrm{prior}}$ probability in (2.1). As with appearance models, each object has its own motion model. The probability of the object position is high in an area where the object is expected to appear in the current frame. Both B and $\mathcal{N}$ use a simple model with uniform probability and without any tracking capabilities.
All objects $O_i$ use an individual linear Kalman filter with a constant velocity model for tracking the 2D image position. An object is represented by its bounding box, given by the box center $x, y$ and size $H_x, H_y$. The probability that the object may be present at the different pixels is shown in Figure 2.5(b). The probability is uniformly high inside the area of the bounding box and drops off linearly across a border region outside the bounding box. The border region enlarges the region where the new object position is expected. This adds a safety margin around the object location predicted by the Kalman filter, while at the same time catering for a possible growth of the object size. The border width is the same for all objects in all directions. This is important in the case of occlusion, where joining and separation work best if both objects have the same probability border size.
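A position prior of this shape could be computed per pixel as in the following sketch. It assumes that $H_x$ and $H_y$ are measured from the box center to the box edge (half-sizes) and that the border falls off to zero; the probability levels and the rectangular (Chebyshev) distance are further assumptions of this illustration.

```python
def position_prior(x: float, y: float,
                   cx: float, cy: float, hx: float, hy: float,
                   border: float,
                   p_inside: float = 1.0, p_outside: float = 0.0) -> float:
    """Prior at pixel (x, y) for a box with center (cx, cy) and half-sizes (hx, hy).

    Uniformly high inside the box, linear drop-off across a border of the
    given width, constant low value beyond the border.
    """
    # Distance outside the box along each axis (zero when inside).
    dx = max(0.0, abs(x - cx) - hx)
    dy = max(0.0, abs(y - cy) - hy)
    distance = max(dx, dy)  # keeps the border region rectangular
    if distance >= border:
        return p_outside
    return p_inside - (p_inside - p_outside) * distance / border


inside = position_prior(50, 50, cx=50, cy=50, hx=10, hy=20, border=10)
halfway = position_prior(65, 50, cx=50, cy=50, hx=10, hy=20, border=10)
far_away = position_prior(80, 50, cx=50, cy=50, hx=10, hy=20, border=10)
```

Evaluating this function for every pixel yields a prior image of the kind shown in Figure 2.5(b).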
After the per-pixel classification, updated object positions are obtained from the minimal enclosing rectangles of the pixels assigned to the objects. The box center $x, y$ is used for updating the Kalman filter. $H_x$ and $H_y$ are incorporated directly from the latest detection. In order to improve the stability of the tracking during sudden changes in size, we limit the maximum change of $H_x$ and $H_y$ to 25% of their previous values.
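The two update rules of this paragraph can be sketched as follows: a constant velocity Kalman filter for one coordinate of the box center, and the 25% limit on size changes. Treating x and y as two independent one-dimensional filters, the unit time step and the noise parameters `q` and `r` are assumptions of this sketch, not values taken from the tracker.

```python
class ConstantVelocityKalman1D:
    """Kalman filter with state (position, velocity) and time step 1."""

    def __init__(self, position: float, q: float = 1.0, r: float = 4.0):
        self.x, self.v = float(position), 0.0
        self.p = [[10.0, 0.0], [0.0, 10.0]]  # state covariance (assumed start)
        self.q, self.r = q, r                # process / measurement noise

    def predict(self) -> float:
        (p00, p01), (p10, p11) = self.p
        self.x += self.v  # position advances by the velocity
        # P = F P F^T + Q with F = [[1, 1], [0, 1]] and Q = q * I
        self.p = [[p00 + p10 + p01 + p11 + self.q, p01 + p11],
                  [p10 + p11, p11 + self.q]]
        return self.x

    def update(self, measured_position: float) -> None:
        (p00, p01), (p10, p11) = self.p
        s = p00 + self.r              # innovation variance
        k0, k1 = p00 / s, p10 / s     # Kalman gain for position and velocity
        residual = measured_position - self.x
        self.x += k0 * residual
        self.v += k1 * residual
        self.p = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]


def limit_size_change(previous: float, measured: float,
                      max_change: float = 0.25) -> float:
    """Clamp a new box size to within +/- 25% of the previous value."""
    lower = previous * (1.0 - max_change)
    upper = previous * (1.0 + max_change)
    return min(upper, max(lower, measured))


# An object moving two pixels per frame along one axis.
kf = ConstantVelocityKalman1D(position=0.0)
for frame in range(1, 31):
    kf.predict()
    kf.update(2.0 * frame)
```

After a few frames the filter has picked up the velocity, so its prediction can carry the object position through frames without a usable observation.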
Figure 2.5: Object position priors, from the original image in Figure 2.4(a). (a) bounding box coordinates, (b) probability image
2.5 Object Localization
This section describes the steps needed to find the object positions based on the segmentation, as described in Section 2.2. Equation 2.2 defines the segmentation as a per-pixel classification. Object positions are found based on connected components, as described in Section 2.5.1. In order to remove small components, the segmentation is filtered beforehand to reduce noise. 3x3 filter masks are used in a manner similar to a morphological operator to remove noise pixels; two cases are shown in Figure 2.6.
Figure 2.6: Two examples of the 3x3 filter masks for the noise filtering. The filter on the left removes single pixel noise. The right filter mask is one example of a filter removing noise in the case of seven uniform and two differing pixel values.
Adjustments are made if either one single isolated pixel or two different pixels are found among seven uniform segmented pixels. This eliminates most artifacts caused by image noise and speeds up the connected components algorithm which is described next.
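The first of the two filter cases, the removal of a single isolated pixel, can be sketched as below. The second case (two differing pixels among seven uniform ones) depends on the concrete masks and is not reproduced here. Leaving the one-pixel image border untouched is an assumption of this sketch.

```python
from typing import List


def remove_isolated_pixels(labels: List[List[int]]) -> List[List[int]]:
    """Relabel pixels whose eight neighbours all share one other label."""
    height, width = len(labels), len(labels[0])
    result = [row[:] for row in labels]  # read from the input, write to a copy
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbours = [labels[y + dy][x + dx]
                          for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                          if (dy, dx) != (0, 0)]
            first = neighbours[0]
            if labels[y][x] != first and all(n == first for n in neighbours):
                result[y][x] = first
    return result


noisy = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
cleaned = remove_isolated_pixels(noisy)
```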
2.5.1 Connected Components
The connected components algorithm plays a central role in the tracking process as it converts the low level segmentation image to object positions and sizes to be used by the higher level tracking models. The algorithm used has an optimized running time of O(N), where N is the number of pixels. It processes one or two image lines at a time, which is beneficial for fast computations. All operations can be carried out on a small part of the image. This maintains "cache locality", which speeds up the computation on modern CPUs because the data to be processed fits fully into the fast memory cache of the CPU. The algorithm does one forward pass for each line and labels the target image according to information gathered during the pass. The method uses a union-find data structure to build an internal representation of the connectivity of the image during the pass. Figures 2.7(a) and 2.7(b) show an example segmentation image with two foreground objects.

Figure 2.7: The connected components algorithm finds closed regions as shown in (a) and (b). The grouping of these regions into final objects is shown in steps (c) and (d). Panels: (a) Segmentation image, (b) Multiple connected components, (c) Grouping of components to objects, (d) Final object positions
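The role of the union-find structure can be illustrated with the classic two-pass labeling scheme. This is a textbook variant with 4-connectivity on a binary mask, given here as a sketch; it is not the optimized line-based implementation described above.

```python
from typing import List, Tuple


def connected_components(mask: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Label 4-connected regions of non-zero pixels; label 0 is background."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    parent = [0]  # parent[i] is the union-find parent of provisional label i

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # First pass: assign provisional labels and record equivalences.
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up and left:
                labels[y][x] = min(up, left)
                union(up, left)
            elif up or left:
                labels[y][x] = up or left
            else:
                parent.append(len(parent))  # new label is its own root
                labels[y][x] = len(parent) - 1

    # Second pass: replace provisional labels by consecutive final labels.
    final = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = final.setdefault(root, len(final) + 1)
    return labels, len(final)


# A U-shaped region: its two arms only meet in the last row.
u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
u_labels, u_count = connected_components(u_shape)
```

The U-shaped example shows why the equivalence bookkeeping is needed: the two arms receive different provisional labels and are merged only when the bottom row connects them.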
2.5.2 Grouping Blobs to Objects and Adding New Objects
The connected components algorithm usually finds multiple components of each object. Therefore, the real object position has to be reconstructed by connecting overlapping and/or close regions together, which is shown in Figures 2.7(c) and 2.7(d). For our tracker, we define a maximal distance for combining close connected components to final object positions. These new object positions are then compared with the objects from the last frame.
During this comparison it is possible to detect object splits or disappearance if the number of objects changes. In case of object splitting, the largest object in terms of number of pixels is matched to the known object $O_i$. The remaining smaller objects are removed from $O_i$ and assigned pixel-wise to $\mathcal{N}$ in the segmentation image. As described in Section 2.3.3, they may be initialized as new foreground objects if they meet the requirements of $\mathcal{N}$.

Figure 2.8: Eight cases of occlusion are differentiated. The hatched rectangle represents the object in front. The bounding box of the object behind is represented by solid (=valid) and dashed (=invalid) lines.
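The grouping of overlapping or close components into object positions can be sketched with bounding boxes. The distance measure (the largest axis-wise gap between two boxes) and the greedy merging loop are assumptions of this illustration; the text only states that a maximal distance is used. For brevity, the sketch also ignores which object the components were assigned to.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x_min, y_min, x_max, y_max


def box_gap(a: Box, b: Box) -> int:
    """Largest axis-wise gap between two boxes; 0 if they overlap or touch."""
    dx = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return max(dx, dy)


def group_boxes(boxes: List[Box], max_distance: int) -> List[Box]:
    """Merge boxes until no two remaining boxes are within max_distance."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if box_gap(merged[i], merged[j]) <= max_distance:
                    a, b = merged[i], merged[j]
                    merged[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan with the enlarged box
    return merged


components = [(0, 0, 10, 10), (13, 0, 20, 10), (100, 100, 110, 110)]
objects = group_boxes(components, max_distance=5)
```

The first two components are three pixels apart and are merged into one object position, while the distant third component stays separate.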
2.5.3 Occlusion Handling
The Bayesian classification tracker handles occlusion between foreground objects by comparing the estimated object positions with the actual observations. The lack of image depth information (z coordinate) is partially compensated by assuming a horizontal and upright camera orientation, as well as a planar floor. In this case, objects closer to the camera have a higher bottom line value, bottom_line = $y + |H_y|$. In the special case when the bottom line values of two objects are equal, the taller object is assumed to be in front of the smaller one. This special case occurs when two objects are behind an obstacle or when the lower part of the objects is outside the field of view.

The occlusion handling works in two steps: occlusion analysis and occlusion interpretation. During the analysis, each predicted object position is checked against each object in the current image. If two objects are overlapping, the type of the occlusion is classified as one of eight possible cases (Figure 2.8). Depending on the case, some or all of the edges of the new object position box are affected by the occlusion. Those edges which are occluded are not used for updating the object position.

The occlusion interpretation then updates the object position by applying only the valid edges of the current observation. Missing edges are reconstructed by using the object width and height from previous images without occlusion. The position of a completely occluded object is based only on the position predicted by the constant velocity assumption of the Kalman filter. The accuracy of predicting completely occluded objects could be increased by refining this model.
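The depth ordering rule can be written down in a few lines. The sketch assumes image coordinates with y growing downwards; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrackedBox:
    y: float   # vertical box center in image coordinates (y grows downwards)
    hy: float  # vertical size term H_y as used in the bottom line formula

    @property
    def bottom_line(self) -> float:
        return self.y + abs(self.hy)


def is_in_front(a: TrackedBox, b: TrackedBox) -> bool:
    """Return True if object a is assumed to be closer to the camera than b."""
    if a.bottom_line != b.bottom_line:
        return a.bottom_line > b.bottom_line
    # Equal bottom lines: the taller object is assumed to be in front.
    return abs(a.hy) > abs(b.hy)


near = TrackedBox(y=150.0, hy=40.0)   # bottom line 190
far = TrackedBox(y=100.0, hy=30.0)    # bottom line 130
tall = TrackedBox(y=100.0, hy=50.0)   # bottom line 150
short = TrackedBox(y=120.0, hy=30.0)  # bottom line 150
```

With the ordering decided, the occlusion analysis knows which object's edges may be hidden and should not be used for the position update.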
2.6 2D Tracking Results and Discussion
The presented tracking framework has the following advantages:
• The separation of foreground and background into different and discrete models makes it possible to track moving and motionless objects in front of an adaptive background.
• The use of adaptive color modeling with mixtures of Gaussians enables the background and foreground models to adapt individually to changing lighting conditions.
• Occlusion handling can be implemented at the high level of object positions. Objects under occlusion can be tracked with the help of accurate foreground models and per-pixel classification.
• The current implementation works in real-time at QVGA (320x240) resolution.

In this section, results on selected datasets, such as those from PETS 2001, are shown. A general list of datasets with their characteristics and difficulties is given in Appendix A.
2.6.1 PETS 2001 Test Dataset 3
This section demonstrates several of the above-mentioned advantages using the dataset provided by the Second IEEE International Workshop on Performance Evaluation of Tracking and Surveillance 2001. Test dataset 3 with 5336 frames is used, which is a challenging sequence in terms of multiple targets and significant lighting variation. The sequence is down-scaled to a resolution of 320x240, which allows the tracker to run in real-time.

No objects enter the scene in the first 1200 frames. As expected, the tracker adapts to the changing lighting conditions from bright sunshine to dull weather conditions and no false objects are detected, showing the strength of the adapting background model.

In Figure 2.9(a) a group of two people enters the field of view in the lower right corner. The whole group is detected as one object and is tracked throughout the image. The tracker is not designed to separate grouped objects as long as the group is not visually disconnected, due to the use of connected components for building objects. Furthermore, the object detected by the tracker includes the shadow cast by the object on the ground. This is an expected result as the background model has no special shadow removal capabilities. More complex methods to remove shadows are referenced and described in Section 1.3.

In Figure 2.9(b) a single person enters the scene in the lower left corner in the shadow of a tree. The person is correctly detected despite being hidden in the shadow. The object model adapts to the brighter colors of the person when he steps into the sunlight, showing the adaptiveness of the foreground models. The person is correctly tracked until he leaves the image. Figures 2.9(a) and 2.9(b) show the high accuracy and the fast detection of new objects.
Figure 2.9: PETS 2001 test dataset 3, part 1. (a) Frame 1382, (b) Frame 1415, (c) Frame 2401

In Figure 2.9(c) four people walk into the scene individually. A total of five people are tracked correctly during this phase of the sequence. The tracker is able to follow the group of two people who walk further apart between the cars until they shrink to small dots, where the tracker reaches its resolution limits. The low number of pixels is insufficient for updating the color models, and the object model gradually adapts to the background, where the foreground model remains even after the two people have disappeared.

Five people enter in Figure 2.10(a) from the right side. Due to the shadows and the close distance between the people, the group is detected as one large object. In the following frames, when the distance between some members of the group increases, the group is split into three objects, containing one single person and two couples. After the split, six object-trackers are tracking eight people in Figure 2.10(b). Computational time reaches its maximum at 55 milliseconds per frame on a 3 GHz P4.

Figure 2.10: PETS 2001 test dataset 3, part 2. (a) Frame 2510, (b) Frame 2701

The frames up to number 3301 are sometimes a bit chaotic due to multiple occlusions of entire groups of objects. During this phase some object tracks are misled by other objects and new objects are initialized on the old object positions. However, the tracker correctly recognizes that three objects have left the image on the left border, that a group of four people is walking between the cars to the back and that a cyclist passed through the scene. The changing lighting condition of the disappearing sun has no negative effect on the tracker.

Figure 2.11(a) shows that crossing people are tracked correctly by the occlusion handling only if no more than two objects are involved in the occlusion. The tracker distinguishes between the two objects, even under occlusion, and the object positions are correctly determined.
Figure 2.11: PETS 2001 test dataset 3, part 3. (a) Frame 4000, (b) Frame 4201

A fast illumination change in Figure 2.11(b) shows the limitations of the adaptive background for those parts of the image which are covered by an object. While visible background pixels adapt, the background color models of pixels behind the objects are not updated. This results in an outdated background model when the object leaves its position. The outdated background model leads to the initialization of a new foreground object containing only background and no real objects.
2.6.2 Occlusion Handling Limitations
Occlusion handling is a critical part of 2D tracking, where objects can completely disappear behind other objects. Figure 2.12 shows the crossing of two people, which is successfully resolved by the tracker. In the segmentation image the classification of the pixels to one of the objects can be seen. As soon as the two objects overlap in frame 634, the object models are no longer updated and the object positions are processed by the occlusion handling. This results in a fixed object width of the rear object. The tracking in x direction is now based only on the object parts which are still visible on the left side in frame 634 and on the right side in frame 643. During the complete occlusion of the upper and lower bounds of the object behind, the y coordinate value is based only on the prediction of the Kalman filter. The x coordinate is taken from the actual observation, because the object behind is wider than the object in front.
Figure 2.12: Tracking during partial and complete occlusion. (a) Frame 621, (b) Frame 628, (c) Frame 634, (d) Frame 638, (e) Frame 643, (f) Frame 648

However, the success of handling occlusions is directly related to the distinctiveness of the color probabilities drawn from the two objects.
2.6.3 Illumination Changes
This section shows the adaptiveness of the tracker to changing lighting conditions. Generally, all foreground and background models are adaptive to color changes. However, the advantage of per-pixel classification and per-pixel color models in the background can also cause problems if the model cannot be updated due to occlusion. The background model is only updated in visible image regions. Regions occluded by a foreground object are not updated and therefore cannot adapt to illumination changes. As soon as the foreground object moves away, the background model is outdated and does not match the new colors of the reappearing background. If the appearance changed only slightly, within the color variance, these pixels will still be segmented as background. Bigger changes result in the wrong detection of foreground objects.

In the sequence shown in Figure 2.13, the change in the image is much larger than the variance of the background model allows, so the reappearing background is not handled correctly. A new object is initialized at the position where the foreground object occluded the background during the illumination change. The sequence shows a person walking to the light switch and turning on the lights. As he walks away, a new object is initialized on the background.
2.6.4 Computational Effort
This section discusses the computational effort of the implemented tracker. First the theoretical complexity of the individual steps of the tracking algorithm is discussed, then the time taken by the PETS sequence is measured on a 3 GHz Pentium IV machine with an image resolution of 320x240. The computation time for a single frame depends on several aspects. Table 2.2 shows the different steps of the algorithm and their computational effort. The computation time is a linear function of the number of objects obj multiplied by the number of pixels N of an individual object. In other words, the computation time scales with the pixel count of the objects and therefore grows quadratically with their linear dimensions (width x height). The algorithm is fast for small object sizes but slows down for large objects. Figure 2.14 shows the computational effort of the tracker over the whole sequence starting at frame 1200.
Figure 2.13: Limitation: The background model only adapts where it is visible. (a) Frame 621, (b) Frame 628, (c) Frame 634, (d) Frame 638, (e) Frame 643, (f) Frame 648
Table 2.2: Computational effort of the different parts of the algorithm

Step | Average share | Complexity | Description
Posterior probabilities | 46% | O(obj · N) | The computation of the posterior probability of all models depends on the size of the models in pixels and on the number of models. The background model takes about 2 ms. The sliced object models can take up to 60 ms if one model covers the whole image.
Segmentation | 7% | O(obj) | The per-pixel classification depends only on the number of objects. Each object needs about 1 ms.
Connected components | 23% | O(N) | The connected components algorithm is proportional to the area which is covered by foreground objects.
Object grouping and occlusion handling | ~1% | O(obj) | Computation time for this part of the algorithm is insignificant in comparison to the other steps and does not affect the performance.
Update models | 23% | O(obj · N · γ) | The model update depends on the number of objects times their pixels given to the EM algorithm in the sliced foreground object models. However, color models are not updated at every time step; larger changes in the appearance of a slice trigger the update of a color model. Changes are detected by measuring the average per-pixel probability of segmented pixels. Such updates depend on the sequence and are modeled here by γ. The given percentage is a measured average.
Figure 2.14: Computational time and the number of objects, including the background model, for the PETS 2001 Test Dataset 3. (Both plots: horizontal axis is the frame number, from 1000 to 5500.)
2.7 Application: Privacy in Video Surveilled Areas
The presented tracking method was used for a privacy application in video surveilled areas [68; 67]. This section briefly presents the results of the collaboration with the chair of computer aided architectural design as part of the larger research project Blue-C 2 [18], where the possibilities of conventional and 3D video cameras are explored. A prototype system was built for self-determination and privacy in video surveilled areas by integrating computer vision and cryptographic techniques into networked building automation systems. The prototype allowed persons to control their visibility in a video stream via AES encryption, to allow either the real view or an obscured picture at their position in the video. A filter is used on the video stream that obscures or removes the parts showing a specific person. A person can control the displayed image of himself and can decide for each viewer if the viewer is allowed to see the clear image or a restricted version with obscured image regions.

Our first scene contains two people entering a meeting room for a discussion. Figure 2.15 shows the camera view with the two persons and a bounding box around them showing the successful tracking by the algorithm. Due to our multi-model tracking framework the persons are detected throughout the entire meeting without fading into the background.
Figure 2.15: Multi-person tracking in a conference room As shown in the previous results in Sections 2.6.1 and 2.6.2 the tracking method is not able to fully model all possible events happening in front of a camera, nor is it able to have an in-depth understanding of the scene due to limited computational resources. For the segmentation based approach, all changes in the image of a reasonable size not related to persons are a source of detection Or tracking errors. The projection screen and the open door, in Figure 2.16 are typical examples. During everyday operation of this system, we also discovered, that it is not sufficient to use only a single technology to provide an accurate security standard. On one hand the tracking can fail due to the above mentioned limitations, on the other hand the reliable identification of a person under surveillance could not be achieved by a vision system. Therefore, a bar code scanner was used. But even with a bar code on an ID card, the observed person must be willing to provide his or her identity truthfully. The prototype system shows that visual
tracking alone is not fully sufficient in our scenario. But it enabled us to use tracking for the protection of privacy and to better balance the power between the observer and the observed.
Figure 2.16: Large changes in the background can lead to wrong objects.
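The per-viewer filtering described in this section can be illustrated with a minimal sketch. It assumes a grayscale frame stored as a list of rows and uses block averaging (pixelation) as the obscuring filter; the function names, the permission table and the block size are hypothetical, and the AES encryption step of the prototype is not shown.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def obscure_region(image: List[List[int]], box: Box, block: int = 8) -> List[List[int]]:
    """Return a copy of the image with the box replaced by block averages."""
    top, left, bottom, right = box
    out = [row[:] for row in image]
    for r0 in range(top, bottom, block):
        for c0 in range(left, right, block):
            r1, c1 = min(r0 + block, bottom), min(c0 + block, right)
            cells = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            mean = sum(cells) // len(cells)
            for r in range(r0, r1):
                for c in range(c0, c1):
                    out[r][c] = mean
    return out


def view_for(viewer: str, image: List[List[int]], tracked: Dict[int, Box],
             permissions: Dict[int, Set[str]]) -> List[List[int]]:
    """Obscure every tracked person who has not granted this viewer the clear view."""
    out = image
    for person_id, box in tracked.items():
        if viewer not in permissions.get(person_id, set()):
            out = obscure_region(out, box, block=8)
    return out
```

In such a setup the bounding boxes would come from the tracker, so any tracking failure directly becomes a privacy failure, which is the limitation discussed above.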
2.8 Summary and Conclusion
This chapter of the thesis presented a framework for the fast tracking of multiple objects or people in monocular video sequences. The implemented tracker can handle changing object and background appearances, as well as newly appearing objects and occlusions. However, the tracker can still be improved in several ways:
• The time-adaptiveness of the models can only be ensured in visible parts of the image. Occluded objects are not updated. In particular, the background model is sensitive to outdated color models in areas where it was occluded by a static foreground object. In such a case, changes observed in the visible part could be used to infer changes at occluded pixels.
• The occlusion handling performs well for two objects, but fails in crowded scenes with multiple objects occluding each other. The performance is
directly linked to the uniqueness of each foreground model, which is more doubtful in case of multiple overlaps.
• Object sizes are unbounded and tend to expand over more than one object during occlusion. Similarly colored objects tend to be covered quickly by one of the foreground models completely, while the second model disappears. Knowledge about the expected object size given by a camera calibration could prevent unnatural object growth.
• Object localization and sizes depend only on the performance of the connected components algorithm. Even small blobs (possible errors) in the segmentation influence the object localization significantly.
• The background is assumed to be static or changing its appearance gradually due to lighting changes. These restrictions were sufficient for most of our experiments. However, background pixels should be given more sophisticated appearance models to deal with a larger variety of changes (e.g. waving tree branches, flickering lights). This would allow the use of the tracker in less controlled environments.
3 Extended 2.5D Real-Time Tracking
In continuation of the previous chapter, this chapter presents several extensions to our real-time tracking method including experiments with a surveillance system prototype as well as public datasets. Most prominent is the extension from the pure 2D approach to a 2.5D tracker addressing some of the limitations discussed in the last chapter. The new modifications incorporate scene context like the real-world scale and the ground plane into the monocular camera observations. While this leads to 3D world coordinates of object positions, the name 2.5D is used, as the coordinates are the outcome of a special interpretation of a single 2D camera view.
3.1 Introduction and Motivation
Multi-object tracking for visual surveillance is a very active research topic in the computer vision community, as shown in the introduction in Chapter 1, as well as in several surveys by Hu et al. [32], Moeslund et al. [54], and Valera and Velastin [74]. The presented extensions to the 2D monocular real-time visual tracking method from Chapter 2 are motivated by the findings and conclusions described in Section 2.8. More specifically, the limitations in the occlusion handling, the estimation of the correct object size, as well as the accuracy of the object localization are addressed by the extension from 2D into a 2.5D approach. The 2.5D approach allows us to improve our previous method [60] by the following main contributions:
• New objects are detected by accumulating evidence on the calibrated ground plane by mapping foreground image regions to the real-world positions.
• Different object classes can be recognized by their distinctive footprint on the ground plane.
• Object segmentation is improved by means of an iterative object placement process. Knowledge of already-found objects closer to the camera is used to refine the prior probabilities of objects further away.
Furthermore, the experience gained during the development of the on-line prototype system for privacy in video surveilled areas, as described in Section 2.7, was helpful for the development of the new prototype systems in Section 3.5.
The rest of the chapter is organized as follows: Section 3.2 briefly describes the most closely related work with respect to the extensions. In Section 3.3 we introduce the multi-object tracking method with all its new extensions. Results are then given in Section 3.4, while new prototype systems are described in Section 3.5. Finally, Section 3.6 discusses our findings and draws a conclusion.
3.2 Previous Work
This section gives an overview of the most relevant related work with respect to the new extensions and contributions. A comprehensive summary of previous work in the domain of visual surveillance can be found in Section 1.3. In comparison to the general framework of visual surveillance in Figure 1.1, this 2.5D method adds a module for Environment Modeling to the method from the previous chapter. Closely related to the presented work are the following two publications on real-time trackers.
Zhao et al. [83] present a real-time multi-camera tracker able to robustly down-project foreground pixels onto the assumed ground plane thanks to the use of stereo cameras. Their stereo segmentation and tracking techniques inspired this development of a monocular version which does not require special stereo surveillance cameras. Due to the available 3D data, their
algorithm uses mean shift on a quantized ground grid for tracking and segmentation. In contrast, this method specially interprets the 2D image information, tracks objects with a maximum window method and segments the image, exploiting inter-object occlusion information.
Lanz et al. [42] have developed a hybrid joint-separable formulation to model the joint state space of a multi-object tracker. While the approach is efficient and robust, especially during occlusion, their histogram model needs a careful initialization from different views prior to tracking. This method, in contrast, learns a less specific but sufficient appearance model from only one view at any spot in the image.
3.3 2.5D Multi-Object Tracking
In this section, the real-time multi-object tracking algorithm from Chapter 2, which is used as an initial step in the new algorithm, is first recapitulated. Secondly, the extensions made to the object models, the new object detection and the image segmentation are presented in detail.
The proposed method inherits a similar per-pixel classification as a first visual observation step. It assigns every pixel to one of the different objects that have been identified, including a background. The classification is based on the probability that a given pixel belongs to one of the objects given its specific color and position. The object probabilities are determined on the basis of two components. First, the appearance of the different objects is learned and updated, and yields indications of how compatible observed pixel colors are with these models. Secondly, a motion model makes predictions of where to expect the different objects based on their previous positions. While keeping the 2D per-pixel segmentation, akin to similar Bayesian filtering approaches, the tracker is extended by a novel object-class detection, an iterative segmentation refinement, and a camera calibration.
For every object, individual appearance and motion models are incorporated and updated over time. Figure 3.1 sketches the per-pixel classification as a result of the appearance probability images for each object. The segmentation has been simplified from the previous method by dropping the pixel-wise motion probabilities at this stage.
Formally, the classification is described by Equations 3.1 and 3.2 below. The probability of a pixel belonging to the ith object $o_t^i$ at time t is determined on
the basis of an observation likelihood given the associated object model. For all known objects including the background, their probability of occupying a specific pixel location is calculated and compared. Using Bayes' law, we can compute the posterior as a product of the appearance likelihood and a prior probability.

$$P_{\text{Posterior}}(o_t^i \mid \text{pixel}_{1:t}) \propto P(\text{pixel}_t \mid o_t^i)\, P_{\text{Prior}}(o_t^i \mid \text{pixel}_{1:t-1}) \tag{3.1}$$

$$\text{segmentation} = \max_{\text{object}} \bigl( P_{\text{Posterior}}(o_t^i \mid \text{pixel}_{1:t}) \bigr) \tag{3.2}$$

• State vector for n objects: $\text{objects} = \{o^0, o^1, \ldots, o^n\}$
• Appearance model (likelihood): $P(\text{pixel}_t \mid o_t^i)$

The major difference to the previous method is that the prior probability $P_{\text{Prior}}$ takes inter-object occlusion into account instead of individual motion prediction. For the initial segmentation, the prior probabilities are the same for all objects, as their exact position and thus possible occlusion are not known. The segmentation described by Equation 3.2 assigns every pixel to the object with the maximum posterior probability. In the initial segmentation this only takes the appearance likelihoods given by our color models into account. The likelihoods are computed in an area where the object is expected to appear according to the state vector. Given the camera calibration, the state vector contains the position and velocity on the ground plane of each foreground object. This is a natural extension of the previous 2D method presented in Chapter 2. The next section introduces the tracking algorithm in more detail, and Section 3.3.2 shows how the initial segmentation is further refined iteratively by a pixel-wise modification of the prior probabilities of occluded objects.
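As a minimal sketch of Equations 3.1 and 3.2, the per-pixel classification can be written as a product of likelihood and prior followed by a selection of the strongest object. The data layout (one probability image per object label, stored as nested lists) and the function names are assumptions made for this illustration, and Equation 3.2 is read as choosing the object that attains the maximum posterior.

```python
from typing import Dict, List

Image = List[List[float]]


def classify_pixel(likelihoods: Dict[str, float], priors: Dict[str, float]) -> str:
    """Return the label with the largest unnormalised posterior (Eq. 3.1 and 3.2)."""
    posteriors = {label: likelihoods[label] * priors[label] for label in likelihoods}
    return max(posteriors, key=posteriors.get)


def segment(likelihood_images: Dict[str, Image], prior_images: Dict[str, Image]) -> List[List[str]]:
    """Label every pixel with the object of maximum posterior probability."""
    labels = list(likelihood_images)
    rows = len(likelihood_images[labels[0]])
    cols = len(likelihood_images[labels[0]][0])
    return [
        [
            classify_pixel(
                {label: likelihood_images[label][r][c] for label in labels},
                {label: prior_images[label][r][c] for label in labels},
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

With identical priors for all objects, as in the initial segmentation, the result depends on the appearance likelihoods alone; changing the priors at individual pixels can flip the label there.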
3.3.1 Tracking Algorithm
The tracker executes the steps shown in Table 3.1 for every frame. It contains several new steps in contrast to the old algorithm in Table 2.1. First, positions of known foreground objects are predicted and sorted according to their distance to the camera. In a second step, the color models are applied to a small region around the predicted object position (small for computational reasons) to compute the appearance probabilities for each object. An initial segmentation then assigns every pixel to the object with the highest appearance probability. The fourth step finds the new object positions (4.1) and refines the segmentation given the new object position (4.2). This is described in more detail in Section 3.3.2. The fifth step searches for new objects and deletes unseen ones before all appearance models are updated in step 6.
In comparison to the old algorithm, the major difference is the replacement of the connected component algorithm with a maximum window search using a fixed object size given by a camera calibration. This change allows us to better handle occlusion by the new iterative object placing as well as to distinguish between different object classes by a novel object type classification. These two novelties are further discussed in the next two sections.

Figure 3.1: Tracking framework: The maximum probability of the individual appearance models results in an initial segmentation using Bayesian per-pixel classification. The white pixels in the segmentation refer to the generic 'new object model' described in Section 3.3.3. (Panels: original image; appearance models for the background, 2nd object and 3rd object; initial segmentation.)
1. predict and sort new object positions
2. compute appearance likelihood
3. initial segmentation of image by max probability
4. iterative object placement and segmentation image refining, from close to far objects
   4.1 find object's new position in the image, by max window search
   4.2 remove outlier pixels, boost inlier pixels and refine segmentation
5. detecting new objects and removing invisible ones
6. update all appearance models
Table 3.1: 2.5D tracking algorithm
3.3.2 Iterative Object Placement and Segmentation Refinement
The novel approach for finding new object positions is based on an initial segmentation according to the maximum posterior probability (Equation 3.2). It places objects iteratively from close to distant objects on the ground plane in world coordinates. Figure 3.2 visualizes this process. For each object the following two processing steps are applied: first, the object's position is found, and secondly, the segmentation is refined.
The new object position is found by applying a maximum window search on the 2D segmentation image. The bounding box width and height at the predicted object location are used to search over the whole segmentation image. The location where the shifted bounding box encloses the most object pixels is the result of the maximum window search. It is at the same time the new object position.
During the refining step some pixels of the segmentation image are modified. In Equation 3.1 the prior probabilities of pixels outside of the new bounding box are decreased, while those of pixels inside are increased. Equation 3.2 is then re-evaluated given the changed probabilities, resulting in a refined segmentation. Pixels outside the box that were initially labeled as the object are assigned to the object with the next highest probability. Furthermore, pixels inside the bounding box of the new object position which are not yet segmented as the object obtain a higher probability. The overall quality of the segmentation is improved, especially during occlusions.

Figure 3.2: The segmentation is refined while iteratively searching for the exact object position from close to distant objects. (Diagram labels: object placement; segmentation refining; objects sorted by their distance to the camera.)
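The two steps of this section can be sketched as follows. The sketch assumes a binary mask marking the pixels currently labeled as the object, uses a summed-area table to count pixels per window (an implementation choice made here for efficiency), and models the refinement as a simple scaling of the prior. The `boost` and `damp` factors are hypothetical values, not parameters taken from the tracker.

```python
from typing import List, Tuple


def max_window_search(mask: List[List[int]], box_h: int, box_w: int) -> Tuple[int, int, int]:
    """Return (top, left, count) of the box_h x box_w window covering the most 1-pixels."""
    rows, cols = len(mask), len(mask[0])
    # Summed-area table with a zero border: sat[r][c] is the sum of mask[:r][:c].
    sat = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            sat[r + 1][c + 1] = mask[r][c] + sat[r][c + 1] + sat[r + 1][c] - sat[r][c]
    best = (0, 0, -1)  # count stays -1 if the box does not fit into the image
    for top in range(rows - box_h + 1):
        for left in range(cols - box_w + 1):
            count = (sat[top + box_h][left + box_w] - sat[top][left + box_w]
                     - sat[top + box_h][left] + sat[top][left])
            if count > best[2]:
                best = (top, left, count)
    return best


def refine_prior(prior: List[List[float]], top: int, left: int, box_h: int, box_w: int,
                 boost: float = 2.0, damp: float = 0.5) -> List[List[float]]:
    """Scale the object's prior up inside the placed box and down outside it."""
    return [
        [p * (boost if top <= r < top + box_h and left <= c < left + box_w else damp)
         for c, p in enumerate(row)]
        for r, row in enumerate(prior)
    ]
```

Because Equation 3.1 only defines the posterior up to proportionality, the scaled priors are used here as unnormalised weights; re-evaluating the maximum in Equation 3.2 with these weights yields the refined labels.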
3.3.3 New Object Detection
The tracker creates new foreground objects based on a generic 'new object model' $\mathcal{N}$. This special appearance model has a uniform, low probability $p_{\mathcal{N}}$. Thus, when the posterior probabilities of the background and all known foreground objects drop below $p_{\mathcal{N}}$, the pixel is assigned to $\mathcal{N}$, indicating a new object or a badly modeled foreground object. By exploiting the camera calibration, only pixels of $\mathcal{N}$ are projected onto the ground plane to vote for a new object position. Votes are generated as follows:
• One vote for every pixel
• Pixel votes are summed in a column of successive pixels of $\mathcal{N}$, giving the highest weight to the bottom pixel
With this voting scheme we accumulate evidence about new object positions on the ground plane. In conjunction with the camera calibration, it allows us to map
2D image blobs to 3D world coordinates without a connected component algorithm. Furthermore, it makes it possible, given a rough segmentation, to guess the depth of objects from a monocular camera. Limitations are encountered for crowds, where the separation of individual objects on the ground plane might fail.
When the votes on the ground plane are determined, a maximum window search is performed on the ground plane with the size of the expected object classes. As shown in Section 3.4, the method was successfully tested in distinguishing people and cars for different datasets. In this scenario, first a search with the maximum window size for the larger car is performed. Afterwards, we search for pedestrians with a smaller window on the remaining ground plane votes. The accumulation of ground plane votes for a given object position and window size forms a score which is compared to the initialization threshold of the object class. Furthermore, the new object position is checked for a possible overlap with current objects, which would prevent the initialization of a new object at the same position. Additional boundary conditions prevent objects from being initialized at image borders, where a correct object class distinction might not be possible.
The new object's position on the ground plane and its object class directly determine the size of the bounding box. A fixed real-world height and width are assumed for all objects of the same class. All $\mathcal{N}$ pixels inside the box are removed from the segmentation image and their votes are removed from the ground plane. These pixels are then used to initialize the appearance model of the new object. The maximum window search on the ground plane is repeated until no more new objects of a certain class can be found. Then the search for the next smaller object class starts.
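The voting scheme can be sketched as below. This is one possible reading of the two voting rules: every vertical run of new-object pixels in an image column is collapsed into a single vote at its bottom pixel, weighted by the run length, and that bottom pixel is then mapped onto a ground-plane grid. The `image_to_ground` callable stands in for the camera calibration and, like the cell size, is a hypothetical placeholder.

```python
from typing import Callable, Dict, List, Tuple


def column_votes(new_mask: List[List[int]]) -> Dict[Tuple[int, int], int]:
    """Collapse each vertical run of new-object pixels into one vote at its bottom pixel.

    Returns {(row, col) of the bottom pixel: number of pixels in the run}.
    """
    rows, cols = len(new_mask), len(new_mask[0])
    votes: Dict[Tuple[int, int], int] = {}
    for c in range(cols):
        run = 0
        for r in range(rows):
            if new_mask[r][c]:
                run += 1
                # The run ends at the image bottom or where the next pixel is not set.
                if r == rows - 1 or not new_mask[r + 1][c]:
                    votes[(r, c)] = run
                    run = 0
    return votes


def accumulate_on_ground(votes: Dict[Tuple[int, int], int],
                         image_to_ground: Callable[[int, int], Tuple[float, float]],
                         cell_size: float = 0.5) -> Dict[Tuple[int, int], int]:
    """Sum the vote weights per ground-plane grid cell (cell_size in world units)."""
    grid: Dict[Tuple[int, int], int] = {}
    for (r, c), weight in votes.items():
        x, y = image_to_ground(r, c)
        cell = (int(x // cell_size), int(y // cell_size))
        grid[cell] = grid.get(cell, 0) + weight
    return grid
```

A maximum window search over the resulting grid, run first with the footprint of the largest object class and then with smaller ones, would give the candidate positions whose scores are compared against the class thresholds.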
3.3.4 Tracking Models
This section briefly describes the different models used to compute the probabilities of the appearance models, as well as the dynamic model for the motion prediction. They are similar to the models used in our previous method.
Color Model: All our appearance models use variations of Gaussian mixtures in RGB color space as described in Section 2.3.1. Stauffer and Grimson [69] have proposed this popular choice for modeling scene backgrounds with time-adaptive per-pixel mixtures of Gaussians (TAPPMOGs). However, modifications to this approach were applied in order to fit into the multi-model approach. The Gaussian mixture is split among the background and foreground models, as described below.
Appearance Background Model: In contrast to Stauffer's algorithm, which combines foreground and background in one model, only one single Gaussian for each pixel of the background is used. Section 2.3.2 gives more detail about this model, which is initialized at start-up with a clean background.
Appearance Foreground Model: The appearance model is implemented as a 'sliced object model' described in Section 2.3.4. It divides the object into a fixed number of horizontal slices of equal height. For each slice, color models with multiple Gaussians are generated using EM, representing the most important colors for that part of an object. Each object class has a specific height and width.
Dynamic Model: The movement of each foreground object is predicted individually. A linear Kalman filter models the movement on the ground plane in world coordinates.
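To illustrate the dynamic model, the following sketch implements a linear Kalman filter with a constant-velocity motion model, run independently for the two ground-plane axes. The constant-velocity assumption, the decoupling of the axes, the diagonal process noise and all noise values are assumptions made for this example; the text only states that a linear Kalman filter is used.

```python
from dataclasses import dataclass


@dataclass
class AxisKalman:
    """Constant-velocity Kalman filter for one axis (state: position, velocity)."""
    pos: float
    vel: float = 0.0
    p_pp: float = 1.0    # covariance matrix [[p_pp, p_pv], [p_pv, p_vv]]
    p_pv: float = 0.0
    p_vv: float = 1.0
    q_pos: float = 0.01  # hypothetical process noise (diagonal)
    q_vel: float = 0.01
    r_meas: float = 0.25  # hypothetical measurement noise variance

    def predict(self, dt: float) -> float:
        """Propagate state and covariance by dt; return the predicted position."""
        self.pos += self.vel * dt
        a, b, d = self.p_pp, self.p_pv, self.p_vv
        self.p_pp = a + 2.0 * dt * b + dt * dt * d + self.q_pos
        self.p_pv = b + dt * d
        self.p_vv = d + self.q_vel
        return self.pos

    def update(self, measured_pos: float) -> None:
        """Correct the state with a position measurement (H = [1, 0])."""
        innovation = measured_pos - self.pos
        s = self.p_pp + self.r_meas
        k_pos, k_vel = self.p_pp / s, self.p_pv / s
        self.pos += k_pos * innovation
        self.vel += k_vel * innovation
        a, b, d = self.p_pp, self.p_pv, self.p_vv
        self.p_pp = (1.0 - k_pos) * a
        self.p_pv = (1.0 - k_pos) * b
        self.p_vv = d - k_vel * b


class GroundTrack:
    """Position and velocity of one object on the ground plane (world units)."""

    def __init__(self, x: float, y: float) -> None:
        self.fx, self.fy = AxisKalman(x), AxisKalman(y)

    def predict(self, dt: float):
        return self.fx.predict(dt), self.fy.predict(dt)

    def update(self, x: float, y: float) -> None:
        self.fx.update(x)
        self.fy.update(y)
```

In a per-frame loop, `predict` would supply the expected object location for the window search and `update` would be called with the position found in the refined segmentation.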
3.3.5 Ground Plane Assumption
In addition to the previous method [60], an extrinsic camera calibration [72] was added. In conjunction with a ground plane assumption, object movements are restricted, and predictions are limited to the ground plane and made in world coordinates rather than 2D image coordinates. An example of the ground plane assumption is shown in Figure 3.3. Furthermore, objects are assumed to have a fixed 3D size. The width and height of a human are approximated with a fixed-size cylinder, resulting in hard constraints for the size of the bounding box. While this is a simplification, the system is still able to handle varying human shapes. The fixed object size, in combination with the restriction to ground plane movements, significantly improved the tracking as well as the localization of the objects in world coordinates, especially under occlusion.
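One common way to express the mapping between image points and a calibrated ground plane is a 3x3 plane-to-plane homography. The sketch below assumes such a matrix `H` is available; it is an illustrative stand-in and not the calibration procedure of [72]. It also shows how ground positions could be ordered by their distance to the camera, as needed for the close-to-far processing order.

```python
import math
from typing import List, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Point = Tuple[float, float]


def image_to_ground(H: Matrix3, u: float, v: float) -> Point:
    """Map an image point (u, v), e.g. an object's foot point, to ground coordinates."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("image point lies on the horizon and has no ground position")
    return x / w, y / w


def sort_by_camera_distance(positions: List[Point], camera_xy: Point) -> List[Point]:
    """Order ground positions from close to far relative to the camera's ground position."""
    return sorted(
        positions,
        key=lambda p: math.hypot(p[0] - camera_xy[0], p[1] - camera_xy[1]),
    )
```

A homography only covers points on the plane itself; deriving the image size of a fixed-size 3D object at a given ground position would additionally require the full camera projection.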
Figure 3.3: Ground plane calibration
3.4 Extended 2.5D Tracking Results and Discussion
This section presents qualitative results of the tracking method on multiple public datasets as well as a prototype system. First, results of the object classification and tracking on a challenging sequence of a pedestrian crossing (Central square) with cars and pedestrians are discussed. Due to the complexity of the sequence, the results are briefly compared with detection-based tracking approaches. Then, results on the CVC outdoor dataset and experiments made with two HERMES demonstrator systems are shown.
3.4.1 Central Sequence
The Central pedestrian crossing sequence was recorded with a public web-cam at 15fps, 320 x 240 pixels resolution, and contains severe MPEG compression artifacts. Both Leibe et al. [45] and Breitenstein et al. [9] presented results with different detector-based tracking methods on the same publicly available dataset [30]. This discussion is used to briefly mention the major differences in tracking performance between the segmentation-based method and
the detection-based approaches. The major challenge for the presented tracker is the correct classification between pedestrians and cars. Furthermore, up to a dozen interacting objects lead to crowded situations and multiple occlusions, which is a very challenging situation for any tracker.
Figure 3.5 shows the sequence overlaid with the results of the tracker. The refined segmentation is displayed in Figure 3.4. During the first 800 frames all pedestrians and cars are correctly classified while entering the scene. Cyclists and bikers (Figure 3.5(a)) are classified as pedestrians and all trajectories are correct. The object height is divided into five equal slices to learn the color models of the foreground objects. Despite specularities and reflections of cars, the color models are able to constantly adapt to the changing appearance due to the EM learning.
In the following frames, between about frame 1400 (Figure 3.5(b)) and frame 1600 (Figure 3.5(d)), the number of cars and pedestrians constantly increases. About 10 pedestrians cross the street from both sides while several cars are waiting. During this part of the sequence the tracker loses track of several of the pedestrians during the severe occlusion in the center of the pedestrian crossing. However, due to the object detection described in Section 3.3.3, missed objects are detected soon after the occlusion phase.
In frame 1795 of Figure 3.5(e), two people with a suitcase are mistakenly identified as a car due to their wider footprint on the ground plane.
In comparison, the non-real-time pedestrian-detector approaches [45][9] showed more complete tracks of the pedestrians in crowded situations. Their tracking is more robust in this situation than the real-time approach. However, no car detectors were used and cars were therefore not tracked. The presented method detects and tracks nearly all objects including cars in real-time with a high accuracy in less crowded situations.
3.4.2 HERMES Outdoor Sequence
This section qualitatively reports the tracking results and quantitatively measures the computation costs of our tracker for the CVC outdoor sequence. In Section 4.5.3 the same sequence will be used for a quantitative evaluation of multiple tracking methods.
Figure 3.4: Segmentation from the Central square sequence. (a) Camera view. (b) Segmentation. The different sizes of the bounding boxes visualize the different object classes. Unique object IDs are shown in the top left corner. Black pixels = background, white pixels = unassigned pixels $\mathcal{N}$, colored pixels = individual objects.

Figure 3.6 shows the tracking results of this sequence. The tracker flawlessly tracks cars and pedestrians throughout the whole sequence, with the exception of the parked car, which causes a wrong object detection when driving away. Initially, the parked car is part of the background model, and both the reappearing street and the car each cause a foreground object.
Furthermore, the computational performance of the tracker was analyzed with respect to the number of tracked objects. Figure 3.7 plots the computation time of the proposed tracker in milliseconds as well as the number of objects tracked for the CVC outdoor sequence. The performance was measured on a 2.13 GHz CPU with a video resolution of 320 x 240. The time varies between 8 and 38 milliseconds and scales with the number of foreground pixels in the scene. The two peaks around frames 860 and 1350 directly indicate the presence of the larger cars, while the number of pedestrians has a much lower impact on the computational cost. For a single frame, most time is spent on the computation of the pixel probabilities as well as the segmentation.
Figure 3.5: Tracking results from the Central square sequence. Objects are visualized by their bounding box and unique ID in the top left corner. (a) Frame 454: Cyclist. (b) Frame 1426. (c) Frame 1466. (d) Frame 1581. (e) Frame 1795.

Figure 3.6: Tracking results from the HERMES outdoor sequence. (a) Tracking results. (b) Segmentation.

Figure 3.7: Computational effort: The blue curve above shows the computation time in milliseconds per frame. Below in red, the number of tracked objects is given. (Horizontal axis: frame number.)

3.5 Application: HERMES Demonstrator
The presented tracking method was integrated into a distributed multi-camera system for multi-resolution surveillance [6] as part of the European research
project HERMES [49]. This section presents tracking experiments as well as a brief general introduction of the international research project.
The aim of the prototype system, described in the publication of Bellotto et al. [6], is to interconnect a set of distributed static and pan-tilt-zoom (PTZ) cameras and visual tracking algorithms, together with a central supervisor unit. Each camera (and possibly pan-tilt device) has a dedicated process and processor. The 2.5D tracking approach presented in this chapter is used to analyze the static camera images in real-time. Asynchronous inter-process communication and archiving of data are achieved in a simple and effective way via a central repository, implemented using an SQL database. Visual tracking results from the tracker are stored dynamically into tables in the database via client calls to the SQL server. A supervisor process running on the SQL server determines if active zoom cameras should be dispatched to observe a particular target, and this message is effected via writing demands into another database table.
Experiments demonstrate the effectiveness of our approach on two multi-camera systems for intelligent surveillance. Results are presented from an implementation of the system comprising a static camera monitoring the environment under consideration and a PTZ camera under closed-loop velocity control. A complete system contains in its minimal configuration three dedicated computers: one for the static camera, one for the PTZ camera and one supervisory computer running the database and control processes. Figure 3.8 gives an overview of the prototype. Next, the results of these systems with an emphasis on our real-time tracker for the static camera are discussed.
A first prototype system was built for an indoor scenario, where the camera is mounted to the ceiling of the room as shown in Figure 3.9. Major challenges for the algorithm are the severe radial distortion, as well as the low ceiling height. Both non-idealities violate some implicit assumptions made for the 2.5D interpretation of the monocular view and have the following effects:
• The radial distortion slightly rotates people's appearance in the image corners. The object classification based on the footprint of objects is therefore less accurate, especially in the areas of the lower right and lower left image corners. The assumption of a vertical object appearance is violated for the pixel-wise down-projection. For this setup, the tracking performance is still sufficient as only one object class was present.
However, a more severe radial distortion probably needs to be corrected prior to tracking for a reliable object classification.
• Due to the low ceiling, the area to the back of the room has a small parallax to the mounting of the camera. The view is optically almost at the same height as a person's head and does not overlook this area. Therefore, the position of a person standing in this area cannot be determined accurately. Regardless of the exact distance from the camera, a person's head lies on the same epipolar line. The bounding box often appears too large, assuming the person to be closer to the camera than in reality.

Figure 3.8: HERMES distributed multi-camera system sketch. It shows the supervisor computer with SQL database on top, the static camera tracker on the bottom left, and the active camera view on the bottom right. (Panel labels: supervisor module (integration, control); static camera view; active camera view.)

The second prototype system is an outdoor scenario built at the CVC center in Barcelona. It consists of one static and one PTZ camera, similar to the first setup. However, the camera type, location, and larger view lead to new and
interesting situations. Figure 3.10 shows the view of the static camera. In particular the uncontrolled environment, such as the lighting, the number and type of objects as well as the general action happening in front of the camera, demands increased robustness of the tracking algorithm.

Figure 3.9: HERMES indoor demonstrator in Oxford. Static camera view and tracking results. (a) Two people tracking. (b) Segmentation. (c) Radial distortion rotates the person in the lower left corner. (d) Person on the far back is assumed closer than he really is due to geometrical limitations.
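The repository mechanism described in this section, where the tracker writes results into one table and the supervisor writes camera demands into another, can be sketched with an in-memory SQLite database. The table names, columns and the dispatch rule (one demand per newly seen pedestrian) are hypothetical and only illustrate the asynchronous communication pattern; they are not the schema or policy of the HERMES system.

```python
import sqlite3
from typing import List, Optional


def open_repository() -> sqlite3.Connection:
    """Create the two hypothetical tables of the central repository."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE tracks (
            frame INTEGER, object_id INTEGER, x REAL, y REAL, object_class TEXT);
        CREATE TABLE ptz_demands (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            object_id INTEGER,
            status TEXT DEFAULT 'pending');
    """)
    return db


def store_track(db: sqlite3.Connection, frame: int, object_id: int,
                x: float, y: float, object_class: str) -> None:
    """Tracker side: archive one tracking result."""
    db.execute("INSERT INTO tracks VALUES (?, ?, ?, ?, ?)",
               (frame, object_id, x, y, object_class))
    db.commit()


def supervise(db: sqlite3.Connection, frame: int) -> List[int]:
    """Supervisor side: demand a close-up for each pedestrian without a demand yet."""
    rows = db.execute(
        "SELECT object_id FROM tracks WHERE frame = ? AND object_class = 'pedestrian' "
        "AND object_id NOT IN (SELECT object_id FROM ptz_demands)", (frame,)).fetchall()
    for (object_id,) in rows:
        db.execute("INSERT INTO ptz_demands (object_id) VALUES (?)", (object_id,))
    db.commit()
    return [row[0] for row in rows]


def next_demand(db: sqlite3.Connection) -> Optional[int]:
    """PTZ camera side: fetch the oldest pending demand and mark it as taken."""
    row = db.execute("SELECT id, object_id FROM ptz_demands "
                     "WHERE status = 'pending' ORDER BY id LIMIT 1").fetchone()
    if row is None:
        return None
    db.execute("UPDATE ptz_demands SET status = 'taken' WHERE id = ?", (row[0],))
    db.commit()
    return row[1]
```

Since each process only reads and writes tables, the tracker, supervisor and camera controller never have to call each other directly, which matches the asynchronous design described above.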
Figure 3.10: New CVC Demonstrator, static camera view

3.6 Summary and Conclusion
In this chapter we introduced a novel 2.5D tracking framework for real-time multi-object tracking. The method is tested on multiple challenging datasets and an on-line prototype system. Different object classes are recognized by their distinctive footprints by accumulating evidence on the calibrated ground plane. An iterative object placement process improves the segmentation and tracking, especially during inter-object occlusions. The method improves in several aspects over our previous method [60]:
• Occlusion handling could be improved by modeling the occlusion with our probabilistic framework and refining the segmentation iteratively from close to distant objects.
• Object sizes are stable and restricted to known object classes, which we can distinguish based on the footprint of the object. They are no longer dependent on the connected component analysis of blobs.
• The addition of environmental modeling in the form of a ground plane assumption helped the tracking stability, making object classification possible. Furthermore, it also models the expected object sizes and is therefore a crucial element of the system. An accurate calibration, however, is essential for good performance.
• The method is not suitable for setups violating the ground plane assumption. Furthermore, a sufficient overview of the scene is needed to accurately determine the distance of objects.
The capabilities and limitations of the 2.5D real-time tracker were shown on multiple challenging datasets. A quantitative evaluation follows in the next chapter.
4 Event-based Tracking Evaluation
This chapter introduces a tracking evaluation metric and extends the results of the two previous chapters with an evaluation of the presented trackers.
4.1 Introduction
Performance evaluation of multi-object trackers for surveillance has recently received significant attention as a tool for several important applications other than functional testing:
• MEASURING IMPROVEMENTS during development, experiments and parametrization of a tracking algorithm.
• SCIENTIFIC JUSTIFICATION and measuring progress for different tracking methods, such as the ones presented in Chapters 2 and 3.
• BENCHMARKING WITH COMPETITORS in the research community or as part of a tracking evaluation programme.
• APPLICATION DEVELOPMENT where complementary algorithms have to be benchmarked in order to determine the one which best fulfills the requirements.
• COMMERCIAL AND LEGAL PURPOSES such as the standardization and advertisement of products for activity recording and alert detection.
Evaluation can be equated with a comparison of theoretical expectations, known as ground truth, to actual results. Both the comparison measure and the ground truth are human interpretations. The problem of defining a tracking evaluation metric therefore requires techniques to reduce the human influence in this process, while at the same time finding a simple score which resembles the overall human judgement.
Due to the human interpretation, evaluation metrics for surveillance are almost as numerous as multi-object tracking methods themselves [78; 54]. They mostly address specific issues of the involved algorithms and the application in mind. As such, evaluation methods and scores can rarely be compared. Often they do not directly tie in with the overall semantic interpretation of the scene that users would be most interested in. As an example, assessments of pixel-precise target detections are relevant for the evaluation of sub-components like figure-ground segmentation, but fall short of determining whether a system can make sense of what is going on in the scene. The interpretation of scores in terms of what they convey about the correct or incorrect analysis of actions is often difficult. As a result, human visual inspection is still needed to compare and estimate general performance and limitations. Supplementary material as part of a paper submission is highly appreciated by reviewers to get a better assessment.
This novel tracking performance evaluation method limits the human influence on the pixel, frame or trajectory level. Instead, the metric is motivated by the fact that humans conceptualize the world in terms of events and objects, and the metric aims to imitate this behavior by evaluating tracking performance on such higher, conceptual levels. Instead of comparing trackers and ground truth data directly on a low semantic level, different types of higher level events, such as entering the scene, occlusion or picking up a bag, are extracted from the available data. The metric then focuses on the completeness of such event detections to do the evaluation.
4.1. INTRODUCTION
The proposed method comprises the following advantages:
• The lengths of trajectories do not influence the metric, making it independent of the frame rate and density of the ground truth labeling.
• The type of events taken into account for the final metric can be fine tuned for different application scenarios.
• Easy integration into higher level event and object detection frameworks.
• The metric directly helps to improve tracking algorithms by identifying:
  - difficult trajectories
  - difficult scene locations
  - difficult situations
  - difficult event types
• Fast generation of ground-truth data as not every frame needs to be annotated in full detail, as long as the events can be reliably extracted from sparse annotation.
• Reuse of already available ground truth data by automatic conversion into our novel event-based representation.
• Minimizing the human factor within the ground truth data and its influence onto the metric by means of event-based evaluation on a higher level.
• A precise distance measurement between objects in real-world coordinates eliminates the need to define unreliable 2D bounding box distances.
• Aims at minimizing the need for human visual inspection of results, allowing faster testing of new algorithms or longer sequences.
• Establishing a lowest-common denominator to represent tracking data which is versatile to handle many different output formats.

The rest of this chapter is organized as follows: Section 4.2 discusses previous work in the domain of tracking evaluation. In Section 4.3 the novel event-based metric is defined; Section 4.4 introduces multiple case studies, trackers and the events used on selected public datasets and ground truth data. Section 4.5 shows the results of the evaluation while Section 4.6 discusses our findings and draws a conclusion.
4.2 Previous Work
In this section different evaluation methods are classified into four semantic levels. Then the most closely related publications are discussed. Finally, a separate subsection gives an overview of evaluation programmes.

In general, the evaluation of tracking methods is driven by the human interpretation of a particular sequence with respect to a specific application of interest. In such an evaluation, the sequence itself contains and defines the difficulties posed to the algorithm. To make these difficulties measurable, a human annotator has to label the sequence manually, resulting in a subjective interpretation of a scene. The task of evaluating a tracking method can then be described as the comparison and scoring of the computed results against the human interpretation.

Related work in the field of performance evaluation for object tracking can roughly be classified into four classes, each addressing one or more of the following semantic levels:
• pixel-level [1; 56; 82]
• frame-level [3; 1]
• object trajectory level [3; 78; 66; 11; 81]
• behaviors or higher level events [15; 11]

Desurmont et al. [15], whose work is most closely related to this one, presented a general performance evaluation metric for frequent 'high level' events where they use dynamic programming to specifically address the problem of automatic re-alignment between results and ground truth. Their definition of an event is however very much limited to the detection of blobs crossing predefined lines in order to count people passing by. The same limitation applies also to the use of dynamic programming, whereas the method in this thesis integrates additional location information into the matching problem.

Bashir and Porikli [3] presented a set of unbiased metrics on the frame and object level which leaves the final evaluation to the community. However, the total number of 48 different metrics makes the interpretation difficult.

Aguilera et al. [1] presented an evaluation method on the frame and pixel level, all based on segmentation. The pixel-level metric is currently used for the online service called PETS metrics [82].
Wu et al. [78] evaluate their body part detector-based tracker using five criteria on the trajectory level which cover most of the typical errors they observed. Furthermore, occlusion events were separately evaluated, defining short and long scene or object occlusions. The metric then gives the number of successfully handled occlusions against all occlusions of a certain category by dataset.

Cavallaro and Ziliani [11] propose a benchmarking protocol where they distinguish between an algorithmic and an application-dependent evaluation. The first set of four measurements purely addresses the robustness, accuracy, stability and computational complexity of an algorithm without taking the object detection problem into account. The second benchmark computes a final reliability score based on the detection of high level events or behaviors. These events are manually defined in the context of a specific application. The paper names an example application of detecting abnormal behaviors of people in a subway station. The description of the score calculation, however, is incomplete. It is intended to combine the anticipation, delay, duration and location for three different importance classes of events. False positives and false negatives are supposed to be penalized, yet no matching algorithm is specified. Due to the absence of real experiments, the applicability of the proposed benchmark is not shown.

Yin et al. [81] propose a rich set of a dozen metrics to measure the performance. This set of metrics is applied to a whole sequence, giving scores for e.g. track fragmentation, ID changes or latencies.

Apart from the discussed evaluation methods, there exist additional testing methods for trackers which manipulate the input sequence in artificial ways to test the robustness of a tracker. Examples are the artificial corruption of the input video with noise, illumination changes, jitter or other effects. While such tests are relevant for real world applications, they do not measure the tracking capabilities directly. Instead they are a measure for the robustness of the image features, which is not directly relevant for a general tracking evaluation.
4.2.1 Tracking Evaluation Programs
The Performance Evaluation of Tracking and Surveillance (PETS) program started in 2000 with its first workshop. It is the longest running evaluation program with a total of ten workshops. The theme of the workshop changed
from target segmentation [82], detection and tracking to event level tasks in the past few workshops.

The Video Analysis and Content Extraction (VACE) program was established with the objective of developing novel algorithms and implementations for automatic video content extraction, multi-modal fusion and event understanding. Within this large multiphase programme the University of South Florida carried out a performance evaluation initiative under the guidance of the National Institute of Standards and Technology (NIST).

Manohar et al. [48] present a qualitative comparison of the VACE and the PETS programs in order to name possible synergies of an information exchange between the programs. Furthermore, a multi-step procedure is proposed to optimally choose test sequences and annotations and to define suitable evaluation metrics. These guidelines were then used to initiate the Classification of Events, Activities and Relationships (CLEAR) workshop. It brought together the VACE and CHIL programs in order to extend the corpus of test sequences for the community as well as to develop a widely accepted performance metric. In Kasturi et al. [36] the performance evaluation framework is described. Furthermore, the necessary infrastructure such as source video, task definitions, metrics, ground truth and scoring tools is provided as supplementary material to this publication on the Computer Society Digital Library.

The European project Computers in the Human Interaction Loop (CHIL) [27] explores and creates computing services which provide helpful assistance for human-to-human interactions by anticipating the state of the human activities and intentions.

The ETISEO project [31] aims at acquiring precise knowledge of vision algorithms by inviting multiple institutes to report on a general corpus of video sequences. Different metrics were proposed and a final evaluation is still in progress. Within ETISEO, Nghiem et al. [56] presented an interesting approach where they evaluate multiple trackers on isolated video processing problems of different difficulties.
4.3 Event-Based Tracking Metric
This section explains how the proposed event-based tracking metric is defined, generated and evaluated. Figure 4.1 shows an overview of the application of the event metric. Event information is extracted both from the ground truth data and from the tracker. The two resulting event lists can then be compared using the proposed event metric definition, resulting in a score that reflects how well the tracker is able to handle the specific sequence.
Figure 4.1: Evaluation scheme
4.3.1 Event Concept
The event is the basic building block of the evaluation metric. It represents the behaviour at a higher semantic level. A list of events therefore can describe the ongoing action in a scene which is similar to the conceptual level of humans and similar to human language. Each event describes an action conducted by an individual, similar to subjects and verbs in natural language.

Gerber and Nagel [20; 21] describe a system which extracts 'occurrences' from road traffic scenes. Those 'occurrences' have a broader meaning than the 'event', which is limited to actions without a duration. In this thesis an 'event' always describes instant actions happening at one particular point in time, unlike the 'occurrences', which distinguish explicitly between perpetuative, mutative and terminative actions. This simplification has several advantages for the evaluation task because continuous unchanged states are extraneous. Furthermore, the evaluation metric can be simplified if all events are instantaneous. Finally, the use of events does not prevent us from modeling perpetuative actions, as we can enclose them with starting and ending events as well as separating mutative evolving actions accordingly. Other definitions of 'events' can be found in the literature [54], but none of these definitions are general enough to suit our needs.

An event E is a 4-tuple and always consists of an event-type V, a point in time T, a location C and it is related to one object O. The event-type V identifies an action or interesting change in the scene as described in the next subsection. T is measured in seconds and fractions of seconds, computed from the frame rate and frame number when an event occurs in the sequence. The event location C is defined as the 3D base point in world coordinates of the associated object. Given the camera calibration and a ground plane assumption a 2D to 3D correspondence is given in most cases. Objects O are numbered with unique IDs.
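As an illustration of the 4-tuple E(V, T, C, O), the following minimal Python sketch shows one possible in-memory representation. The class name, the field names, the event-type label and the convention that frame 0 corresponds to 0.0 s are assumptions made for this example and are not taken from the thesis implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    """One instantaneous event E = (V, T, C, O)."""
    event_type: str                       # V: e.g. "entering_scene" (label is an assumption)
    time_s: float                         # T: seconds since the start of the sequence
    location: Tuple[float, float, float]  # C: 3D base point in world coordinates
    object_id: int                        # O: unique object ID


def frame_to_seconds(frame_number: int, frame_rate: float) -> float:
    """Derive T from frame number and frame rate (assumes frame 0 is at 0.0 s)."""
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    return frame_number / frame_rate


# Hypothetical example: object 4 enters the scene in frame 50 of a 25 fps sequence.
entering = Event("entering_scene", frame_to_seconds(50, 25.0), (3.2, 7.5, 0.0), 4)
```

Because T is stored in seconds rather than frames, events from sources with different frame rates can be placed in the same list.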
4.3.2 Event Types
Event types V are selected in such a way that they are relevant for the application, have higher level meaning, can unambiguously be identified from ground truth as well as from the tracker results, and are atomic. In order to simplify the final evaluation metric described in Section 4.3.4, each V is handled individually, which leads to the following examples of event types that are used for the case study in Section 4.4:
• Entering / Leaving the scene
• Starting / Ending an occlusion
• Entering / Leaving a specific area like a shop or a pedestrian crossing
Many more events could be considered for other applications, for example:
• Pointing at something / Being pointed at
• Starting / Ending walking, running, standing
• Picking up a bag

Actions with a certain duration such as the presence in the scene, movement attributes like running or an occlusion are split into a starting and ending event. Very short actions such as picking up something are single events. Interactions such as occlusions are handled implicitly by expecting the same event from each involved actor. In case of directed actions such as pointing, two different event-types are used. Furthermore, it is possible to define special areas in the scene such as a pedestrian crossing or waiting areas which can trigger events.

Within the scope of this evaluation metric, we do not further expand or group events into different hierarchical layers even though some events would have a semantic relationship. Within such a hierarchical event-logic, an 'entering' and a 'leaving' event for the same object could be grouped into an 'object was seen' event, for example. This would further require defining a distance measure between events in order to correctly evaluate partially correct event-structures. Defining semantic distances is very difficult, especially for the general task of multi-object tracking. However, for specific applications and well constrained tasks a hierarchical event-logic could indeed be defined as shown in [20; 21].
4.3.3 Event Generation
In order to generate events from either manually labeled data or continuous tracker output, only the four basic building blocks of an event E(V, T, C, O) need to be extracted from the data. This allows the comparison of different types of trackers with different styles of annotation data. However, individual conversion methods have to be used to generate the events depending on the type of the underlying data.

T is measured in seconds and O can directly be extracted from almost any data type. C is more difficult to extract as it most often needs further processing to find the 3D base point from 2D image coordinates of an object given the camera calibration. The most difficult part is the definition and extraction of events, which will be discussed in more detail.

Generally, event extraction rules can be defined with as much complexity as desired, using additional data such as camera calibrations or location maps, which are not necessarily required during annotation or tracking. Using the same formalism and categorization as Gerber and Nagel [20; 21], an event is defined by its pre-condition, which has to be satisfied before the event happens, and a post-condition, which holds once the event has happened. Section 4.4.4 gives concrete examples.

In order to reliably extract events, it is often important to take more information into account than just V, T, C, O. The additional information required to reliably define an event has to be chosen carefully in order to compare different trackers and sequences.
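The step from a 2D bounding box to the base point C can be sketched as follows. This is only an illustration under stated assumptions: the ground plane is taken as Z = 0, the camera calibration is assumed to be available as a 3x3 image-to-ground homography H, and the bottom-centre of the bounding box is used as the image base point with the image y axis pointing down. The function names and the toy matrix are hypothetical, and the thesis does not state that its calibration is applied in this form.

```python
from typing import Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]


def bbox_base_point(x: float, y: float, w: float, h: float) -> Tuple[float, float]:
    """Bottom-centre pixel of a box given as top-left corner plus width and height."""
    return (x + w / 2.0, y + h)


def image_to_ground(H: Matrix3, u: float, v: float) -> Tuple[float, float, float]:
    """Map pixel (u, v) to a world point on the ground plane Z = 0 via homography H."""
    X = H[0][0] * u + H[0][1] * v + H[0][2]
    Y = H[1][0] * u + H[1][1] * v + H[1][2]
    W = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(W) < 1e-12:
        raise ValueError("pixel maps to a point at infinity")
    return (X / W, Y / W, 0.0)


# Toy homography (an assumption): a pure scaling of 1 cm per pixel.
H_TOY = [[0.01, 0.0, 0.0],
         [0.0, 0.01, 0.0],
         [0.0, 0.0, 1.0]]

u, v = bbox_base_point(100.0, 50.0, 40.0, 150.0)
base_point = image_to_ground(H_TOY, u, v)
```

With a real calibration the homography would come from the camera parameters and the ground plane; the mapping is only meaningful for points that actually lie on the ground.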
4.3.4 Evaluation Metric
The evaluation metric is based on comparing the list of events extracted from ground truth data and the list of events extracted from the trajectories generated by the tracking algorithm. The evaluation can best be described as a pipeline of 3 steps where first dynamic programming is used to find the best matches between events of the same V from both lists. Then the different measurements are computed based on the matched events. Finally, the event matching is analyzed for each object individually in order to measure changing object identities. Due to the characteristic of the 4-tuple of each event, the metric can exploit either binary matches or continuous distances as shown in Table 4.1.

E   building blocks   binary match   cont. distance
V   event type        X              -
T   time              thresholded    X
C   location          thresholded    X
O   object ID         X              -
Table 4.1: Evaluation metric

As described in Section 4.3.2, there is no distance measurement between different event-types. Therefore, each event type V is matched and evaluated separately.
4.3.5 Evaluation Pipeline
In the first step of the evaluation pipeline the matching distance matrix is computed. It contains the distance between every tracker event and each ground truth event. Figure 4.2 shows parts of such a distance matrix. T and C are combined into one distance for the distance measurement between two events i, j as described in Equation 4.1.

\[ \mathrm{dist}_{ij} = \min\left( \alpha \, |T_i - T_j| + \| C_i - C_j \|,\ \mathrm{maxdist} \right) \tag{4.1} \]
where ||...|| is the Euclidean distance and α is a scaling factor that allows the combination of the different units of seconds and meters. The parameter maxdist is a maximal distance above which the match is considered to have failed. For the experiments, α is chosen in such a way that a difference of 5 seconds respectively 12 meters is equal to maxdist. These values are chosen with a generous margin above Tave and Cave as we do not want to penalize reasonable time and location deviations at this early stage in the evaluation pipeline.
Figure 4.2: Example of a distance matrix. It shows the distances between every ground truth event (column) versus every tracker event (row).

Finding the best configuration of matched events is an assignment problem. It consists of finding the smallest total distance in a weighted bipartite graph. There are different techniques to solve this general problem, such as the Hungarian algorithm [41]. It solves the problem in O(n^3) for an n x n matrix. A square matrix can be achieved by adding artificial events with maximum distance.

Given the sparseness as well as the continuous character of the event lists, dynamic programming techniques [5; 16] can be used to match events from the same event type. Dynamic programming gives the optimal match between ground truth and tracker events, as long as both lists have the same relative order and no two events are mixed up. It runs in O(n) time, which is much faster than the Hungarian algorithm.

Events are matched according to the time and location distance as shown in Figure 4.3. Matches below the maximal distance are counted as true positives (TP). Events on the ground truth list which have no corresponding event in the tracker list or which match with maxdist are counted as false negatives (FN). Events from the tracker which could not be matched to any ground truth event are counted as false positives (FP). This first evaluation measures how well the defined event types were handled by the tracker, regardless of a correct and continued identification of the subjects in the scene.
Figure 4.3: Event matching of same type (the figure shows the Ground Truth event list and the Tracker event list)
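The first pipeline step can be sketched in Python as shown below. This is a minimal illustration, not the thesis implementation: it uses a plain order-preserving alignment that fills a full table and therefore runs in O(n·m) rather than in the O(n) stated above, it assumes that an unmatched event costs maxdist (mirroring the artificial events used to pad the matrix), and it assumes distances in metres so that maxdist = 12 and α = 12/5. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One event of a single type V: (T in seconds, C as ground-plane (x, y) in metres).
TimedPoint = Tuple[float, Tuple[float, float]]

MAX_DIST = 12.0         # assumed: maxdist expressed in metres
ALPHA = MAX_DIST / 5.0  # assumed: a time offset of 5 s alone reaches maxdist


def event_distance(a: TimedPoint, b: TimedPoint) -> float:
    """Equation 4.1: scaled time gap plus Euclidean location gap, clipped at maxdist."""
    return min(ALPHA * abs(a[0] - b[0]) + math.dist(a[1], b[1]), MAX_DIST)


def match_events(gt: Sequence[TimedPoint], tr: Sequence[TimedPoint]) -> List[Tuple[int, int]]:
    """Order-preserving minimum-cost matching; returns (gt_index, tracker_index) pairs."""
    n, m = len(gt), len(tr)
    # cost[i][j]: cheapest alignment of gt[:i] with tr[:j]; leaving an event out costs MAX_DIST.
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + MAX_DIST
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + MAX_DIST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + event_distance(gt[i - 1], tr[j - 1]),
                             cost[i - 1][j] + MAX_DIST,
                             cost[i][j - 1] + MAX_DIST)
    # Backtrack; pairs that only reach maxdist are treated as failed matches.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        d = event_distance(gt[i - 1], tr[j - 1])
        if cost[i][j] == cost[i - 1][j - 1] + d:
            if d < MAX_DIST:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + MAX_DIST:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def count_tp_fn_fp(gt: Sequence[TimedPoint], tr: Sequence[TimedPoint]) -> Tuple[int, int, int]:
    """TP = matched pairs, FN = unmatched ground truth events, FP = unmatched tracker events."""
    tp = len(match_events(gt, tr))
    return tp, len(gt) - tp, len(tr) - tp


# Hypothetical lists of one event type: the middle tracker event is too far away to match.
gt_events = [(1.0, (0.0, 0.0)), (10.0, (5.0, 5.0)), (20.0, (9.0, 1.0))]
tr_events = [(1.2, (0.3, 0.4)), (14.0, (20.0, 20.0)), (20.5, (9.0, 1.0))]
```

In this example the first and last events are matched, while the middle pair only reaches maxdist and therefore contributes one FN and one FP.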
In a second step, the average time deviation Tave and the average location deviation Cave from ground truth are computed over all correct matches (TP) to measure the accuracy of the tracker. FP and FN are important to further identify time periods and locations where errors in the tracking occurred.

In a third step, all ground truth subjects with a successful TP Entering Scene event and their corresponding O are examined for completeness. For each such object, the number of correctly detected events (TP) is expressed as a percentage of all events of that object. This percentage is a measure for the tracking quality of a certain object. Furthermore, the total number Otot of differing tracker identities O that were matched to each ground truth object is counted in order to measure identity changes.

Events in the first and last frame of the ground truth are not evaluated. Tracker events which can be matched to those events are discarded. This prevents the evaluation of events which might have happened before the first frame and obviously could not be annotated correctly, such as people already present in the first frame.
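The second and third steps can be sketched as follows, assuming the true-positive pairs from the matching step are already available. The `Ev` record, its field names and the event-type labels are assumptions for this illustration; the exclusion of first-frame and last-frame events is not shown.

```python
import math
from collections import defaultdict
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Ev(NamedTuple):
    kind: str                 # V (label is an assumption)
    t: float                  # T in seconds
    loc: Tuple[float, float]  # C on the ground plane, in metres
    obj: int                  # O


Pair = Tuple[Ev, Ev]          # (ground truth event, matched tracker event)


def average_deviation(pairs: Sequence[Pair]) -> Tuple[float, float]:
    """Second step: mean time and location deviation (Tave, Cave) over all TP pairs."""
    if not pairs:
        return (0.0, 0.0)
    t_ave = sum(abs(g.t - r.t) for g, r in pairs) / len(pairs)
    c_ave = sum(math.dist(g.loc, r.loc) for g, r in pairs) / len(pairs)
    return (t_ave, c_ave)


def per_object_summary(gt: Sequence[Ev], pairs: Sequence[Pair]) -> Dict[int, Tuple[int, int, int]]:
    """Third step: per ground truth object (TP count, total events, Otot)."""
    # Only objects whose Entering Scene event was matched are examined.
    entered = {g.obj for g, _ in pairs if g.kind == "entering_scene"}
    total: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    ids = defaultdict(set)
    for g in gt:
        total[g.obj] += 1
    for g, r in pairs:
        hits[g.obj] += 1
        ids[g.obj].add(r.obj)
    return {o: (hits[o], total[o], len(ids[o])) for o in sorted(entered)}


# Hypothetical data: object 1 is found twice but under two tracker IDs; object 2 is missed.
g1e = Ev("entering_scene", 1.0, (0.0, 0.0), 1)
g1l = Ev("leaving_scene", 9.0, (5.0, 5.0), 1)
g2e = Ev("entering_scene", 3.0, (2.0, 2.0), 2)
r1e = Ev("entering_scene", 1.5, (0.3, 0.4), 7)
r1l = Ev("leaving_scene", 9.5, (5.0, 5.0), 8)
tp_pairs: List[Pair] = [(g1e, r1e), (g1l, r1l)]
t_ave, c_ave = average_deviation(tp_pairs)
summary = per_object_summary([g1e, g1l, g2e], tp_pairs)
```

An Otot value above one, as for object 1 here, indicates that the tracker changed the identity of the object during the sequence.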
4.4 Experiments
In order to verify the versatility of the novel evaluation metric we conduct different case studies where different tracking algorithms were applied to multiple well-known benchmarking data sets: CAVIAR, PETS2001 and HERMES. First, this section presents the three data sets and analyzes them in order to identify the relevant events. Secondly, the events are described using the notation introduced above. Finally, the different tracking algorithms used to test our evaluation scheme are introduced.

These case studies were conducted as a part of the research project HERMES [49] and published in [63; 61]. Within the research project, the event-based tracking evaluation metric was used to compare different algorithms in order to find the most suitable for an agent tracking prototype. HERMES aims at developing an artificial cognitive system allowing both recognition and description of a particular set of semantically meaningful human behaviors from videos. The developed system combines active and passive sensors. HERMES analyzes the scene at three resolution levels. Depending on distance, people's actions can be analyzed as moving blobs, as articulated body gestures, or based on facial expressions. This research project has set out to interpret and combine the knowledge inferred from those 3 different motion categories. An important aspect is the combination of low level vision tasks with higher level reasoning, as well as the evaluation of different visual systems which cover one or multiple semantic levels. The recognized behaviors will then be used for natural language text generation and visualization.
4.4.1 CAVIAR Data Set
From the numerous CAVIAR datasets [28] the short OneLeaveShopReenter1cor sequence was chosen to demonstrate the event extraction by evaluating six different event types, relevant for this sequence:
• Entering scene / Leaving scene
• Start occlusion / End occlusion
• Entering shop / Leaving shop
The ground truth events were extracted from the publicly available XML annotation data. The shop area shown in Figure 4.4 is the sole additional information for this evaluation. The base points of the objects in world coordinates are calculated with the camera calibration as given on the CAVIAR web-site [28].
4.4.2 PETS 2001 Data Set
In addition, the well known PETS2001 dataset [29] was used as it poses additional challenges for the metric. Fine tuning of the automatic event extraction was needed due to the special XML format and the different object types such as cars and people. Figure 4.5 illustrates the outdoor scene. For this evaluation the trackers were run for the first 1570 frames. Relevant events for this sequence are Entering scene, Leaving scene, Start occlusion and End occlusion, which were extracted automatically.
4.4.3 HERMES Data Set
Furthermore, the tracking evaluation metric was applied to the HERMES outdoor sequence [22]. For this sequence, tracking results from three trackers were compared for the following events: entering scene, leaving scene, entering pedestrian crossing and leaving pedestrian crossing. Table 4.8 shows the total number of detected events for each tracker as well as the ground truth. During the sequence, two bags are carried by different persons labeled in the human annotated ground truth but only tracked by one of the three methods. The area considered as the pedestrian crossing is shown in Figure 4.6(a), while Figure 4.10 shows the plotted events over time. The events were automatically extracted from the hand labeled ground truth as well as from the tracker results in various data formats including CAVIAR XML and other proprietary formats.
4.4.4 Event Description
Even though the event types are semantically clearly defined, this section briefly describes their actual implementation for the different data formats. Entering and Leaving scene events are found in those frames where an object is seen
Figure 4.4: CAVIAR OneLeaveShopReenter1cor sequence with hand-labeled bounding boxes and shop area.
Figure 4.5: PETS2001 DS1 sequence
(a) Pedestrian crossing area used to trigger events.
(b) Dropping first bag
(c) Picking-up second bag
Figure 4.6: HERMES sequence
the first and last time in the sequence. Due to common instabilities in size and localization during entering and leaving, a weighted average filter is applied to the object position. The object size given by the bounding box area is used as a weighting factor for C over 10 frames. Human annotations tend to contain single hands, arms or heads of persons while they appear or disappear, which would give very wrong base point estimates if not filtered. Start and End occlusion events are triggered as soon as two objects overlap. The initial start event, however, requires a significant overlap in order to filter out short sideswipes. Special area events such as Entering and Leaving Shop use an additional marked image area for which object movements into or out of the area trigger such events. Again, the number of events is filtered for each object to prevent event bursts due to tracking location inaccuracies.
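The two filters described above can be sketched as follows. This is an illustration under assumptions: the sample layout (x, y, bounding box area), the overlap measure (intersection area relative to the smaller box) and the threshold value are hypothetical choices, since the text only states that the bounding box area weights C over 10 frames and that the start event needs a significant overlap.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float, float]     # (x, y, bounding box area) for one frame
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2

WINDOW = 10            # number of frames averaged, as stated in the text
MIN_OVERLAP = 0.2      # assumed threshold for a "significant" overlap


def area_weighted_position(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Average the given positions, weighting each one by its bounding box area."""
    total = sum(area for _, _, area in samples)
    if total <= 0:
        raise ValueError("samples need a positive total area")
    x = sum(px * area for px, _, area in samples) / total
    y = sum(py * area for _, py, area in samples) / total
    return (x, y)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box (0.0 if disjoint)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (iw * ih) / smaller


def starts_occlusion(a: Box, b: Box) -> bool:
    """A start event is only triggered once the overlap is significant."""
    return overlap_ratio(a, b) >= MIN_OVERLAP


# Hypothetical track: a small first detection (e.g. only a hand) barely moves the result.
track = [(0.0, 0.0, 1.0), (4.0, 8.0, 3.0)]
entering_location = area_weighted_position(track[:WINDOW])   # first frames of the track
leaving_location = area_weighted_position(track[-WINDOW:])   # last frames of the track
```

Weighting by area means that partial detections at the image border, which are small, contribute little to the estimated entering or leaving location.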
4.4.5 Tracker 1
Figure 4.7: Hierarchical multiple-target tracking architecture

The architecture of the first tracker by Rowe et al. [64] is based on a modular and hierarchically-organized system (see Figure 4.7). A set of co-operating modules which follow both bottom-up and top-down paradigms is distributed through three levels. Each level is devoted to one of the main tasks to be performed: Target Detection, Low-Level Tracking (LLT), and High-Level Tracking (HLT). Since high-level analysis of motion is a critical issue, a principled management system is embedded to control the switching among different operation modes, namely motion-based tracking and appearance-based tracking. Additionally, the system monitors the interaction between different targets and can properly instantiate and remove trackers based on what happens in the scene. It copes with clutter distractors by selecting the most convenient colour-related features. For that purpose, a set of appearance models is continuously conformed, smoothed and updated. Thus, multiple targets are represented using several models for each of them, while they are simultaneously being tracked. The system works as a stand-alone application and is designed for offline processing of sequences without the need of real-time operation.

For the case study, tracking results were exchanged in a proprietary format containing ellipse target representations on a frame by frame basis. To further ease processing of this format, rectangular bounding boxes were computed around the ellipses prior to event extraction, which resulted in similar data over all trackers and ground truth.
4.4.6 Tracker 2a and 2b
Trackers 2a and 2b are the ones presented in Chapters 2 and 3 respectively. Both real-time trackers store their tracking results in the CAVIAR XML format, where targets are represented with rectangular bounding boxes for each object ID in every frame.
4.4.7 Tracker 3
The third tracker by Duizer and Hansen [17; 25] is a multi-view tracking system based on the planar homography of the ground plane. The foreground segmentation is performed using the codebook method for each view separately. Given an appropriate training, the codebook segmentation allows robust operation in a 24/7 situation capable of adapting to severe illumination changes. The tracking of objects is performed in each view using bounding box overlap, and occlusion situations are resolved by probabilistic appearance models. Figure 4.8 shows the combined segmentation results from two cameras in a virtual top-view. The corresponding tracks between views are combined based on two different methods. For smaller objects such as humans, the principal
axis method is extended to handle groups. For larger objects, such as vehicles, the footage region is used to find correspondence. Tracker 3 runs near real-time for a two camera setup. The presented results for this experiment were computed with two cameras out of the four available in the HERMES sequence. Their overlapping region, however, was restricted to the pedestrian crossing and some parts of the road, as can be seen in Figure 4.8(c), which requires single view tracking in several areas of the scene.
4.5 Case Study Results
This section presents results from the case study where the automatic event extraction is applied to different public datasets to evaluate and compare two different tracking algorithms. Finally, some effects of the metric are discussed.

To test the versatility of the method for different types of ground truth data, the event representation is extracted from a PETS 2001, a CAVIAR and a HERMES sequence. For this process the publicly available annotation data in different types of XML formats were processed. No additional human ground truth labeling was needed to extract the events automatically.

V                    ground truth   tracker 1   tracker 2a
Entering Scene       2              2           2
Leaving Scene        1              1           1
Starting Occlusion   2              2           2
Ending Occlusion     2              2           2
Entering Shop        1              1           1
Leaving Shop         1              1           1
Table 4.2: Detected events of the CAVIAR sequence
4.5.1 CAVIAR
Table 4.2 shows the total number of event detections for the OneLeaveShopReenter1cor CAVIAR sequence and Table 4.3 contains the evaluation metric applied to the two object trackers. As can be seen this sequence is tracked perfectly by
(a) Camera View 1
(b) Camera View 2
(c) Virtual top-view
Figure 4.8: Multi-camera tracker by Duizer and Hansen.
both trackers. The object-based evaluation is not shown here, as it does not contain more relevant information. Slight differences can be seen in the location accuracy Cave, where tracker 2a is less accurate than tracker 1 due to strong reflections on the floor, especially near the shop area. However, no significant time delay can be measured on either tracker.

V (Tracker 1)        TP   FN   FP   Tave    Cave
Entering Scene       2    0    0    0.18s   0.17m
Leaving Scene        1    0    0    0.16s   0.81m
Starting Occlusion   2    0    0    0.12s   0.34m
Ending Occlusion     2    0    0    0.16s   0.59m
Entering Shop        1    0    0    0.04s   0.47m
Leaving Shop         1    0    0    0.12s   0.30m

V (Tracker 2a)       TP   FN   FP   Tave    Cave
Entering Scene       2    0    0    0.32s   1.91m
Leaving Scene        1    0    0    0.08s   0.99m
Starting Occlusion   2    0    0    0.04s   0.12m
Ending Occlusion     2    0    0    0.04s   0.14m
Entering Shop        1    0    0    0.00s   0.81m
Leaving Shop         1    0    0    0.04s   1.44m
Table 4.3: Event-based evaluation of the CAVIAR sequence
4.5.2 PETS2001
Table 4.4 and Figure 4.9 show the total number of event detections for the PETS 2001 DS1 sequence. Tables 4.5, 4.6 and 4.7 give the evaluation metric applied to the two object trackers. It can clearly be seen that this sequence is more challenging, giving more false negatives (FN) and false positives (FP). The evaluation also shows that the offline tracker 1 performs better than the real-time tracker 2a. Entering and leaving objects are better handled due to the more sophisticated architecture combining bottom-up and top-down paradigms. A closer look into the failures of tracker 1 shows that most FN and FP are caused by tracking multiple objects as one single object instead of the annotated individuals in the ground truth, resulting in several missed occlusion events. Tables 4.6 and 4.7 show good results if only the ground truth objects which have an associated tracker object are considered. The analysis of tracker 2a shows a high number of FPs for entering and leaving scene events, which is caused by lost tracks during occlusion. This can directly be seen in Table 4.7 by the high number of Otot, which counts identity switches.

Furthermore, this experiment shows an interesting effect for the two occlusion events. Both trackers show a high number of FPs, sometimes outnumbering the true positives due to several lost tracks or single tracks representing multiple ground truth objects. Therefore, the measurement of starting and ending occlusion might not be a very significant evaluation measurement for the trackers on this sequence. The high number of events and the difficulty of detecting them precisely in the different data formats are indications that occlusion events should be used with caution.

V                    ground truth   tracker 1   tracker 2a
Entering Scene       10             11          18
Leaving Scene        3              5           15
Starting Occlusion   36             32          28
Ending Occlusion     38             28          26
Table 4.4: Detected events for the PETS sequence
Figure 4.9: Frames/Event plot for the PETS sequence (horizontal axis: frames; vertical axis: Entering Scene, Leaving Scene, Starting Occlusion, Ending Occlusion). Stars are ground truth events, squares are events from tracker 1 and diamonds are events from tracker 2a.
Tracker 1             TP   FN   FP   T_ave   L_ave
Entering Scene         9    1    2   1.28s   0.11m
Leaving Scene          3    0    2   0.02s   0.06m
Starting Occlusion    23   13    9   0.82s   0.28m
Ending Occlusion      20   18    8   0.43s   0.04m

Tracker 2a            TP   FN   FP   T_ave   L_ave
Entering Scene         7    3   11   1.8s    0.18m
Leaving Scene          3    0   12   0.7s    1.25m
Starting Occlusion    18   18   10   0.52s   0.29m
Ending Occlusion      16   22   10   1.00s   1.05m
Table 4.5: Event-based evaluation of the PETS sequence
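The counts in Tables 4.4 and 4.5 are linked by two identities: TP + FN equals the number of ground truth events, and TP + FP equals the number of events detected by a tracker. The following Python sketch checks both identities on the PETS numbers and additionally derives recall and precision per event type. Recall and precision are not reported in this chapter; they are added here only as an illustrative summary, and all names in the code are hypothetical.

```python
# Counts copied from Table 4.4 (totals) and Table 4.5 (TP, FN, FP).
EVENTS = ["Entering Scene", "Leaving Scene",
          "Starting Occlusion", "Ending Occlusion"]
GROUND_TRUTH = [10, 3, 36, 38]
DETECTED = {"tracker 1": [11, 5, 32, 28],
            "tracker 2a": [18, 15, 28, 26]}
COUNTS = {
    "tracker 1": {"TP": [9, 3, 23, 20], "FN": [1, 0, 13, 18],
                  "FP": [2, 2, 9, 8]},
    "tracker 2a": {"TP": [7, 3, 18, 16], "FN": [3, 0, 18, 22],
                   "FP": [11, 12, 10, 10]},
}


def check_and_summarise(tracker):
    """Verify the count identities and return (event, recall, precision)."""
    rows = []
    counts = COUNTS[tracker]
    for i, event in enumerate(EVENTS):
        tp, fn, fp = counts["TP"][i], counts["FN"][i], counts["FP"][i]
        if tp + fn != GROUND_TRUTH[i]:
            raise ValueError(f"{tracker}, {event}: TP + FN != ground truth")
        if tp + fp != DETECTED[tracker][i]:
            raise ValueError(f"{tracker}, {event}: TP + FP != detections")
        rows.append((event, tp / (tp + fn), tp / (tp + fp)))
    return rows


summary = {name: check_and_summarise(name) for name in COUNTS}
for name, rows in summary.items():
    for event, recall, precision in rows:
        print(f"{name:10s} {event:18s} "
              f"recall={recall:.2f} precision={precision:.2f}")
```

For tracker 2a, the low precision on the entering and leaving scene events reflects the large number of FPs discussed above.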
Tracker 1      TP percentage   O_tot
GT Object 0    4/4             1
GT Object 1    8/16            3
GT Object 2    12/13           2
GT Object 3    7/11            5
GT Object 5    6/12            2
GT Object 6    3/3             1
GT Object 7    6/10            2
GT Object 8    3/4             2
GT Object 9    2/4             1
Total          51/77 (66%)
Table 4.6: Object-based evaluation of tracker 1 (PETS)
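As a small worked example, the total of Table 4.6 can be reproduced by summing the per-object counts. The sketch below is plain arithmetic on the table values; the variable and function names are hypothetical.

```python
# (matched events, annotated events) per ground truth object, from Table 4.6.
TABLE_4_6 = {0: (4, 4), 1: (8, 16), 2: (12, 13), 3: (7, 11), 5: (6, 12),
             6: (3, 3), 7: (6, 10), 8: (3, 4), 9: (2, 4)}


def total_tp_rate(per_object):
    """Sum matched and annotated events over all objects."""
    matched = sum(m for m, _ in per_object.values())
    annotated = sum(a for _, a in per_object.values())
    return matched, annotated, matched / annotated


matched, annotated, rate = total_tp_rate(TABLE_4_6)
print(f"{matched}/{annotated} ({rate:.0%})")
```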
Tracker 2a     TP percentage   O_tot
GT Object 0    4/4             3
GT Object 1    10/16           4
GT Object 2    10/13           5
GT Object 3    5/11            6
GT Object 6    3/3             2
GT Object 7    4/10            4
GT Object 9    1/4             1
Total          37/77 (42%)
Table 4.7: Object-based evaluation of tracker 2a (PETS)
4.5.3 HERMES
Table 4.8 shows the total number of event detections for the sequence, and Tables 4.9, 4.10, 4.11 and 4.12 contain the evaluation metric applied to the object trackers. In general, the evaluation of the three trackers shows good results with interesting differences and trade-offs. The evaluation scores of each tracker are briefly discussed and interpreted below.

                  ground truth   tracker 1   tracker 2b   tracker 3
Entering Scene         8            14            7           8
Leaving Scene          7            10            6           7
Entering PedX          6             6            6           6
Leaving PedX           7             7            6           6
Table 4.8: Detected events for the HERMES outdoor sequence
Tracker 1

Table 4.9 shows the results of the different trackers. In comparison to the other two methods, tracker 1 is the only tracker with zero FN. Only tracker 1 detects and tracks the bags in the scene (see Figures 4.6(b), 4.6(c) and 4.13). However, these good results come at the price of multiple FPs and identity changes during pickup and dropping of the bags, as shown in Table 4.10. The disappearing car at the end of the sequence leads to multiple wrong object appearances, excessively increasing the number of objects.

Figure 4.10: Frames/Event plot for the HERMES sequence (horizontal axis: frames; vertical axis: Entering Scene, Leaving Scene, Entering PedX, Leaving PedX). Stars denote ground truth, squares tracker 1, diamonds tracker 2b and pentagrams tracker 3.
Tracker 2b
Tracker 2b is the 2.5D tracking method presented in Chapter 3. It is an improved version of tracker 2a, which was used for the CAVIAR and PETS sequences in the previous experiments, and it now includes the capability to handle multiple object classes. As shown in Table 4.11, it perfectly tracks all pedestrians and cars without identity changes, resulting in complete tracks. Cars and pedestrians are correctly classified, as shown in Figures 4.11 and 4.12. However, the bags and their events are missed completely, resulting in the FNs shown in Table 4.9 and Figure 4.13. In addition, a wrong object appears at the position of the parked car when it drives away at the end of the sequence.
Tracker 3

The results of this tracker are quite similar to those of tracker 2b, as it tracks cars and pedestrians but also ignores the bags. The wrong object initialized at the empty space of the disappearing car shows the same segmentation problem as the other trackers. However, results could be different if the codebook for the background had been trained without this car. Furthermore, this tracker loses track of one pedestrian when the pedestrian passes behind a lamp pole, resulting in an identity change shown in Table 4.12 (GT Object 6). The lamp pole is seen in only one of the two cameras.

In conclusion, the evaluation shows that each tracker has its own strengths and weaknesses. Only tracker 1 detects bags and therefore finds all relevant events, but also some undesired ones. Tracker 2b does a perfect job of tracking cars and pedestrians while not tracking any bags. Tracker 3 adds an identity change to the otherwise similar results in comparison to tracker 2b. Disappearing objects initially learned as part of the background cause problems with all three methods.

Tracker 1          TP   FN   FP   T_ave   L_ave
Entering Scene      8    0    6   2.44s   2.30m
Leaving Scene       7    0    3   0.67s   1.77m
Entering PedX       6    0    2   0.35s   0.30m
Leaving PedX        7    0    0   0.22s   0.43m

Tracker 2b         TP   FN   FP   T_ave   L_ave
Entering Scene      6    2    1   0.56s   1.34m
Leaving Scene       6    1    0   0.59s   0.39m
Entering PedX       6    0    0   0.50s   0.80m
Leaving PedX        6    1    0   0.16s   0.64m

Tracker 3          TP   FN   FP   T_ave   L_ave
Entering Scene      7    1    1   1.04s   2.34m
Leaving Scene       6    1    1   0.26s   0.54m
Entering PedX       6    0    0   0.27s   0.21m
Leaving PedX        6    1    0   0.24s   0.75m
Table 4.9: Event-based evaluation of the HERMES outdoor sequence
Figure 4.11: The presented tracking method on the HERMES sequence
Tracker 1      TP percentage   O_tot
GT Object 1    4/4             3
GT Object 2    4/4             1
GT Object 3    1/1             1
GT Object 4    4/4             1
GT Object 5    4/4             1
GT Object 6    4/4             1
GT Object 7    3/3             1
GT Object 8    4/4             3
Total          28/28 (100%)
Table 4.10: Object-based evaluation of tracker 1 (HERMES)
Figure 4.12: Segmentation on the HERMES sequence
Figure 4.13: HERMES sequence: three trackers. Only tracker 1 in the left image detects the small bags (nr. 3 and nr. 23). Tracker 2b in the center and tracker 3 omit the small objects to achieve a higher robustness for the other objects.
Tracker 2b     TP percentage   O_tot
GT Object 1    4/4
GT Object 2    4/4
GT Object 4    4/4
GT Object 5    4/4
GT Object 6    4/4
GT Object 8    4/4
Total          24/24 (100%)

Table 4.11: Object-based evaluation of tracker 2b (HERMES)

Tracker 3      TP percentage   O_tot
GT Object 1    4/4             1
GT Object 2    4/4             1
GT Object 4    4/4             1
GT Object 5    4/4             1
GT Object 6    4/4             2
GT Object 7    1/3             1
GT Object 8    4/4             1
Total          25/27 (93%)
Table 4.12: Object-based evaluation of tracker 3 (HERMES)
4.5.4 Metric
While the results are promising, the case study also shows some limitations of the proposed metric. Given the annotated ground truth, the difficulty of a sequence cannot be fully determined from the number, types and density of events. However, the number and types of events and the relationships between them can give a first estimate of the difficulty of a sequence for a certain event type. Because relevant difficult 'events' such as illumination changes or scene occlusions are absent from both the ground truth and the tracking results, several difficulties are not directly visible to our metric. Only human inspection of the frames where many failures occur might reveal the illumination change.
4.6 Discussion

A novel evaluation metric for multi-object tracking was introduced, which is based on semantically higher-level events. Three different public datasets were processed automatically. They showed the versatility of the metric, which allows individual types of events to be defined for different application scenarios. Already available annotated ground truth data targeting lower-level metrics could be reused and automatically converted into the novel event-based representation. The metric aims at emulating a human visual inspection by conceptualizing the evaluation in terms similar to the human notions of objects and events. This minimizes the need for human visual inspection, allowing faster testing of new algorithms or longer sequences. The experiments also showed that the event type has to be chosen carefully. In cases of over-interpretation of semantically low-level results, such as occlusion events in crowded scenes, the results might be dominated by other effects rather than the actual event.

For future work, the evaluation metric could also be utilized to combine different levels of tracking within a single unified metric. In particular, in the context of HERMES, with its three levels of visual tracking, such a metric could combine the different disciplines on a higher conceptual level.
5 Conclusion

In this thesis, two tracking methods and prototype systems for visual surveillance applications were presented. The work focused on real-time methods for multi-object tracking with static cameras. Furthermore, a novel event-based evaluation metric was presented [62]. Different state-of-the-art tracking methods were compared against the ground truth of multiple public datasets. The preceding chapters already discuss the results, findings and open issues in detail. This chapter briefly summarises the results and provides a more general outlook.
5.1 2D and 2.5D Tracking Methods

Chapters 2 and 3 both present real-time tracking methods for visual surveillance with a single static camera. The proposed monocular object trackers are able to detect and track in non-controlled environments, adapt to changing lighting conditions, and handle occlusions. Both trackers use a modular tracking framework sharing a similar Visual Observation module. Bayesian per-pixel classification is used to segment an image into foreground and background objects, based on observations of object appearances and motions. However, the two trackers use different methods for Object Classification, Object Tracking, Occlusion Handling and Environment Modeling. Together, these modules form a generic visual surveillance system as presented in Section 1.3.

The 2D tracker presented in Chapter 2 uses image coordinates and a connected component algorithm to detect and track objects of arbitrary size. It allows a quick setup of experiments with any camera position or sequence without a geometric camera calibration. This was especially useful for our prototype system for
privacy in video surveillance. Due to the arbitrary object size, persons were tracked and obscured completely, and the cameras could be placed anywhere in a room. However, the simple object model led to poor tracking performance in crowded situations.

In Chapter 3, an extended 2.5D tracker was presented, which allows objects to be tracked more robustly during multi-occlusion situations. Furthermore, it distinguishes between multiple object types such as cars and pedestrians and operates in 3D world coordinates from a single camera view using a camera calibration. The method was demonstrated in two multi-camera systems for intelligent multi-resolution surveillance. Real-time object trajectories were used to intelligently steer pan-tilt-zoom cameras in uncontrolled indoor and outdoor scenarios. The improved robustness was mainly achieved through the geometric calibration, which was used in the modules for Object Classification, Object Tracking, Occlusion Handling and Environment Modeling. However, the need for a camera calibration and the restriction to views not parallel to the ground plane limit the versatility in comparison to the first 2D method.

Both trackers rely on a static background and are therefore limited to situations where the target objects are the main cause of changes in the image. The methods fail for heavily changing backgrounds such as waving trees, large projection screens and sudden lighting changes. Furthermore, a large number of objects reduces the tracking performance due to difficult multi-occlusion situations and a degenerating background model caused by the lack of continuous model updates behind objects.
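The idea of Bayesian per-pixel classification mentioned above can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the model of Chapters 2 and 3: it assumes a single grey value per pixel, one Gaussian per background pixel and a uniform foreground likelihood, whereas the trackers in this thesis use observations of object appearance and motion. The prior, the threshold and all function names are hypothetical.

```python
import math


def gaussian_pdf(value, mean, std):
    """Density of a 1D Gaussian, used as the background likelihood."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def foreground_posterior(pixel, bg_mean, bg_std, prior_fg=0.3, levels=256):
    """P(foreground | pixel) by Bayes' rule for a single grey value."""
    like_bg = gaussian_pdf(pixel, bg_mean, bg_std)
    like_fg = 1.0 / levels      # assumed uninformative foreground model
    evidence = prior_fg * like_fg + (1.0 - prior_fg) * like_bg
    return prior_fg * like_fg / evidence


def segment(image, bg_mean, bg_std, threshold=0.5):
    """Label each pixel as foreground (True) or background (False)."""
    return [[foreground_posterior(p, m, s) > threshold
             for p, m, s in zip(row, mean_row, std_row)]
            for row, mean_row, std_row in zip(image, bg_mean, bg_std)]


# Toy 2x3 image: the background is grey value 100 with standard deviation 5.
bg_mean = [[100.0] * 3 for _ in range(2)]
bg_std = [[5.0] * 3 for _ in range(2)]
image = [[100, 102, 180],
         [99, 30, 101]]
mask = segment(image, bg_mean, bg_std)
print(mask)
```

Because such a model only explains pixels that match the learned background statistics, strongly changing backgrounds and sudden lighting changes are labelled as foreground, which corresponds to the limitation described above.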
5.2 Discussion of the Evaluation Metric
In Chapter 4, an event-based evaluation metric for multi-object tracking was presented and applied to state-of-the-art trackers on multiple public datasets. The metric measures the completeness of semantically high-level events, which are automatically extracted from the ground truth and the tracking results. This comparison emulates the human conceptualization of the world into events and objects. The automatic conversion of raw data from tracking and annotation into events has several advantages: the metric is independent of the frame rate and annotation density, and it represents a lowest common denominator for various tracking data from semantically different levels. Furthermore,
the type and definition of the events for the metric can be fine-tuned for different application scenarios. However, event types have to be chosen appropriately for the low-level results and the application. Over-interpretation can lead to unreliable event detections and wrong results, where one event score is affected by the performance in another domain. Furthermore, the ground truth, being the result of a human interpretation, is sometimes ambiguous.
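The conversion of raw trajectories into time-stamped events, and the resulting independence from the frame rate, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the rectangular zone, the event names "Entering Zone" and "Leaving Zone" (stand-ins for events such as Entering Shop or Entering PedX) and the function name are hypothetical, and the actual event definitions are those given in Chapter 4.

```python
def extract_events(track, fps, zone):
    """Convert one trajectory into (kind, time [s], x, y) events.

    track: list of (frame, x, y) samples sorted by frame.
    zone:  (xmin, ymin, xmax, ymax), a hypothetical region of interest.
    """
    def inside(x, y):
        xmin, ymin, xmax, ymax = zone
        return xmin <= x <= xmax and ymin <= y <= ymax

    first_frame, x0, y0 = track[0]
    events = [("Entering Scene", first_frame / fps, x0, y0)]
    was_inside = inside(x0, y0)
    for frame, x, y in track[1:]:
        now_inside = inside(x, y)
        if now_inside != was_inside:
            kind = "Entering Zone" if now_inside else "Leaving Zone"
            events.append((kind, frame / fps, x, y))
            was_inside = now_inside
    last_frame, x1, y1 = track[-1]
    events.append(("Leaving Scene", last_frame / fps, x1, y1))
    return events


zone = (4.0, -1.0, 6.0, 1.0)
# The same motion annotated densely (25 fps) and sparsely (5 fps).
track_dense = [(f, f / 10, 0.0) for f in range(101)]
track_sparse = [(f // 5, f / 10, 0.0) for f in range(0, 101, 5)]
dense = extract_events(track_dense, fps=25, zone=zone)
sparse = extract_events(track_sparse, fps=5, zone=zone)
for a, b in zip(dense, sparse):
    print(f"{a[0]:15s} dense t={a[1]:.2f}s  sparse t={b[1]:.2f}s")
```

Both annotations yield the same sequence of event types, and the time stamps differ by at most one sampling interval of the sparser annotation.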
5.3 Outlook
More context and scene knowledge could be used to extend future methods in two directions, continuing the direction of research from the 2D to the 2.5D tracking approach. First, improved object and background models could make the tracker more robust in a larger variety of situations. Multi-modal background models for 24-hour operation could also be incorporated. Second, the current tracker could be guided by a context-aware high-level instance. This would allow object trajectories, detections and removals to be filtered according to their plausibility with respect to higher-level scene knowledge. Furthermore, the tracking strategies could be adapted on-line to actively reduce failures and manage computational resources. For example, a parked car could be removed from the foreground object list and incorporated into the background model to improve the overall performance. In a broader context, different tracking and detection methods could be combined and selected according to the situation and the available computational resources.

Finally, it would be interesting to take advantage of an on-line and real-time method and integrate it into more complex systems. In particular, the multi-resolution surveillance system [6] could be an interesting starting point for future research on a cognitive system. For the evaluation metric, it would be of interest in future experiments to combine multiple applications of larger systems. For example, visual object detection applications such as facial emotion detection or biometric identification as part of a multi-resolution surveillance system could generate events for each module. The proposed metric would then be extended to evaluate whole systems.
A Datasets

This appendix gives an overview of popular public datasets used in the research community for visual surveillance, as well as additional sequences used in this thesis. The list describes their characteristics and difficulties qualitatively with respect to the multi-object tracking problem. Tables A.1, A.2 and A.3 give a quantitative overview.
• Hermes Outdoor
The Hermes outdoor sequence is a staged scene addressing the following visual problems: ghosting and partial occlusions. Semantically, the scene contains: walking and running pedestrians, moving cars, theft, left-luggage, danger of a near collision and a chase scene. The sequence contains many semantic and visual problems in a compact form. However, it is not long enough to learn statistical models. It has a good image quality (high resolution), is not crowded and all objects are large. Ground truth of trajectories and segmentation masks is available for one camera.
• Central Square
The Central Square dataset was recorded with a public web-cam and contains the following visual problems: very low resolution/small objects, encoding artifacts, additional vehicles such as trams. Semantically, the scene contains: varying density of pedestrian and car crowds, interactions between cars and pedestrians at a pedestrian crossing. The scene is extremely difficult due to many actors, very low resolution, small objects and the poor video quality. Ground truth of person detections is available.
• PETS'2001
The PETS'2001 dataset consists of multiple sequences for training and testing. Usually only the first dataset is used, which contains the following visual problems: trees, ghosting and occlusions behind background and cars. Semantically, the scene contains: groups of walking pedestrians and sporadic drive-by of a car or bike. The sequence has a good image quality and colors. It was extensively used in tracking publications and comparisons. Trajectory annotation data is available for dataset 1 only. In Section 2.6.1 the third dataset was used, which contains lighting changes as an additional challenge. The evaluation in Section 4.5.2 uses the first dataset.
• IAKS Rush Hour
The IAKS Rush Hour sequence was recorded for traffic analysis. It contains the following visual problems: below average image quality with challenging conversion artifacts, small objects, occlusions, clutter and varying lighting conditions. Semantically, the scene contains: car traffic, streetcars and a few pedestrians. The pedestrians in particular are very small and difficult to detect or track given the image artifacts. The sequence as well as the ground truth annotations for all 66 cars are available on request from IAKS, Karlsruhe.
• CAVIAR
For the CAVIAR project, a large number of videos were recorded in different settings, including the shopping mall sequences presented here. They contain the following visual problems: low resolution and reflections. Semantically, the scene contains: low density crowds, people walking and meeting with others, window shopping, entering and exiting shops, fighting, passing out and leaving a package in a public place. One benefit of this dataset is that all sequences are extensively annotated. However, actor labels are not the same in both views. Furthermore, some actors are not fully labelled, and the actors have different labels upon entering and leaving shops.
• PETS'2006
The PETS'2006 datasets are multi-sensor sequences containing left-luggage scenarios at a railway station with varying scene complexity. They contain the following visual problems: specularities, cluttered background, reflections (in 1 view), patterns on the floor. Semantically, the scene contains: crowds, unattended and abandoned luggage, different luggage (briefcase, suitcase, rucksacks, backpack, ski gear carrier). The ground truth available consists of sparse semantic events related to the luggage.
• PETS'2007
The PETS'2007 datasets are multi-sensor sequences at an airport security check-point. The following visual problems have to be addressed: very crowded scene with heavy occlusions, many people are only partly visible. Semantically, the scene contains: loitering, left luggage and theft. The dataset includes one sequence with an empty background for learning. One camera view is colored blueish as it was recorded through glass. The ground truth of this dataset is limited to sparse alarms of semantic events.
• BEHAVE
BEHAVE consists of a large database of interacting people outdoors, filmed out of a window. The visual problems of the sequences are: slight occlusions, weak colors and reflections of irrelevant people inside the building. The emphasis of this dataset is on the interactions; it contains people acting out about 10 types of group interactions: In_Group, Approach, Walk_Together, Split, Ignore, Following, Chase, Fight, Run_Together and Meet. Sometimes a car passes without interaction. Most of the sequences come with annotated bounding-boxes.
• Terrascope
The Terrascope sequences contain scripted indoor multi-camera recordings of several offices with 9 partly overlapping cameras. The visual challenges are: heavy occlusions and partly visible people. Semantically, the following actions are staged: group meeting, group exit and intruder, suspicious behavior and theft.
                                               HERMES                            Central                       Rush Hour
resolution [px] (progressive / interlaced)     1392x1040 (p)                     320x240 (p)                   768x576 (i)
#sequences x #frames x #fps                    1x1612x15fps                      1x2276x25fps, 3 sub-clips     1x15000x25fps
# of objects                                   1-3                               2-20                          1-10
average height [px]                            peds (100-270), cars (150-290)    peds (40-50), cars (55)       peds (10-20), cars (25-50)
multi-view                                     4 (+1 PTZ)                        1                             1
weather conditions                             normal, cloudy                    normal, cloudy                changing, cloudy / sunny
lighting conditions                            diffuse, daylight                 diffuse, daylight             changing, diffuse / shadows
simplifications                                not crowded, no shadows           no shadows                    no groups
website                                        link                              link                          on request

Table A.1: HERMES, Central and Rush Hour dataset comparison.
                                               PETS'2001                                      CAVIAR                            PETS'2006
resolution [px] (progressive / interlaced)     768x576 (i)                                    384x288 (p)                       720x576 (i)
#sequences x #frames x #fps                    5x(5000-11000)x25fps                           26x(500-1500)x15fps               7x2300x25fps
# of objects                                   1-10                                           1-5                               1-10
average height [px]                            peds (25-65), cars (40-100)                    peds (20-150)                     peds (50-250)
multi-view                                     2                                              1 (INRIA) + 2 (shopping mall)     4
weather conditions                             normal, windy                                  indoor, sunlight                  indoor
lighting conditions                            daylight, set1: diffuse, set2+3: changing      colored lighting                  artificial illumination
simplifications                                no large groups or crowds                      no interactions                   constant lighting
website                                        link                                           link                              link
Table A.2: PETS 2001, CAVIAR and PETS 2006 dataset comparison.
                                               PETS'2007                                BEHAVE                                                        Terrascope
resolution [px] (progressive / interlaced)     720x576                                  640x480                                                       640x480
#sequences x #frames x #fps                    9x(2700-4500)x25fps                      4x76800x25fps                                                 1x(2-10 min)x30fps
# of objects                                   3-30                                     3-5                                                           0-4
average height [px]                            peds (100-300)                           peds (150)                                                    peds (250-600)
multi-view                                     4                                        1                                                             9
weather conditions                             indoor, half sunny                       penumbra, half sunny                                          indoor
lighting conditions                            illumination changes, some shadows       some illumination changes, slight shadows / reflections      various static but local light sources
simplifications                                                                         constant illumination during parts of the sequence           low density, uniform background
website                                        link                                     link                                                          link
Table A.3: PETS 2007, BEHAVE and Terrascope dataset comparison.
Bibliography

[1] J. Aguilera, H. Wildenauer, M. Kampel, M. Borg, D. Thirde, and J. Ferryman. Evaluation of motion segmentation quality for aircraft activity surveillance. In IEEE Int. Workshop on VS-PETS, pages 293-300, 2005.
[2] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12:43-77, 1994.
[3] F. Bashir and F. Porikli. Performance evaluation of object detection and tracking systems. In IEEE International Workshop on PETS, volume 5, pages 7-14, 2006.
[4] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool. Speeded-up robust features (surf). Computer Vision and Image Understanding (CVIU), 110(3):346-359, June 2008.
[5] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
[6] N. Bellotto, E. Sommerlade, B. Benfold, C. Bibby, I. Reid, D. Roth, L. V. Gool, C. Fernandez, and J. Gonzalez. A distributed camera system for multi-resolution surveillance. In Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC 2009), September 2009. In press.
[7] J. Berclaz, F. Fleuret, and P. Fua. Robust people tracking with global trajectory optimization. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 744-750, June 2006.
[8] T. E. Boult, R. J. Micheals, X. Gao, and M. Eckmann. Into the woods: Visual surveillance of non-cooperative and camouflaged targets in complex outdoor settings. In Proceedings of the IEEE, pages 1382-1402, 2001.
[9] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. V. Gool. Robust tracking-by-detection using a detector confidence particle filter. In IEEE International Conference on Computer Vision (ICCV'09), October 2009. In press.
[10] M. B. Capellades, D. Doermann, D. DeMenthon, and R. Chellappa. An appearance based approach for human and object tracking. In ICIP, 2003.
[11] A. Cavallaro and F. Ziliani. Characterisation of tracking performance. In Workshop on Image Analysis For Multimedia Interactive Services (WIAMIS), Montreux, Switzerland, April 2005.
[12] R. Collins. Mean-shift blob tracking through scale space. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 2, pages II-234-40 vol.2, June 2003.
[13] R. T. Collins and Y. Liu. On-line selection of discriminative tracking features. In ICCV, 2003.
[14] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[15] X. Desurmont, R. Sebbe, F. Martin, C. Machy, and J.-F. Delaigle. Performance evaluation of frequent events detection systems. In IEEE Int. Workshop on PETS, 18th June 2006.
[16] S. Dreyfus. Richard Bellman on the birth of dynamic programming. Oper. Res., 50(1):48-51, 2002.
[17] P. Duizer and D. Hansen. Multi-view video surveillance of outdoor traffic (master thesis). In Digital Project Library, Aalborg University, Denmark, August 2007.
[18] BlueC II Project, http://blue-c-ii.ethz.ch.
[19] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua. Multicamera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):267-282, 2008.
[20] R. Gerber and H.-H. Nagel. Occurrence extraction from image sequences of road traffic scenes. In Workshop on Cognitive Vision, pages 1-8, 19-20 September 2002.
[21] R. Gerber and H.-H. Nagel. Representation of occurrences for road vehicle traffic. In Artificial Intelligence, 2007. Article in press, available online.
[22] J. Gonzalez, F. X. Roca, and J. J. Villanueva. Hermes: A research project on human sequence evaluation. In Computational Vision and Medical Image Processing (VipIMAGE'2007), October 2007.
[23] M. Haag and H.-H. Nagel. Combination of edge element and optical flow estimates for 3d-model-based vehicle tracking in traffic image sequences. Int. J. Comput. Vision, 35(3):295-319, 1999.
[24] D. Hall, J. Nascimento, P. Ribeiro, E. Andrade, P. Moreno, S. Pesnel, T. List, R. Emonet, R. B. Fisher, J. S. Victor, and J. L. Crowley. Comparison of target detection algorithms using adaptive background models. In ICCCN '05: Proceedings of the 14th International Conference on Computer Communications and Networks, pages 113-120, Washington, DC, USA, 2005. IEEE Computer Society.
[25] D. M. Hansen, P. T. Duizer, S. Park, T. B. Moeslund, and M. M. Trivedi. Multiview video analysis of humans and vehicles in an unconstrained environment. In ISVC '08: Proceedings of the 4th International Symposium on Advances in Visual Computing, pages 428-439, Berlin, Heidelberg, 2008. Springer-Verlag.
[26] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809-830, 2000.
[27] http://chil.server.de/. Chil project website.
[28] http://homepages.inf.ed.ac.uk/rbf/CAVIAR. Caviar: Project website, datasets and annotated ground truth.
[29] http://www.cvg.cs.rdg.ac.uk/PETS2001. Pets 2001: Dataset and annotated ground truth.
[30] http://www.ee.ethz.ch/bleibe/data/datasets.html. Central pedestrian crossing: Dataset.
[31] http://www.silogic.fr/etiseo. Etiseo: Video understanding evaluation.
[32] W. Hu, T. Tan, L. Wang, and S. Maybank. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man and Cybernetics, 34:334-352, 2004.
[33] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In ECCV '96: Proceedings of the 4th European Conference on Computer Vision-Volume I, pages 343-356, London, UK, 1996. Springer-Verlag.
[34] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29:5-28, 1998.
[35] R. E. Kalman. A new approach to linear filtering and prediction problems. 1:35-45, 1960.
[36] R. Kasturi, D. Goldgof, P. Soundararajan, V. Manohar, J. Garofolo, R. Bowers, M. Boonstra, V. Korzhova, and J. Zhang. Framework for performance evaluation of face, text, and vehicle detection and tracking in video: Data, metrics, and protocol. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):319-336, 2009.
[37] S. M. Khan and M. Shah. A multiview approach to tracking people in crowded scenes using a planar homography constraint. In European Conference on Computer Vision, 2006.
[38] K. Kim, T. H. Chalidabhongse, D. Harwood, and L. Davis. Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 11(3):172-185, 2005. Special Issue on Video Object Processing.
[39] H. Kollnig and H.-H. Nagel. 3d pose estimation by directly matching polyhedral models to gray value gradients. Int. J. Comput. Vision, 23(3):283-302, 1997.
[40] D. Kottow, M. Köppen, and J. R. del Solar. A background maintenance model in the spatial-range domain. In Statistical Methods in Video Processing, volume 3247 of Lecture Notes in Computer Science, pages 141-152. Springer Berlin / Heidelberg, 2004.
[41] H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83-97, 1955.
[42] O. Lanz. Approximate bayesian multibody tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(9):1436-1449, Sept. 2006.
[43] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. Int. J. Comput. Vision, 77(1-3):259-289, 2008.
[44] B. Leibe, K. Schindler, N. Cornelis, and L. van Gool. Coupled object detection and tracking from static cameras and moving vehicles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1683-1698, October 2008.
[45] B. Leibe, K. Schindler, and L. Van Gool. Coupled detection and trajectory estimation for multi-object tracking. In International Conference on Computer Vision (ICCV'07), October 2007.
[46] A. J. Lipton, H. Fujiyoshi, and R. S. Patil. Moving target classification and tracking from real-time video. In WACV '98: Proceedings of the 4th IEEE Workshop on Applications of Computer Vision (WACV'98), page 8, Washington, DC, USA, 1998. IEEE Computer Society.
[47] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
[48] V. Manohar, M. Boonstra, V. Korzhova, P. Soundararajan, D. Goldgof, R. Kasturi, S. Prasad, H. Raju, R. Bowers, and J. Garofolo. Pets versus vace evaluation programs: A comparative study. In Ninth IEEE Int'l Workshop Performance Evaluation of Tracking and Surveillance, pages 1-6, 2006.
[49] http://www.hermes-project.eu. Hermes website.
[50] D. Meyer and J. Denzler. Model based extraction of articulated objects in image sequences for gait analysis. In ICIP '97: Proceedings of the 1997 International Conference on Image Processing (ICIP '97) 3-Volume Set-Volume 3, page 78, Washington, DC, USA, 1997. IEEE Computer Society.
[51] D. Meyer, J. Posl, and H. Niemann. Gait classification with hmms for trajectories of body parts extracted by mixture densities. In BMVC, pages 459-468, 1998.
[52] A. Mittal and L. S. Davis. M2tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. In ECCV, May 2002.
[53] T. B. Moeslund and E. Granum. A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81(3):231-268, 2001.
[54] T. B. Moeslund, A. Hilton, and V. Kruger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104:90-126, 2006.
[55] H.-H. Nagel and M. Haag. Bias-corrected optical flow estimation for road vehicle tracking. In ICCV '98: Proceedings of the Sixth International Conference on Computer Vision, page 1006, Washington, DC, USA, 1998. IEEE Computer Society.
[56] A. Nghiem, F. Bremond, M. Thonnat, and R. Ma. A new evaluation approach for video processing algorithms. In IEEE Workshop on Motion and Video Computing, 2007. WMVC '07., pages 15-15, 2007.
[57] K. Okuma, A. Taleghani, N. de Freitas, J. J. Little, and D. G. Lowe. A boosted particle filter: Multitarget detection and tracking. In ECCV, 2004.
[58] A. Rahimi. Fast Connected Components on Images, 2001. http://xenia.media.mit.edu/rahimi/connected/.
[59] D. Koller, K. Daniilidis, and H.-H. Nagel. Model-based object tracking in monocular image sequences of road traffic scenes. International Journal of Computer Vision, 10(3):257-281, June 1993.
[60] D. Roth, P. Doubek, and L. Van Gool. Bayesian pixel classification for human tracking. In MOTION, pages 78-83, January 2005.
[61] D. Roth, E. Koller-Meier, and L. Van Gool. Multi-object tracking driven event detection for evaluation. In 1st ACM workshop on Analysis and retrieval of events/actions and workflows in video streams (AREA), October 2008.
[62] D. Roth, E. Koller-Meier, and L. Van Gool. Multi-object tracking evaluated on sparse events. Multimedia Tools and Applications, 2009.
[63] D. Roth, E. Koller-Meier, D. Rowe, T. Moeslund, and L. Van Gool. Event-based tracking evaluation metric. In IEEE Workshop on Motion and Video Computing (WMVC), January 2008.
[64] D. Rowe, I. Reid, J. Gonzalez, and J. Villanueva. Unconstrained Multiple-people Tracking. In 28th DAGM, Berlin, Germany, pages 505-514. Springer LNCS, 2006.
[65] A. Senior, A. Hampapur, Y.-L. Tian, L. Brown, S. Pankanti, and R. Bolle. Appearance models for occlusion handling. In PETS, 2001.
[66] K. Smith, D. Gatica-Perez, J. Odobez, and S. Ba. Evaluating multi-object tracking. In CVPR Workshop on Empirical Evaluation Methods in Computer Vision (EEMCV), pages 36-36, June 2005.
[67] T. Spindler, C. Wartmann, L. Hovestadt, D. Roth, L. Van Gool, and A. Steffen. Privacy in video surveilled spaces. Journal of Computer Security, 16(2):199-222, January 2008.
[68] T. Spindler, C. Wartmann, D. Roth, A. Steffen, L. Hovestadt, and L. Van Gool. Privacy in video surveilled areas. In International Conference on Privacy, Security and Trust (PST2006), October 2006.
[69] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In CVPR, 1999.
[70] T. N. Tan, G. D. Sullivan, and K. D. Baker. Model-based localisation and recognition of road vehicles. Int. J. Comput. Vision, 27(1):5-25, 1998.
[71] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. Seventh International Conference on Computer Vision, 1:255+, 1999.
[72] R. Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, pages 221-244, 1992.
[73] A. Utsumi, H. Mori, J. Ohya, and M. Yachida. Multiple-view-based tracking of multiple humans. In ICPR '98: Proceedings of the 14th International Conference on Pattern Recognition-Volume 1, page 597, Washington, DC, USA, 1998. IEEE Computer Society.
[74] M. Valera and S. Velastin. Intelligent distributed surveillance systems: a review. Vision, Image and Signal Processing, IEE Proceedings -, 152(2):192-204, April 2005.
[75] H. Wang and D. Suter. Background subtraction based on a robust consensus method. Pattern Recognition, International Conference on, 1:223-226, 2006.
[76] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:780-785, 1997.
[77] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In ICCV '05: Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, pages 90-97, Washington, DC, USA, 2005. IEEE Computer Society.
[78] B. Wu and R. Nevatia. Tracking of multiple, partially occluded humans based on static body part detection. In CVPR, pages 951-958, 2006.
[79] B. Wu and R. Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. Int. J. Comput. Vision, 75(2):247-266, 2007.
[80] H. Yang, J. Lou, H. Sun, W. Hu, and T. Tan. Efficient and robust vehicle localization. In Image Processing, 2001. Proceedings. 2001 International Conference on, volume 2, pages 355-358 vol.2, Oct 2001.
[81] F. Yin, D. Makris, and S. A. Velastin. Performance evaluation of object tracking algorithms. In 10th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS2007), Rio de Janeiro, Brazil, October 2007.
[82] D. Young and J. Ferryman. PETS metrics: On-line performance evaluation service. In Proc. 2nd Joint IEEE Int. Workshop on VS-PETS, pages 15-16, 2005.
[83] T. Zhao, M. Aggarwal, R. Kumar, and H. Sawhney. Real-time wide area multicamera stereo tracking. In CVPR 2005, 2005.
List of Publications

D. Roth, E. Koller-Meier and L. Van Gool. Multi-object tracking evaluated on sparse events, Multimedia Tools and Applications, 2009.

D. Roth, E. Koller-Meier, D. Rowe, T.B. Moeslund and L. Van Gool. Event-Based Tracking Evaluation Metric, In Proceedings of the IEEE Workshop on Motion and Video Computing (WMVC), 2008.

D. Roth, P. Doubek and L. Van Gool. Bayesian Pixel Classification for Human Tracking, In Proceedings of the IEEE Workshop on Motion and Video Computing (MOTION), 2005.

D. Roth, E. Koller-Meier and L. Van Gool. Multi-Object Tracking driven Event Detection for Evaluation, In Proceedings of the 1st ACM workshop on Analysis and retrieval of events/actions and workflows in video streams (AREA), 2008.

N. Bellotto, E. Sommerlade, B. Benfold, C. Bibby, I. Reid, D. Roth, L. Van Gool, C. Fernandez and J. Gonzalez. A Distributed Camera System for Multi-Resolution Surveillance, In Proceedings of the 3rd ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC 2009).

T. Spindler, C. Wartmann, L. Hovestadt, D. Roth, L. Van Gool and A. Steffen. Privacy in video surveilled spaces, Journal of Computer Security, Vol. 16, No. 2, pp. 199-222, 2008.

T. Spindler, Ch. Wartmann, D. Roth, A. Steffen, L. Hovestadt and L. Van Gool. Privacy in Video Surveilled Areas, In Proceedings of the International Conference on Privacy, Security and Trust (PST), 2006.

K. Nummiaro, E. Koller-Meier, T. Svoboda, D. Roth and L. Van Gool. Color-Based Object Tracking in Multi-Camera Environments, Lecture Notes in Computer Science, Vol. 2781, pp. 591-599, 2003.

K. Nummiaro, E. Koller-Meier, T. Svoboda, D. Roth and L. Van Gool. Color-Based Object Tracking in Multi-Camera Environments, In Proceedings of the 25th Pattern Recognition Symposium, DAGM 2003.
Curriculum Vitae

Personal Data
Name: Daniel Roth
Date of Birth: January 28th, 1978
Citizenship: Zollikon (ZH) and Hemberg (SG), Switzerland

Education
2004 - 2009: Doctor of Sciences ETH, Department of Information Technology and Electrical Engineering, ETH Zurich, Switzerland
1998 - 2004: MSc ETH in Electrical Engineering and Information Technology, ETH Zurich, Switzerland
1993 - 1998: Gymnasium Matura Type C (Mathematics and Physics), Mathematisch-Naturwissenschaftliches Gymnasium Rämibühl, Zurich, Switzerland

Occupation
2004 - 2009: Computer Vision Laboratory, ETH Zurich, Switzerland. Teaching and Research Assistant. Research Projects: HERMES, IM2, BlueC-II, Epoch
2002: Arnold Nemetz and Associates Ltd, Vancouver, Canada. Electrical Consulting Engineers. AutoCAD application developer and CAD operator