Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
Lutz Dickmann Gerald Volkmann Rainer Malaka Susanne Boll Antonio Krüger Patrick Olivier (Eds.)
Smart Graphics
11th International Symposium, SG 2011
Bremen, Germany, July 18-20, 2011
Proceedings
Volume Editors

Lutz Dickmann, Gerald Volkmann, Rainer Malaka
University of Bremen, Department of Mathematics and Computer Science
Research Group Digital Media, 28359 Bremen, Germany
E-mail: {dickmann; volkmann; malaka}@tzi.org

Susanne Boll
University of Oldenburg, Department of Computer Science
Media Informatics and Multimedia Systems, 26121 Oldenburg, Germany
E-mail: [email protected]

Antonio Krüger
German Research Center for Artificial Intelligence (DFKI)
Innovative Retail Laboratory, 66123 Saarbrücken, Germany
E-mail: [email protected]

Patrick Olivier
Newcastle University, School of Computing Science
Newcastle Upon Tyne, NE7 1NP, UK
E-mail: [email protected]
ISSN 0302-9743, e-ISSN 1611-3349
ISBN 978-3-642-22570-3, e-ISBN 978-3-642-22571-0
DOI 10.1007/978-3-642-22571-0
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011931554
CR Subject Classification (1998): I.4, H.3-5, I.3-7, I.2, I.5
LNCS Sublibrary: SL 6 - Image Processing, Computer Vision, Pattern Recognition, and Graphics
Preface

The International Symposium on Smart Graphics serves as a scientific forum that attracts researchers and practitioners from the fields of computer graphics, artificial intelligence, cognitive science, human-computer interaction, interface design, and information visualization. Initiated by Andreas Butz, Antonio Krüger, and Patrick Olivier, Smart Graphics has been continued as a series of annual events in Asia, North America, and Europe for more than a decade now. The 11th International Symposium on Smart Graphics was held in Bremen, Germany, during July 18-20, 2011.

Frieder Nake, one of the prominent pioneers of computer art, provided a friendly welcome and invited the Smart Graphics community to the compArt Center of Excellence Digital Art at the University of Bremen. In an evening talk, he also shared insights from his interdisciplinary activities and research in computer art, computer graphics, human-computer interaction, and semiotics. Tracy Hammond, director of the Sketch Recognition Lab at Texas A&M University, kindly followed our invitation to give an evening talk on sketch-based interfaces and intelligent user interfaces. Tracy Hammond holds a PhD from MIT, where she worked with Randall Davis in the Computer Science and Artificial Intelligence Laboratory.

For their dedicated attention and careful review work, we sincerely thank our Program Committee members and external reviewers - experts in computer graphics, artificial intelligence, human-computer interaction, interface design, and other areas relevant to Smart Graphics. Several authors reported that the extensive reviews they received contained very insightful remarks and references that helped to improve their work. Three reviews were collected for each submission throughout all categories (full papers, short papers, and artistic works or system demonstrations). The acceptance rate in the main category was 43.48% this year: 10 out of 23 full-paper submissions were selected for publication in these proceedings. These are accompanied by 16 contributions that were accepted as short papers or system demonstrations. We encourage all authors to continue submitting to the Smart Graphics symposium series, and we wish all of you good luck and success with your future research.

July 2011
Lutz Dickmann
Gerald Volkmann
Rainer Malaka
Susanne Boll
Antonio Krüger
Patrick Olivier
Organization
Organizing Committee
Lutz Dickmann, University of Bremen, Germany
Gerald Volkmann, University of Bremen, Germany
Rainer Malaka, University of Bremen, Germany
Susanne Boll, University of Oldenburg, Germany
Antonio Krüger, DFKI/Saarland University, Germany
Patrick Olivier, University of Newcastle upon Tyne, UK
Advisory Board
Andreas Butz, University of Munich, Germany
Brian Fisher, University of British Columbia, Canada
Marc Christie, University of Nantes, France
Program Committee
Elisabeth André, University of Augsburg, Germany
Marc Cavazza, University of Teesside, UK
Yaxi Chen, University of Munich, Germany
Luca Chittaro, University of Udine, Italy
David S. Ebert, Purdue University, USA
Tracy Hammond, Texas A&M University, USA
Marc Herrlich, University of Bremen, Germany
Phil Heslop, Newcastle University, UK
Hiroshi Hosobe, National Institute of Informatics, Tokyo, Japan
Christian Jaquemin, LIMSI/CNRS, France
Gesche Joost, University of the Arts Berlin, Germany
Tsvi Kuflik, Haifa University, Israel
Jörn Loviscach, University of Applied Sciences Bielefeld, Germany
Boris Müller, University of Applied Sciences Potsdam, Germany
Frieder Nake, University of Bremen and University of the Arts Bremen, Germany
Bernhard Preim, University of Magdeburg, Germany
Mateu Sbert, University of Girona, Spain
Tevfik Metin Sezgin, Koç University, Turkey
John Shearer, Newcastle University, UK
Shigeo Takahashi, University of Tokyo, Japan
Robyn Taylor, University of Alberta, Canada
Roberto Therón, University of Salamanca, Spain
Benjamin Walther-Franks, University of Bremen, Germany
Reviewers
Chi Tai Dang, University of Augsburg, Germany
Federico Fontanaro, University of Udine, Italy
Mathias Frisch, University of Magdeburg, Germany
Tobias Isenberg, University of Groningen, The Netherlands
Markus Krause, University of Bremen, Germany
Joel Lanir, Haifa University, Israel
Roberto Ranon, University of Udine, Italy
Alan Wecker, Haifa University, Israel
Supporting Institutions The 11th International Symposium on Smart Graphics was organized and sponsored by the TZI Center for Computing and Communication Technologies at the University of Bremen. Additional support was provided by the compArt Center of Excellence Digital Art in Bremen and by the OFFIS Institute for Information Technology in Oldenburg. Smart Graphics 2011 was held in cooperation with the Eurographics Association (EG), the Association for the Advancement of Artificial Intelligence (AAAI), ACM SIGGRAPH, ACM SIGCHI and ACM SIGART.
Table of Contents
View and Camera Control

Smart Views in Smart Environments ..... 1
Axel Radloff, Martin Luboschik, and Heidrun Schumann

Advanced Composition in Virtual Camera Control ..... 13
Rafid Abdullah, Marc Christie, Guy Schofield, Christophe Lino, and Patrick Olivier

Towards Adaptive Virtual Camera Control in Computer Games ..... 25
Paolo Burelli and Georgios N. Yannakakis

Three-Dimensional Modeling

An Interactive Design System for Sphericon-Based Geometric Toys Using Conical Voxels
Masaki Hirose, Jun Mitani, Yoshihiro Kanamori, and Yukio Fukui

A Multi-touch System for 3D Modelling and Animation
Benjamin Walther-Franks, Marc Herrlich, and Rainer Malaka

Video Projection

Using Mobile Projection to Support Guitar Learning
Markus Löchtefeld, Sven Gehring, Ralf Jung, and Antonio Krüger
Smart Views in Smart Environments

Axel Radloff, Martin Luboschik, and Heidrun Schumann
Institute for Computer Science, University of Rostock, Germany
{axel.radloff,luboschik,schumann}@informatik.uni-rostock.de
Abstract. Smart environments integrate a multitude of different device ensembles and aim to facilitate proactive assistance in multi-display scenarios. However, the integration of existing software, especially visualization systems, to take advantage of these novel capabilities is still a challenging task. In this paper we present a smart view management concept for such an integration that combines and displays views of different systems in smart meeting rooms. Considering the varying requirements arising in such environments, our smart view management takes, e.g., the dynamic user positions, view directions, and even the semantics of the views to be shown into account.

Keywords: view management, smart environment, layout, display mapping, smart configuration.
1 Introduction
Smart environments are physical spaces that collect and process information about the users and their surroundings in order to estimate the intended situation and to adapt the environment and its behavior so that an intended state is reached [6,19,15]. Smart environments focus on heterogeneous ad-hoc device assemblies consisting of multiple sensors, displays, software infrastructure, etc. [11,5]. In particular, smart meeting rooms are used to support groups working together to reach a common goal [17]. Here, three typical scenarios can be distinguished: (1) the presenter scenario, including one dedicated presenter imparting information to the audience, (2) the exploration scenario, a group working together to explore data to gain insights, and (3) the work group scenario, composed of small subgroups working together discussing information.

However, these scenarios are not static and can alter dynamically. To give an example, a group of climate researchers may join up for a meeting. First, one group member presents an agenda for the meeting, the topics to be discussed, and further information - the presenter scenario. Then, the group splits into subgroups to discuss the information and gain insights - the work group scenario. After that, the leaders of the subgroups present the insights they acquired - the presenter scenario, followed by an exploration session of the whole team - the exploration scenario.

In these scenarios the users often use their own personal devices and software. However, presenting and sharing information with other users requires combining information on public displays (projectors, large monitors). In most cases
this is solved by binding private devices directly to one projector. In this case the location of the presented information is fixed, or changing it requires configuration effort. Furthermore, such scenarios do not take advantage of the capabilities of multi-display environments; e.g., the displays do not act as a whole entity but are used as multiple single-display environments. Hence, information may be spread over different displays and thus users have to turn their head, move around, or move their eyes repeatedly, resulting in an uncomfortable working environment.

To face these problems we introduce a smart view management in section 3. The main idea is to display views from different software systems by streaming them over the network and combining them. We provide an ad-hoc option that allows the integration of proprietary software, and a prepared option in which the systems provide their views themselves and thus allow for an adaptation of the displayed information taking further characteristics of the environment into account. Our approach requires grouping several views based on semantic relationships between the provided views (sec. 4). These groups of views are then automatically assigned to displays by a smart display mapper (sec. 5), also taking the positions and view directions of users into account. Finally, a smart layout (sec. 6) allows for an appropriate scaling and arrangement of different views on one display. We summarize our results and give an outlook on future work in section 7.
2 Background and Related Work
The development of smart environments raises many research questions, such as how to determine the users' intentions, how to interact with such environments naturally, or how to make use of the available devices in the service of the user to ease life and work [5]. An associated open problem in such scenarios is how to use the available display devices for maximum support of communication with minimum configuration effort at the same time. This problem comprises the integration of displays, the adaptation of content to different display devices, and so on.

The integration of heterogeneous displays is an open problem, as they may change over time (joining and leaving devices) and the content to be shown may change but meanwhile shall facilitate collaborative work (e.g., [7,14]). The few approaches developed specifically for such ensembles typically combine the individual displays into one large display or replicate content at multiple displays. Thus, current research mainly addresses the problems of synchronously sharing content from multiple devices on multiple displays and sharing the corresponding multiple interactions on the devices (e.g., [9,14]). But all those systems are only restrictively applicable in the dynamically changing heterogeneous ensembles we would like to address. The approach in [9] requires the proprietary software whose content shall be distributed to run on each system. This implies hard constraints concerning the operating system and hardware requirements to run the proprietary software and excludes heterogeneous ensembles. The Deskotheque environment [14] needs an adapted X server running on each machine, with the same constraints. Although WeSpace [18] is less constrained, it accounts only for one large public display on which different content shall be
shared upon. Moreover, those works generally do not consider the current state of the smart environment and assume static displays like mounted projectors or static servers. Although there are publications in the field of multi-display environments that study the effectiveness of such environments (e.g., [18]) or of single display types (e.g., [16]), they typically do not provide a strategy for reacting to dynamic changes. To the best of our knowledge, the work of Heider [10] is currently the only automatic approach that covers spontaneous changes concerning the displays and the corresponding usability within a smart environment. Display surfaces, user positions, projectors, and the content to be shown are used to optimize the assignment of views to displays. The approach automatically integrates displays and assigns content to them, but it does not yet comprise important factors like the perceivability of information or the semantics of content.

A further important issue is how to adapt content to heterogeneous displays (size, resolution, color depth, ...). The adaptation of graphical representations according to given output devices, in particular considering the reduced display size of mobile devices, has been addressed extensively (e.g., for images [3], for maps [8], for 3D models [12]). Beside adapting the content itself, other works focus on the adaptive layout of different contents (e.g., from markup languages). This kind of adaptation is strongly required in the described scenario if contents from different sources are to be shown on one display. Beach [4] showed that the general layout problem of arranging non-overlapping rectangles into a minimum rectangular space is NP-complete. Thus, automatic layout approaches are commonly based upon templates defined by an author and arrange content by abstract (e.g., author-of) or spatial (e.g., left-of) constraints [1]. To solve these constraints, optimization approaches like force-based models are used. Unfortunately, dynamic ensembles generally cannot provide any templates at all.

The approaches described above primarily relate to specific problems. We present a general smart view management combining state-of-the-art methods with newly developed techniques to allow for a smart information presentation in smart meeting rooms.
3 Principal Approach
In this paper we present the basic concept of a smart view management. For this purpose, we have to consider the characteristics that come along with a smart meeting room: ad-hoc device ensembles, heterogeneous personal devices, heterogeneous (perhaps proprietary) software and operating systems, dynamically varying scenarios (presenter, work group, or exploration), as well as positions, varying view directions, and interests of the users. Hence, our concept is designed to be independent of concrete information-displaying software systems. Instead we focus on the output of those systems - the views. These views have to be provided to the smart room first. For this, we define two options, (a) ad-hoc and (b) prepared view generation, which both come with specific advantages and disadvantages. Note that both options can be used in parallel.
The ad-hoc option uses view grabbers that run on the physical machine where the information-displaying software is running. Each view grabber provides an image stream of a rectangular area of the screen. This way, even proprietary software without the opportunity of customization can be integrated. However, this option provides the views as they have been generated and offers no possibility for adapting the content. There is also no meta-information about the content of the view available (e.g., the data source).

The prepared option, in contrast, requires a slight adaptation of the information-displaying software, in such a way that the software system provides an image stream of the content by itself. In doing so, each generated image stream defines a view. Customizing the information-displaying software allows for providing meta-information about the displayed view, e.g., which views belong together. Furthermore, views to be presented on a public display in the smart room do not necessarily have to be similarly shown on the personal device. However, this option needs a slight implementation effort to customize the software. This effort can be reduced by providing a prepared library.

Gathering views is only the first step. The major issue is to adequately display these views with regard to the characteristics of a smart meeting room. This requires:

– Grouping views (defining so-called view packages) in a way that semantically related views can be shown on the same or close-by displays. We address this point with an interactive view package definition (sec. 4).
– Automatically assigning views to display surfaces. To accomplish this we use a smart display mapper (sec. 5).
– Automatically generating a layout for the display of combined views. Our smart layout automatically arranges these views (sec. 6).

Additionally, our smart view management comprises meta renderers that encapsulate the characteristics of displays and thus provide information for the display mapping and layout. Figure 1 shows the framework of our smart view manager, whose functionality is described in the following sections.
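As a rough illustration of the ad-hoc option described above, a view grabber can be little more than a loop that captures a fixed screen rectangle and hands each frame to a streaming callback. The sketch below is ours (class and method names are hypothetical, not the authors' API) and uses plain AWT screen capture:

```java
import java.awt.AWTException;
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.image.BufferedImage;
import java.util.function.Consumer;

/** Hypothetical ad-hoc view grabber: captures a screen rectangle as an image stream. */
public class ViewGrabber implements Runnable {
    private final Robot robot;
    private final Rectangle region;              // screen area that defines the view
    private final Consumer<BufferedImage> sink;  // e.g. encodes and streams each frame
    private volatile boolean running = true;

    public ViewGrabber(Rectangle region, Consumer<BufferedImage> sink) throws AWTException {
        this.robot = new Robot();
        this.region = region;
        this.sink = sink;
    }

    @Override
    public void run() {
        while (running) {
            BufferedImage frame = robot.createScreenCapture(region);
            sink.accept(frame);                  // hand the frame to the streaming layer
            try {
                Thread.sleep(100);               // roughly 10 frames per second
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public void stop() { running = false; }
}
```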
4 Interactive Generation of View Packages
Usually the number of views to be shared among users should not be limited, and thus there are often more views to be displayed than available display surfaces. This requires an appropriate combination of different views to be presented upon one display with regard to, e.g., the users' goals, the current scenario, or the content of the views. We use the term view packages to describe groups of semantically related views. Although an automatic generation of view packages is desirable, the number and complexity of the influencing factors that would have to be considered in real time is too high. Moreover, the ad-hoc option does not provide the necessary meta-information to appropriately group views. Thus, we use an interactive approach to define view packages. However, we support the user by providing default setups that comprise different aspects. We do not claim these aspects to be complete, but they can be easily extended and adapted.
Fig. 1. Smart view management. Views from applications (A) and view grabbers (G) are grouped interactively into view packages to encode semantics. These semantics and the current environment state are regarded to assign views to display surfaces. If several views are to be presented on one display (managed by a meta renderer), the views are laid out according to semantics and display characteristics.
Task (present, compare): We support two typical tasks: present and compare. The first task typically means that a view shall be displayed to a maximum number of users in a way that good visibility is provided. The second task addresses the issue that users often have to compare different views. Hence, to support an appropriate comparison, the views and the associated display surfaces should be close to each other.

View content (fragile, robust): We distinguish two modes related to the resizing of a view. A fragile view means that the content cannot be distorted, since important information may be lost this way. In contrast, robust means that the view can be resized.

Scenario (presentation, discussion): This aspect describes the general setting of a smart environment, which (in our case) reflects the view directions of the audience. For example, in the presentation scenario the users generally visually follow the presenter. In contrast, within the discussion scenario, any single user has their own view direction, e.g., for acquiring necessary information from the closest display.

Based upon these aspects, we define customizable presets. For example, if several views originate from the same application on one device, we assume a compare task and build one compare view package. Typically, view content can be resized robustly in case of the prepared option, since alternative views could be generated. In contrast, the ad-hoc option leads to fragile view packages, to be on the safe side. Through this approach of interactive view package generation, we switch from complex hardware configuration and hardware presets to an expedient configuration of content and interests. We realized the view package generation with a lightweight and easy-to-use drag-and-drop GUI.
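The three aspects and their defaults could be encoded in a small data structure such as the following sketch (type and field names are our own illustration, not taken from the paper):

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical encoding of a view package and its semantic aspects. */
public class ViewPackage {
    public enum Task { PRESENT, COMPARE }
    public enum Content { FRAGILE, ROBUST }         // may the grouped views be resized?
    public enum Scenario { PRESENTATION, DISCUSSION }

    private final Set<String> viewIds = new LinkedHashSet<>();  // views grouped by the user
    private Task task = Task.PRESENT;
    private Content content = Content.FRAGILE;      // ad-hoc views default to fragile
    private Scenario scenario = Scenario.PRESENTATION;

    public void add(String viewId)      { viewIds.add(viewId); }
    public Set<String> views()          { return viewIds; }
    public Task task()                  { return task; }
    public void setTask(Task t)         { task = t; }
    public Content content()            { return content; }
    public void setContent(Content c)   { content = c; }
    public Scenario scenario()          { return scenario; }
    public void setScenario(Scenario s) { scenario = s; }
}
```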
5 Smart Display Mapping
The next step is the automatic assignment of view packages to display surfaces. For this purpose, we introduce a smart display mapper that reduces the configuration effort and thereby improves efficiency. Since the idea of view packages is basically to encode the semantics between different views, it is the task of the display mapper to materialize those semantics within the smart meeting room. Before explaining our approach, we briefly describe the basic idea of display mapping as it has been proposed in [10].
5.1 Basic Approach
In [10] the problem of assigning views of interest to display surfaces is defined as an optimization problem, optimizing spatial quality $q_s$, temporal continuity (quality $q_t$), and semantic proximity (quality $q_p$):

$$q(m_{t-1}, m_t) = a \cdot q_s(m_t) + b \cdot q_t(m_{t-1}, m_t) + c \cdot q_p(m_t) \qquad (1)$$
The different qualities are weighted (a, b, c), and $m_{t-1}$ and $m_t$ denote consecutive mappings. The spatial quality rates the visible spatial arrangement of views and comprises influencing factors like the visibility of display surfaces towards users, the rendering quality of projectors upon those surfaces, and the importance of views. The temporal quality $q_t$ judges assignment changes that may be distracting, and $q_p$ rates the correspondence of spatial and semantic proximity of views. The quality function (1) has to be maximized, which is an NP-hard problem. Therefore, Heider [10] uses a distributed greedy algorithm to randomly generate several mappings that associate devices, views, and display surfaces. These mappings have to be judged with the above function to find the optimum. Heider uses only elementary approximations (e.g., to calculate visibility) that do not account for important aspects like readability. Hence, we introduce some purposive enhancements that improve the display mappings.
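Operationally, the optimization amounts to generating candidate mappings and keeping the one with the highest combined quality. A simplified sketch of equation (1) in code, with hypothetical helper signatures (not Heider's or the authors' implementation), could look like this:

```java
import java.util.List;

/** Simplified sketch of rating and selecting display mappings; names are illustrative only. */
class Mapping { /* assignment of views to display surfaces (details omitted) */ }

interface MappingRater {
    double spatialQuality(Mapping m);                 // q_s: visibility, rendering quality, importance
    double temporalQuality(Mapping prev, Mapping m);  // q_t: penalises distracting reassignments
    double semanticProximity(Mapping m);              // q_p: spatial vs. semantic closeness of views
}

class DisplayMapper {
    double a = 1.0, b = 0.5, c = 0.5;                 // weights of equation (1); values here are arbitrary

    double quality(MappingRater r, Mapping prev, Mapping m) {
        return a * r.spatialQuality(m)
             + b * r.temporalQuality(prev, m)
             + c * r.semanticProximity(m);
    }

    /** Keeps the best of a set of candidates, e.g. produced by a randomized greedy search. */
    Mapping best(MappingRater r, Mapping prev, List<Mapping> candidates) {
        Mapping best = null;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (Mapping m : candidates) {
            double q = quality(r, prev, m);
            if (q > bestQ) { bestQ = q; best = m; }
        }
        return best;
    }
}
```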
5.2 Our Approach
Improved Spatial Quality. The term spatial quality $q_s$ significantly influences the spatial arrangement of views, but usability is not yet represented adequately: in [10] only the angle between a surface's normal and the user/projector is used to judge visibility and rendering quality. Thus, current viewing directions, the users' distances towards the surface, and the available projection directions are not regarded. As a result, users may have to turn around repeatedly, may not be able to perceive the displayed content, and steerable projectors may not hit the surface due to a limited deflexion range. To improve the situation, we consider additional geometrical measurements. First we regard the position of display surfaces according to the users' current body orientation and to the projection directions. Concerning users, we use a cosine function that is scaled to fall off towards the borders of the maximum head deflexion and results in 0 if the maximum is exceeded. If a steerable projector
is not able to hit a surface (with tolerance) due to a limited deflexion range, the rendering quality is assumed to be 0. These calculations are combined with the former angle measures by multiplication.

Secondly, we judge distance, since angles between users and surfaces do not reflect the perceivable information or readability. But Euclidean distance is ineligible for heterogeneous displays: a distance of 2 m to a smart phone has a different effect than to a power wall. Due to that, we introduce a field of view (FOV) calculation, as the FOV correlates with distance and is more meaningful in such scenarios. A horizontal FOV of 20° and a vertical FOV of 60° are known to be the focus area of human perception [13], and thus any difference is considered to have a negative effect. Since perspectively correct calculations are computationally complex, we use a simplifying approximation considering horizontal and vertical distortions of a display surface (regarding the user's point of view) independently. Thus, the approximated FOV $\gamma_h$ and $\gamma_v$ for a surface is

$$\gamma_{\{h,v\}} = \arctan\!\left(\frac{\cos\alpha_{\{v,h\}} \cdot hd_{\{h,v\}}}{d + \sin\alpha_{\{v,h\}} \cdot hd_{\{h,v\}}}\right) + \arctan\!\left(\frac{\cos\alpha_{\{v,h\}} \cdot hd_{\{h,v\}}}{d - \sin\alpha_{\{v,h\}} \cdot hd_{\{h,v\}}}\right).$$

Here $\alpha$ denotes the rotation angle of the surface, $d$ is the distance of the user towards the surface's center, and $hd$ is the half dimension of the surface; $180°$ has to be added if $(d - \sin\alpha \cdot hd)$ is negative. We define the following rating functions for the FOV in [0..1]:

$$fov_h(\gamma_h) = \begin{cases} \frac{1}{20^n}\,\gamma_h^{\,n},\; n > 1, & \gamma_h < 20° \\ \cos\!\left(\frac{9}{16}\,\gamma_h - 11.25°\right), & \gamma_h \ge 20° \end{cases} \qquad fov_v(\gamma_v) = \begin{cases} \frac{1}{60^n}\,\gamma_v^{\,n},\; n > 1, & \gamma_v < 60° \\ \cos\!\left(\frac{9}{7}\,\gamma_v - \frac{540°}{7}\right), & \gamma_v \ge 60° \end{cases}$$

A reduced FOV results in a loss of perceivable information and thus needs an increased penalty. An enlarged FOV may lead to less comfort (additional movements) and yields a lower penalty. Since both $fov = \min(fov_h, fov_v)$ and the above angle measures are important, we combine them again by multiplication. Our improvements ensure that practical needs like readability of content, comfort (fewer movements), and realizability of mappings (e.g., steerable projectors) are represented adequately within the spatial view arrangement - and thus the perceived spatial quality is increased.

Reduced Temporal Quality Tests. The above improvements in $q_s$ ensure readability and fewer eye or head movements, but increase the computational costs. Thus, we examined equation (1) to reduce complexity again. We examined the term of temporal quality $q_t$, which ensures that views do not unnecessarily change the display surface they are assigned to, i.e., small changes shall have no effect. As realized in [10], this means considering all users, views, and the corresponding visibilities again. We found that a reduction of assignment changes can also be reached by regarding only those circumstances that definitely need an update of the current mapping. This way, a new mapping will replace the former one only (1) if a device or user joins or leaves the smart
meeting room, or (2) if views change the display surfaces they are assigned to, the distance between the corresponding surfaces is bigger than a threshold, and they remain there for a minimum time period (to prevent flickering). In this way, we do not consider temporal quality repeatedly within the optimization process and consequently dismiss $q_t$ in equation (1). Instead we take an already optimized display mapping, analyze it, and materialize it if necessary. Thus, temporal continuity is retained and the computational costs are reduced, as the temporal analysis has to be carried out only in well-defined situations. Since meta renderers comprise more than simply a projector, those renderers are able to fade between formerly and currently assigned views and moreover may give visual clues, e.g., where a view moves to.

Realization of Multiple Views on One Display. A meta renderer is able to present multiple views via one projector by adapting and arranging the views (see sec. 6). To represent this ability within the display mapping optimization, we logically split the according projector into n identical logical sub-projectors, where n is the number of overall views. Running the optimization is done as usual, but every mapping that comprises the same view assigned to more than one of the belonging sub-projectors is rejected. Since all belonging sub-projectors are managed by one meta renderer, it is easy to obtain the necessary values, e.g., scaling quality, rendering quality, visibility.
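Assuming our reconstruction of the FOV formulas above is correct, the distance-related part of the spatial quality could be computed roughly as follows (the constants mirror the reconstructed rating functions and are only as reliable as that reconstruction):

```java
/** Sketch of the approximated FOV and its rating in [0,1]; illustrative, not the authors' code. */
class FovRating {

    /** Approximated FOV in degrees: alphaDeg = rotation angle of the surface,
     *  d = distance of the user to the surface centre, hd = half dimension of the surface. */
    static double approxFov(double alphaDeg, double d, double hd) {
        double a = Math.toRadians(alphaDeg);
        double gamma = Math.atan(Math.cos(a) * hd / (d + Math.sin(a) * hd))
                     + Math.atan(Math.cos(a) * hd / (d - Math.sin(a) * hd));
        double deg = Math.toDegrees(gamma);
        if (d - Math.sin(a) * hd < 0) deg += 180.0;   // special case stated in the text
        return deg;
    }

    /** Horizontal rating: 1 at the 20-degree focus threshold, strong penalty below it. */
    static double fovH(double gammaDeg, double n) {
        if (gammaDeg < 20.0) return Math.pow(gammaDeg / 20.0, n);               // n > 1
        return Math.max(0.0, Math.cos(Math.toRadians(9.0 / 16.0 * gammaDeg - 11.25)));
    }

    /** Vertical rating: 1 at the 60-degree focus threshold, milder penalty above it. */
    static double fovV(double gammaDeg, double n) {
        if (gammaDeg < 60.0) return Math.pow(gammaDeg / 60.0, n);
        return Math.max(0.0, Math.cos(Math.toRadians(9.0 / 7.0 * gammaDeg - 540.0 / 7.0)));
    }

    /** Combined FOV quality, multiplied into the other spatial quality measures. */
    static double fovQuality(double gammaH, double gammaV, double n) {
        return Math.min(fovH(gammaH, n), fovV(gammaV, n));
    }
}
```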
5.3 Smart Realization of View Package Semantics
Although equation (1) already includes the term $q_p$, an implementation of semantic proximity is still missing. However, the view package approach (see sec. 4) encodes basic semantics, which can be used for an automatic assignment of views to display surfaces:

Task (present, compare): The present task can be seen as the standard functionality of the display mapper. But to realize a spatial separation of view packages defined by different users, we define a slight semantic proximity for views of one package. In contrast, the compare task defines a high semantic proximity and thus needs a strong consideration. Thus, we use a basic realization of the semantic proximity p(v, v') as it can be used in $q_p$:

$$p(v, v') = \begin{cases} 0.5, & (v \in V) \wedge (v' \in V) \wedge (\text{task is present}) \\ 1, & (v \in V) \wedge (v' \in V) \wedge (\text{task is compare}) \\ 0, & \text{else} \end{cases}$$

where v and v' are views of the view package V. This way, there is no penalty if views do not belong to each other, a small penalty if they are from one package in a present task, and a big penalty if views shall be compared - on condition that the views are not displayed close to each other (see fig. 2(b)).

View content (fragile, robust): Meta renderers allow a scaling adaptation of views according to the screen of the displaying device. In the fragile case, any scaling is forbidden and thus results in a zero scaling factor. If views are
Fig. 2. Our smart meeting room as a floor-plan screenshot of our room observer (U: users, P: projectors, DB1-B4: views, and LW1-6, VD1-2: the canvases). (a) Related views are displayed close to each other due to compare packages. (b) Views are projected close to the speaker in the presentation scenario, ignoring individual viewing directions. (c) During a discussion, optimal visibility of the views is aimed for.
robust, the scaling factor is in [0..1) if a view is shrunk and 1 if it is enlarged. Within the spatial quality $q_s$ we multiply the scaling factor with the measure expressing the rendering quality of a projector.

Scenario (presentation, discussion): Within the presentation scenario, all users visually follow the presenter. Thus, the presented views have to be arranged near the presenter. For that, we use the connecting vectors between the users and the presenter as the users' viewing directions instead of the real directions (fig. 2(b)). In the discussion scenario, the visibility of the important content for a single user is decisive, and thus views are arranged optimally according to the extended visibility issues (see sec. 5.2, fig. 2(c)).

The design of our smart display mapper allows for a dynamic reaction to changing situations within a smart meeting room. It regards our simplified semantic definition (sec. 4) by using the above semantics-dependent automatic modifications within the optimization process. Finally, to realize the possible assignment of multiple views to one display surface and to complete our smart view management, we make use of the smart view layout presented next.
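The semantic proximity p(v, v') derived from a view package is simple enough to state directly in code; the following sketch uses our own naming and treats the package as a plain set of view identifiers:

```java
import java.util.Set;

/** Sketch of the semantic proximity p(v, v') derived from a view package (illustrative names). */
class SemanticProximity {
    /** packageViews: the view IDs of one view package V; compareTask: true if its task is "compare". */
    static double proximity(String v, String vPrime, Set<String> packageViews, boolean compareTask) {
        boolean samePackage = packageViews.contains(v) && packageViews.contains(vPrime);
        if (!samePackage) return 0.0;          // unrelated views: no penalty
        return compareTask ? 1.0 : 0.5;        // compare: strong pull together; present: slight pull
    }
}
```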
6 Smart View Layout
Automatically arranging multiple views on displays of different size is an open research problem. For instance, Ali suggests a force-based approach for the layout of dynamic documents requiring preprocessed content, templates, and constraints [2]. Because of the ad-hoc characteristic of smart meeting rooms, such preprocessing is generally not feasible. We chose a simplified approach. The major idea is to give a quick response by displaying a first layout. This layout is calculated by a spring-force-directed approach in combination with a pressure-directed resizing and is improved subsequently. According to Ali et al., missing constraints may result in artifacts. For this reason, we recalculate the layout repeatedly, use a quality function to rate the new solution, and finally replace the currently displayed layout if a threshold is exceeded. This background process dynamically calculates
Fig. 3. Users in front of a display surface act as occluders and can be integrated into the dynamic layout of views to increase visibility
and rates new layouts and thus dynamically considers condition changes, such as changed user positions, views, or display mappings. Moreover, we are able to guarantee the visibility of views: real objects within the smart meeting room that occlude parts of a display (e.g., a user in front of a screen) are represented as empty (non-visible), non-resizable elements within the layout. Those elements influence the forces affecting the views to be arranged, but remain fixed themselves (see fig. 3).
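A bare-bones version of the force-directed arrangement with fixed occluder elements might look like the following sketch (our simplification: it only applies pairwise repulsion, whereas the actual system additionally performs pressure-directed resizing and rates candidate layouts with a quality function):

```java
import java.util.List;

/** Minimal sketch of one relaxation step of a force-directed view layout (illustrative only). */
class ViewLayout {
    static class Element {
        double x, y, w, h;   // position and size in normalized display coordinates [0,1]
        boolean fixed;       // true for occluders, e.g. a user standing in front of the screen
    }

    /** Applies simple repulsion between element centres; fixed elements push but never move. */
    static void relax(List<Element> elements, double step) {
        double[] fx = new double[elements.size()];
        double[] fy = new double[elements.size()];
        for (int i = 0; i < elements.size(); i++) {
            for (int j = 0; j < elements.size(); j++) {
                if (i == j) continue;
                Element a = elements.get(i), b = elements.get(j);
                double dx = (a.x + a.w / 2) - (b.x + b.w / 2);
                double dy = (a.y + a.h / 2) - (b.y + b.h / 2);
                double d2 = Math.max(dx * dx + dy * dy, 1e-4);
                fx[i] += dx / d2;                        // repulsive force, falls off with distance
                fy[i] += dy / d2;
            }
        }
        for (int i = 0; i < elements.size(); i++) {
            Element e = elements.get(i);
            if (e.fixed) continue;                       // occluders stay where the tracked user is
            e.x = Math.min(Math.max(e.x + step * fx[i], 0), 1 - e.w);  // keep inside the display
            e.y = Math.min(Math.max(e.y + step * fy[i], 0), 1 - e.h);
        }
    }
}
```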
7 Results and Concluding Remarks
In this paper we present a concept for a smart view management. We gather views from the attendant systems - using a view grabber or software streaming - and make them available to the smart environment. Our simplified interactive definition of view packages encodes basic semantics that reflect different aspects of working scenarios within a smart meeting room. The smart display mapper assigns views from view packages to display surfaces according to the encoded semantics and the current room state provided by sensors (positions, directions, ...). The smart layout arranges multiple views on one display surface and reacts to changing conditions, too.

We implemented the smart view management in Java for maximum independence of operating systems. It has been integrated into our smart meeting room in Rostock. Here, the positions of the users are tracked via a Ubisense tracking system combined with probabilistic models for enhancing the accuracy and for determining viewing directions (e.g., by traces). In this way we provide a working solution to show views of different systems on different displays with regard to varying viewing conditions. Figure 4 shows a snapshot of our smart view management in use, demonstrating a presentation scenario. Here, the talk slides are defined as one view package with the semantic aspects present, fragile, presentation and hence are shown in full resolution next to the presenter. Furthermore, eight views are displayed to be compared. These views are grabbed and combined into one view package with the semantic aspects compare, robust, presentation.

Since the view package definition is easily done via a lightweight user interface, but all layouts and optimizations according to semantics and the current room
Fig. 4. Smart view management in use. Only a few semantic aspects have to be defined interactively, avoiding any further configuration effort.
state are carried out automatically, we reduce the configuration effort significantly and hence verifiably (see [10]) increase the usability of a smart meeting room. Although our management uses optimization, the available display space is still limited. Thus, if too many views have to be displayed on too little display space, the work efficiency or the number of finally displayed views decreases. Furthermore, the processing time of the display mapping and the force-based layout increases along with the number of views, and the bandwidth and latencies of the network connections are a limiting factor, too. Nevertheless, we intend to enhance our smart view management even further. For example, since meta renderers have access to all views, the views could easily be enriched, e.g., by view-connecting arrows. Additionally, we focus on a task-based improvement of our system by generating and adapting view packages based on the user's current tasks and workflows. Future work will also concentrate on interaction to support users interacting with multiple views across the boundaries of the devices at hand.

Acknowledgments. Axel Radloff is supported by a grant of the German National Research Foundation (DFG), Graduate School 1424 (MuSAMA).
References

1. Ali, K.: Adaptive Layout for Interactive Documents. PhD Thesis, University of Rostock, Germany (2009)
2. Ali, K., Hartmann, K., Fuchs, G., Schumann, H.: Adaptive Layout for Interactive Documents. In: Butz, A., Fisher, B., Krüger, A., Olivier, P., Christie, M. (eds.) SG 2008. LNCS, vol. 5166, pp. 247–254. Springer, Heidelberg (2008)
3. Avidan, S., Shamir, A.: Seam Carving for Content-Aware Image Resizing. ACM Transactions on Graphics 26(3), 10.1–10.9 (2007)
4. Beach, R.J.: Setting Tables and Illustrations with Style. PhD Thesis, University of Waterloo, Ontario, Canada (1985)
5. Brumitt, B., Meyers, B., Krumm, J., Kern, A., Shafer, S.: EasyLiving: Technologies for Intelligent Environments. In: Thomas, P., Gellersen, H.-W. (eds.) HUC 2000. LNCS, vol. 1927, pp. 12–29. Springer, Heidelberg (2000)
6. Cook, D.J., Das, S.K.: How Smart are our Environments? An Updated Look at the State of the Art. Pervasive and Mobile Computing 3, 53–73 (2007)
7. Encarnação, J.L., Kirste, T.: Ambient Intelligence: Towards Smart Appliance Ensembles. In: Hemmje, M., Niederée, C., Risse, T. (eds.) From Integrated Publication and Information Systems to Information and Knowledge Environments. LNCS, vol. 3379, pp. 261–270. Springer, Heidelberg (2005)
8. Follin, J.-M., Bouju, A., Bertrand, F., Boursier, P.: Management of Multi-Resolution Data in a Mobile Spatial Information Visualization System. In: Proceedings of Web Information Systems Engineering Workshops (WISEW 2003), pp. 92–99. IEEE Computer Society, Los Alamitos (2003)
9. Forlines, C., Lilien, R.: Adapting a Single-user, Single-display Molecular Visualization Application for Use in a Multi-user, Multi-display Environment. In: Proceedings of Advanced Visual Interfaces (AVI 2008), pp. 367–371. ACM Press, New York (2008)
10. Heider, T.: A Unified Distributed System Architecture for Goal-based Interaction with Smart Environments. PhD Thesis, University of Rostock, Germany (2009)
11. Heider, T., Kirste, T.: Smart Environments and Self-Organizing Appliance Ensembles. In: Mobile Computing and Ambient Intelligence: The Challenge of Multimedia. Schloss Dagstuhl, Germany (2005)
12. Huang, J., Bue, B., Pattath, A., Ebert, D.S., Thomas, K.M.: Interactive Illustrative Rendering on Mobile Devices. IEEE Computer Graphics and Applications 27(3), 48–56 (2007)
13. Kaufmann, H.: Strabismus. Georg Thieme Verlag, Stuttgart (2003)
14. Pirchheim, C., Waldner, M., Schmalstieg, D.: Deskotheque: Improved Spatial Awareness in Multi-Display Environments. In: Proceedings of IEEE Virtual Reality (VR 2009), pp. 123–126. IEEE Computer Society, Los Alamitos (2009)
15. Ramos, C., Marreiros, G., Santos, R., Freitas, C.F.: Smart Offices and Intelligent Decision Rooms. In: Handbook of Ambient Intelligence and Smart Environments, pp. 851–880. Springer, Heidelberg (2009)
16. Tan, D.S., Gergle, D., Scupelli, P., Pausch, R.: Physically Large Displays Improve Performance on Spatial Tasks. ACM Transactions on Computer-Human Interaction 13(1), 71–99 (2006)
17. van der Vet, P., Kulyk, O., Wassink, I., Fikkert, W., Rauwerda, H., van Dijk, B., van der Veer, G., Breit, T., Nijholt, A.: Smart Environments for Collaborative Design, Implementation, and Interpretation of Scientific Experiments. In: Proceedings of AI for Human Computing (AI4HC 2007), pp. 79–86 (2007)
18. Wigdor, D., Jiang, H., Forlines, C., Borkin, M., Shen, C.: WeSpace: The Design, Development and Deployment of a Walk-Up and Share Multi-Surface Collaboration System. In: Proceedings of Human Factors in Computing Systems (CHI 2009), pp. 1237–1246. ACM Press, New York (2009)
19. Youngblood, G.M., Heierman, E.O., Holder, L.B., Cook, D.J.: Automation Intelligence for the Smart Environment. In: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI 2005), pp. 1513–1514. Morgan Kaufmann, San Francisco (2005)
Advanced Composition in Virtual Camera Control

Rafid Abdullah (1), Marc Christie (2), Guy Schofield (1), Christophe Lino (2), and Patrick Olivier (1)

(1) School of Computing Science, Newcastle University, UK
(2) INRIA Rennes - Bretagne Atlantique, France
Abstract. The rapid increase in the quality of 3D content, coupled with the evolution of hardware rendering techniques, urges the development of camera control systems that enable the application of aesthetic rules and conventions from visual media such as film and television. One of the most important problems in cinematography is that of composition, the precise placement of elements in the shot. Researchers have already considered this problem, but mainly focused on basic compositional properties like size and framing. In this paper, we present a camera system that automatically configures the camera in order to satisfy advanced compositional rules. We have selected a number of those rules and specified rating functions for them; using optimisation we then find the best possible camera configuration. Finally, for better results, we use image processing methods to rate the satisfaction of the rules in the shot.
1 Introduction
As the field of computer animation has evolved, camera control has developed from the simple process of tracking an object to also tackling the aesthetic problems a cinematographer faces in reality. As Giors noted, the camera is the window through which the spectator interacts with the simulated world [7]. This means that not only the content of screen space but also its aesthetic qualities become very important, making composition a very important aspect of camera control. For practical purposes, we divide the problem of composition into two levels: the basic level, in which simple properties like position, size, and framing are considered, and the advanced level, in which more generic aesthetic rules from photography and cinematography are considered. The rule of thirds is a good example of advanced-level composition.

We are developing a camera system that automatically adjusts a camera in order to satisfy certain compositional rules gleaned from the photography and cinematography literature, namely the rule of thirds, diagonal dominance, visual balance, and depth of field. We implement rating functions for these rules and use optimisation to find the best possible camera configuration. The rating functions depend on image processing rather than geometrical approximation as implemented in most systems. Though this has the disadvantage of not being real-time, it has the advantage of being precise, which is a requirement in composition.
We start by discussing the most relevant research in this area, focussing on approaches to composition in different applications. Then, we explain the composition rules we have selected and how to rate them. Finally, we demonstrate the use of our system in rendering a scene from a well known film.
2 Background
Many previous approaches have targeted composition, but most of them only handled basic compositional rules. Olivier et al. [14], in their CamPlan, utilised a set of relative and absolute composition properties to be applied to screen objects. All the properties realised in CamPlan are basic. Burelli et al. [4] defined a language to control the camera based on visual properties such as framing. They start by extracting feasible volumes according to some of these properties and then search inside them using particle swarm optimisation. Like CamPlan, this approach lacks support for advanced composition rules.

Another limitation in previous approaches is the use of geometrical approximations of objects for faster computation. In his photographic composition assistant, Bares [2] improves composition by applying transformations to the camera to shift and resize screen elements. However, the system depends on approximate bounding boxes, which are often not sufficiently accurate. To alleviate this problem, Ranon et al. [16] developed a system to accurately measure the satisfaction of visual properties and developed a language that enables the definition of different properties. The main shortcoming of their system is that it only evaluates the satisfaction of properties, rather than finding camera configurations satisfying them.

The final limitation that we are concerned about is that the methods used in previous work on composition either have limited applicability in computer graphics, or cannot be applied at all. For example, in Bares's assistant, depending on restricted camera transformations rather than a full space search limits the application of the method to improving camera configurations rather than finding camera configurations. In the digital photography world, Banerjee et al. [1] described a method to apply the rule of thirds to photographs by shifting the main subject(s) in focus so that it lies on the closest power corner. The main subject(s) is detected using a wide shutter aperture that blurs objects out of focus. Besides the limitation of the method to only one object and one rule, the method cannot be applied in computer graphics. Similarly, Gooch et al. [8] applied the rule of thirds to a 3D object by projecting its silhouette on screen and matching it with a template. Again, the method is limited to one object only. Liu et al. [12] addressed the problem of composition using cropping and retargeting. After extracting salient regions and prominent lines, they use particle swarm optimisation [10] to find the coordinates of the cropping that maximise the aesthetic value of an image based on the rule of thirds, diagonal dominance, and visual balance. Though it addresses advanced composition, it cannot be applied to computer graphics.
Our system is an attempt to address the main shortcomings of the approaches above, namely a) the lack of advanced composition rules, b) the use of approximate geometrical models, and finally c) the limited applicability.
3 Composition Rules
Although we implement basic composition rules in our system, our primary focus is on more advanced aesthetic conventions. We have explored some of the most important rules in the literature of photography and cinematography and selected the following for implementation because they have a clear impact on the results:

1. Rule of Thirds: The rule of thirds proposes that a useful starting point for any compositional grouping is to place the main subject of interest on any one of the four intersections (called the power corners) made by two equally spaced horizontal and two equally spaced vertical lines [17]. It also proposes that prominent lines of the shot should be aligned with those horizontal and vertical lines [9]. See figure 1a. Composition systems that have implemented this rule are [1, 2, 5, 8, 12].
2. Diagonal Dominance: The rule proposes that "diagonal arrangements of lines in a composition produces greater impression of vitality than either vertical or horizontal lines" [17]. For example, in figure 1b, the table is placed along the diagonal of the frame. A system that has implemented this rule is [12].
3. Visual Balance: The visual balance rule states that for an equilibrium state to be achieved, the visual elements should be distributed over the frame [17]. For example, in figure 1c, the poster in the top-left corner balances the weight of the man, the bottle, the cups, and other elements. Systems that have implemented this rule are [2, 12, 13].
4. Depth of Field: The depth of field rule is used to draw attention to the main subject of a scene by controlling the parameters of the camera lens to keep the main subject sharp while blurring other elements. See figure 1d.

Besides the advanced composition rules, it is also important to have a set of basic rules to help in controlling some of the elements of the frame. For this, we have implemented the following basic composition rules:

1. Framing Rule: This rule specifies that the frame surrounding a screen element should not go beyond the specified frame. This rule is useful when an element needs to be placed in a certain region of the frame. See figure 1e. Other systems that have implemented this rule are [3, 4, 11, 14].
2. Visibility Rule: This rule specifies that a minimum/maximum percentage of an element should be visible. For example, a case like that of figure 1f, in which showing a character causes another character to be partially in view, can be avoided by applying this rule to the character on the right. Other systems that have implemented this rule are [3, 4, 14].
3. Position Rule: This rule specifies that the centre of mass of an element should be placed at a certain position on the screen. Other systems that have implemented this rule are [4, 14].
4. Size Rule: This rule specifies that the size of a certain element should not be smaller than a certain minimum or larger than a certain maximum (specified by the user of the system). This rule is mainly useful to control the size of an element to ensure that its size reflects its importance in the frame. Other systems that have implemented this rule are [3, 4, 14].

As different compositions need different rules, the rules are specified manually. All rules take one or more object IDs as a parameter, plus further parameters depending on the rule's requirements. We refer the reader to section 5 for practical examples.
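One possible way to represent such rules programmatically is sketched below; the interface is hypothetical and only illustrates that every rule carries its target object IDs, a weight, and a rating computed from shot statistics:

```java
import java.util.List;

/** Hypothetical interface for a composition rule rated against a rendered shot. */
interface CompositionRule {
    /** IDs of the scene objects this rule applies to. */
    List<String> objectIds();

    /** Relative importance of the rule when combining ratings. */
    double weight();

    /** Rating in [0,1] computed from per-object shot measurements (masks, centroids, frames). */
    double rate(ShotStatistics shot);
}

/** Placeholder for the per-object measurements extracted from the rendered shot. */
interface ShotStatistics { }
```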
Fig. 1. Images illustrating the composition rules: (a) rule of thirds, (b) diagonal dominance, (c) visual balance, (d) depth of field, (e) framing rule, (f) visibility rule
4 Rules Rating
The camera systems of many applications depend on methods specific to the geometry of that application only. Our system, however, is scene-independent, with the ability to impose many rules on many objects, which makes the problem strongly non-linear. This, along with the aim of producing aesthetically maximal results, suggests rating shots according to the satisfaction of the rules and using optimisation to solve for the best possible camera configuration. We found the optimisation method used by Burelli et al. [4] and Liu et al. [12], particle swarm optimisation, to be very efficient, so we use it. The rating of shots is described in this section.
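For concreteness, a standard particle swarm optimiser over a real-valued camera parameter vector (position, orientation, lens settings) is sketched below. The update rules are the textbook ones; the parameter encoding and the fitness function, which would return the combined rule rating of the rendered shot, are placeholders rather than the paper's implementation:

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Minimal particle swarm optimiser over a real-valued camera parameter vector (illustrative). */
class CameraPso {
    static double[] optimise(int dim, int particles, int iterations,
                             double[] lo, double[] hi,
                             ToDoubleFunction<double[]> fitness) {
        Random rnd = new Random();
        double w = 0.72, c1 = 1.5, c2 = 1.5;                 // common PSO coefficients
        double[][] x = new double[particles][dim];
        double[][] v = new double[particles][dim];
        double[][] pBest = new double[particles][dim];
        double[] pBestFit = new double[particles];
        double[] gBest = null;
        double gBestFit = Double.NEGATIVE_INFINITY;

        for (int p = 0; p < particles; p++) {                // random initialisation inside the bounds
            for (int d = 0; d < dim; d++) {
                x[p][d] = lo[d] + rnd.nextDouble() * (hi[d] - lo[d]);
                v[p][d] = 0.0;
            }
            pBest[p] = x[p].clone();
            pBestFit[p] = fitness.applyAsDouble(x[p]);       // e.g. combined rating of all rules
            if (pBestFit[p] > gBestFit) { gBestFit = pBestFit[p]; gBest = pBest[p].clone(); }
        }
        for (int it = 0; it < iterations; it++) {
            for (int p = 0; p < particles; p++) {
                for (int d = 0; d < dim; d++) {
                    v[p][d] = w * v[p][d]
                            + c1 * rnd.nextDouble() * (pBest[p][d] - x[p][d])
                            + c2 * rnd.nextDouble() * (gBest[d] - x[p][d]);
                    x[p][d] = Math.min(Math.max(x[p][d] + v[p][d], lo[d]), hi[d]);
                }
                double f = fitness.applyAsDouble(x[p]);
                if (f > pBestFit[p]) { pBestFit[p] = f; pBest[p] = x[p].clone(); }
                if (f > gBestFit)    { gBestFit = f; gBest = x[p].clone(); }
            }
        }
        return gBest;   // best camera parameter vector found
    }
}
```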
4.1 Rendering Approach
Many approaches use geometrical approximations (e.g., bounding boxes) of scene objects to efficiently rate rules via closed-form mathematical expressions [2, 4]. The downside of this approach is its inaccuracy. To address this downside, we use a rendering approach: we render the scene objects to an offscreen image and then process the resulting image to rate the satisfaction of the rules for a certain camera configuration. The downside of the rendering approach is speed, because of the time needed to render objects and process the resulting images. Since composition rules apply to certain objects only, which we call ruled objects, we only render those objects and process the resulting image.

However, one problem with image processing is the inaccuracy, difficulty, and cost of recognising object extents in an image. Since, for the rules we selected, the most important aspect is the region occupied by an object rather than its colours, we can safely avoid these problems by rendering the objects with a unique, distinct colour for each. Moreover, since the same pixel might be occupied by more than one object, the colour we use for each object occupies only one bit of the RGBA pixel; then, using additive blending, we can have as many as 32 objects occupying the same pixel. The problem then comes down to comparing each pixel in the rendered image against the objects' colours. Figures 2a and 2b illustrate the modified rendering process.

An important issue to consider in the rendering method is that the resulting image tells nothing about the parts of an object which are out of view, making the rating of some rules incorrect, e.g., the visibility rule. To solve this problem we use a field of view wider than the original, such that the original view covers only the rectangle having corners (25%, 25%) and (75%, 75%), rather than the whole screen. This way we know which parts of objects are visible and which are invisible. This is illustrated in figure 2b, in which the white rectangle represents the separator between the visible and invisible areas.
Fig. 2. (a) Normal rendering; (b) modified rendering. To make image processing easier, we only render ruled objects and use a different colour for each object. Furthermore, we bring the pixels resulting from the rendering towards the centre of the image so that we can process pixels which are originally out of view.
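With one bit of the RGBA value per ruled object, reading back which objects cover each pixel is a matter of iterating over the set bits. A minimal sketch, assuming the render target is read back with a zero background and the per-object bit mask stored directly in the 32-bit pixel value:

```java
import java.awt.image.BufferedImage;

/** Sketch: count per-object pixel coverage when each ruled object owns one bit of the pixel value. */
class ObjectMaskDecoder {
    /** Returns, for up to 32 objects, the number of pixels covered by each object. */
    static long[] pixelCounts(BufferedImage shot) {
        long[] counts = new long[32];
        for (int y = 0; y < shot.getHeight(); y++) {
            for (int x = 0; x < shot.getWidth(); x++) {
                int mask = shot.getRGB(x, y);          // assumed: packed per-object bit mask, zero background
                while (mask != 0) {
                    int bit = Integer.numberOfTrailingZeros(mask);
                    counts[bit]++;
                    mask &= mask - 1;                  // clear the lowest set bit
                }
            }
        }
        return counts;
    }
}
```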
4.2 Rating Functions
In our system, each rule usually has several factors determining its rating value. For example, the rating of the diagonal dominance rule is determined by two factors: the angle between the prominent line of an object and the diagonal lines of the screen, and the distance between the prominent line and the diagonal lines. For any rule, we rate each of its factors and then find the mean of the ratings. To get the best results from particle swarm optimisation, we suggest the following criteria for the rating of each factor:

1. While the function must evaluate to 1 when the factor is fully satisfied, it must not drop to 0, otherwise the method will be merely an undirected random search. However, after some point, which we call the drop-off point, the function should drop heavily to indicate the dissatisfaction of the factor.
2. The function must have some tolerance near the best value, at which the function will still have the value 1. This gives more flexibility to the solver.

We suggest using a Gaussian function as a good match for these criteria. For each factor, we decide a best value, a tolerance, and a drop-off value, and then use the Gaussian function as follows:

$$FR = e^{-\left(\frac{\Delta - \Delta_t}{\Delta_{do} - \Delta_t}\right)^2} \qquad (1)$$
where FR is the factor rating, Δ is the difference between the current value of the factor and the best value, Δt is the tolerance of the factor, and Δdo is the difference between the drop-off value and the best value. Having the rating of each factor of a rule, we use the geometric mean to find the rule rating, since it has the property of dropping heavily if one of the factors drops heavily. However, the combined rating of all rules can be calculated using the arithmetic or geometric mean (specified by the user), since sometimes it is not necessary to satisfy all rules. Moreover, weights can be specified for each rule according to its importance in the configuration.

The data in table 1 gives the parameters of the different factors of all the composition rules we support in our system. Table 2 lists some of the symbols used in table 1. Crucially, we separate the rule of thirds into two rules, one for the power corners and the other for the horizontal and vertical lines. This is because they usually apply to different elements of the screen.
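In code, the factor rating and its combination into a rule rating could be written as follows (a sketch based on our reconstruction of equation (1), with the rating clamped to 1 inside the tolerance as the criteria above require):

```java
/** Sketch of the Gaussian factor rating and its combination into a rule rating. */
class RuleRating {
    /** delta = |current - best|; deltaT = tolerance; deltaDo = |drop-off - best|. */
    static double factorRating(double delta, double deltaT, double deltaDo) {
        if (delta <= deltaT) return 1.0;                     // within tolerance: fully satisfied
        double r = (delta - deltaT) / (deltaDo - deltaT);
        return Math.exp(-r * r);                             // Gaussian fall-off past the tolerance
    }

    /** Geometric mean: the rule rating collapses if any single factor collapses. */
    static double ruleRating(double... factorRatings) {
        double product = 1.0;
        for (double f : factorRatings) product *= f;
        return Math.pow(product, 1.0 / factorRatings.length);
    }
}
```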
4.3 Shot Processing
To rate the satisfaction of a rule, we need to process the shot after rendering. The ratings of the framing rule and the size rule depend on the frame surrounding the object on screen, which can easily be found. The rating of the visibility rule depends on the number of pixels in the visible and invisible areas, which is also straightforward to calculate. For visual balance and the power corners in the rule of thirds, the rating depends on the centroid of the object, which is again straightforward to calculate.
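Continuing the CPU emulation sketched earlier, the following shows how the per-object statistics used by these rules (visible and invisible pixel counts, and the on-screen centroid) might be read off the bit-mask render. The 25%–75% inner rectangle and the [-1, 1] screen mapping follow the set-up described above, while the helper names are assumptions.

```python
import numpy as np

def shot_statistics(mask_image, object_id):
    """Visible/invisible pixel counts and the centroid of one ruled object,
    with the centroid mapped from the inner (visible) rectangle to [-1, 1]."""
    H, W = mask_image.shape
    rows, cols = np.nonzero((mask_image & np.uint32(1 << object_id)) != 0)
    if rows.size == 0:
        return 0, 0, None

    # The widened field of view maps the real view onto the central rectangle
    # spanning 25%..75% of the off-screen image in both axes.
    inside = ((rows >= H // 4) & (rows < 3 * H // 4) &
              (cols >= W // 4) & (cols < 3 * W // 4))
    visible = int(inside.sum())
    invisible = int(rows.size - visible)
    if visible == 0:
        return visible, invisible, None

    # Centroid of the visible part in screen coordinates, (-1, -1) bottom-left.
    cx = (cols[inside].mean() - W / 2) / (W / 4)
    cy = -(rows[inside].mean() - H / 2) / (H / 4)
    return visible, invisible, (cx, cy)
```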
Table 1. The factors the camera solver depends on to rate the rules

Factor | Best Value | Δt | Δdo | Explanation
Rule of Thirds (Corners)
Horz. Distance to Corner | 0 | HD/20 | HD | The horizontal distance between the centre of mass of the object and the closest corner.
Vert. Distance to Corner | 0 | HD/20 | HD | The vertical distance between the centre of mass of the object and the closest corner.
Rule of Thirds (Lines)
Line Angle | 0 or 90 | 5 | 30 | The angle of the prominent line of the object.
Diagonal Dominance
Line Angle | 45 | 15 | 30 | The angle between the prominent line of the object and the diagonal lines.
Line Distance | 0 | 0.25 | 1 | The distance between the prominent line of the object and the diagonal lines.
Visual Balance
Horz. Centre | 0 | 0.1 | 0.5 | The horizontal component of the centre of mass of all the objects of the rule.
Vert. Centre | 0 | 0.1 | 0.5 | The vertical component of the centre of mass of all the objects of the rule.
Framing Rule
Left Outside | 0 | 5% FW | 25% FW | The width of the part of the object which is beyond the left border of the framing specified by the rule.
Bottom Outside | 0 | 5% FH | 25% FH | The height of the part of the object which is beyond the lower border of the framing specified by the rule.
Top Outside | 0 | 5% FH | 25% FH | The height of the part of the object which is beyond the upper border of the framing specified by the rule.
Right Outside | 0 | 5% FW | 25% FW | The width of the part of the object which is beyond the right border of the framing specified by the rule.
Visibility Rule
Beyond Min. Horz. Visibility | 0 | 5% AOW | 25% AOW | The amount the horizontal visibility of the object is beyond the minimum horizontal visibility.
Beyond Min. Vert. Visibility | 0 | 5% AOH | 25% AOH | The amount the vertical visibility of the object is beyond the minimum vertical visibility.
Beyond Max. Horz. Visibility | 0 | 5% AOW | 25% AOW | The amount the horizontal visibility of the object is beyond the maximum horizontal visibility.
Beyond Max. Vert. Visibility | 0 | 5% AOH | 25% AOH | The amount the vertical visibility of the object is beyond the maximum vertical visibility.
Position Rule
Horz. Distance | 0 | 0.01 | 0.25 | The horizontal distance between the centre of mass of the object and the position specified by the rule.
Vert. Distance | 0 | 0.01 | 0.25 | The vertical distance between the centre of mass of the object and the position specified by the rule.
Size Rule
Beyond Min. Width | 0 | 5% AOW | 25% AOW | The amount the width of the object is beyond the minimum width.
Beyond Min. Height | 0 | 5% AOH | 25% AOH | The amount the height of the object is beyond the minimum height.
Beyond Max. Width | 0 | 5% AOW | 25% AOW | The amount the width of the object is beyond the maximum width.
Beyond Max. Height | 0 | 5% AOH | 25% AOH | The amount the height of the object is beyond the maximum height.
Table 2. Symbols and abbreviations used in the factor calculations

Symbol | Description
HD | Half the distance between the power corners (i.e. 0.33333).
FW | Width of the frame used by the framing rule.
FH | Height of the frame used by the framing rule.
AOW | Average width of the projection on screen of the object being considered by a rule.
AOH | Average height of the projection on screen of the object being considered by a rule.
As for diagonal dominance and the horizontal and vertical lines in the rule of thirds, the rating depends on the prominent line of the ruled object, which is found by applying linear regression to the pixels of the object to find the best-fitting line. The standard linear regression method works by minimising the vertical distance between the line and the points. This is mainly useful when the aim of the regression is to minimise an error represented by the vertical distance between the line and the points. In our case, however, we want to find the line that fits the points best rather than the line that minimises this error, so we use a modified linear regression called perpendicular linear regression [18]. The method starts by finding the centroid of all the points and then finding the angle of the line passing through the centroid which minimises the perpendicular distance. The angle is found according to the following equations:

$\tan(\theta) = -\frac{A}{2} \pm \sqrt{\left(\frac{A}{2}\right)^2 + 1}$   (2)

where

$A = \frac{\sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} y_i^2}{\sum_{i=1}^{n} x_i y_i}$   (3)

and (x_i, y_i) are the coordinates of the i-th pixel of the object. Figure 3a illustrates the prominent line of the table in Figure 1b, and Figure 3b illustrates the same concept for a character to be positioned according to the rule of thirds.
(a) Diagonal Dominance
(b) Rule of Thirds
Fig. 3. Illustrating how the prominent line of an object is found using perpendicular linear regression. The line in blue is the prominent line extracted from the object in dark red.
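The perpendicular regression of Eqs. (2)–(3) can be sketched as follows. The code selects, of the two candidate angles, the one that actually minimises the summed perpendicular distances; the handling of a near-zero denominator is our own assumption.

```python
import numpy as np

def prominent_line_angle(xs, ys):
    """Angle (degrees) of the perpendicular least-squares line through the
    centroid of the object's pixels (Eqs. 2-3)."""
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    x -= x.mean()                      # work in centroid-relative coordinates
    y -= y.mean()
    sxy = float(np.sum(x * y))
    if abs(sxy) < 1e-9:                # pixels spread along one axis only
        return 0.0 if np.sum(x * x) >= np.sum(y * y) else 90.0
    A = (float(np.sum(x * x)) - float(np.sum(y * y))) / sxy
    candidates = (-A / 2 + np.sqrt((A / 2) ** 2 + 1),
                  -A / 2 - np.sqrt((A / 2) ** 2 + 1))

    def perpendicular_error(m):
        return float(np.sum((y - m * x) ** 2) / (1 + m * m))

    best_slope = min(candidates, key=perpendicular_error)
    return float(np.degrees(np.arctan(best_slope)))
```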
Finally, the rating of the depth of field rule is always 1, as the adjustment of the camera lens cannot be decided before the position, orientation, and field of
view of the camera are found. Once they are found, the depth of field of the camera is adjusted according to the size of the object and its distance from the camera.
5 Evaluation
As a practical demonstration of our system, we decided to render a scene from Michael Radford's Nineteen Eighty-Four. The scene we selected is set in a canteen and revolves around four characters who are important to the plot: Smith, Syme, Parsons, and Julia. In the scene, the protagonist, Smith, is engaged in conversation with Syme and Parsons at the lunch table, while Julia is watching Smith from across the room. An additional secondary character participates briefly in one of the conversations; we call him OPM (an abbreviation of Outer Party Member). The scene was created in 3ds Max and exported to OGRE. We used our system to automatically find camera configurations that produce shots similar to those of the original scene, and then used those camera configurations to generate a video rendering of the scene, which accompanies this paper. Table 3 lists some of the shots that were generated and the rules used to generate each shot. For each shot, we repeated the test 10 times and calculated the mean and standard deviation of the achieved rating, the processing time, and the number of iterations until a solution is found; these are included in the second column of the table, with the standard deviation given in brackets. Our screen coordinates range from (-1, -1) at the bottom-left corner to (1, 1) at the top-right corner. The system we ran the tests on has a 2.66 GHz Intel Core 2 Quad Q9400 processor and an NVIDIA GeForce 9800 GT video card with 512 MB of video memory. For the rating, scene objects are rendered to a 128×128 off-screen texture. It is possible to use smaller texture sizes to reduce processing time, but this decreases the accuracy of the results. We configured the PSO to use 36 particles in each iteration, each particle representing a camera configuration consisting of a 3D position, two orientation angles, and an angle for the field of view. Initially, each particle's position is randomised within boxes manually specified around the objects of interest; the orientation and field-of-view angles are also randomised within a manually set range. The algorithm stops when the rating reaches 98% or after 100 iterations. As the table shows, the standard deviation of the rating is either zero or negligible, which shows that the results obtained by the system are stable. Another thing to notice is that the processing time varies widely among the different shots because different shots use different rules. Also, the standard deviation of the time is relatively large because the number of iterations required before a solution is reached varies widely depending on the initial random configurations. In the fourth configuration specifically, the number of iterations is zero: we fixed the position of the camera at Syme's eyes to show the view from his perspective and only allowed the camera pitch and field of view to be adjusted, so the initial step was enough to find a satisfactory solution. Finally, the rating of the last shot is relatively low because the rules used cannot be satisfied together.
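A bare-bones version of the particle swarm loop configured above might look as follows. The 36 particles, the 6-dimensional camera vector (3D position, two orientation angles, field of view) and the 98%/100-iteration stopping rule come from the description above, whereas the inertia and acceleration constants (w, c1, c2) and the function name rate_shot are assumptions made for the sketch.

```python
import numpy as np

def solve_camera(rate_shot, low, high, n_particles=36, max_iter=100,
                 target=0.98, w=0.7, c1=1.5, c2=1.5, seed=None):
    """PSO over camera configurations bounded by the per-dimension arrays
    `low` and `high`; `rate_shot(config)` is the image-based shot rating."""
    rng = np.random.default_rng(seed)
    dim = len(low)
    pos = rng.uniform(low, high, size=(n_particles, dim))   # random start boxes
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([rate_shot(p) for p in pos])
    gbest = pbest[np.argmax(pbest_val)].copy()

    for _ in range(max_iter):
        if pbest_val.max() >= target:                        # 98% is good enough
            break
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, low, high)
        vals = np.array([rate_shot(p) for p in pos])
        improved = vals > pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        gbest = pbest[np.argmax(pbest_val)].copy()

    return gbest, float(pbest_val.max())
```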
Table 3. The list of shots used in the rendering of a scene from Nineteen Eighty-Four. For each shot, we repeated the test 10 times and calculated the average rating the system could achieve and the average time spent in the solving process. The numbers in brackets are the standard deviations over the 10 tests.

Shot | Rating, Time (s), Iterations | Rules
On Smith while listening to screen | 72.4% (0.351%), 15.188 (0.11), 100 (0) | Visibility(Syme, MaxVisibility:0%); Framing(Smith#Head, MinX:-1.0, MinY:0.0, MaxX:0.5); Visibility(Screen, MinVisibility:100%); Size(Screen, MinWidth:2.5, MinHeight:2.5)
Finally, the number of particles used has an important effect on the result. On the one hand, if we reduce the number of particles, the achieved rating decreases; on the other hand, if we increase the number of particles, the processing time increases heavily with little gain in rating. For more information about tuning particle swarm optimisation we refer the reader to [6, 15].
6 Conclusion and Future Work
We have implemented a camera system for advanced composition. The system is based on particle swarm optimisation, which proved to be very successful in finding solutions in high-dimensional search spaces, a necessity for camera control. To get accurate results we rated shots based on image processing, as opposed to geometric approximation. The main shortcoming of our approach is the time it takes to find a solution, which limits the system to offline processing. The other shortcoming is that the occlusion problem is not considered here, as it requires full rendering of the scene, which is an expensive operation. Future work will focus on addressing these two shortcomings. We are investigating the possibility of implementing the image processing computations on the GPU rather than the CPU. This would heavily reduce the processing time, as the bottleneck of our system is transferring the image data from GPU memory and processing it on the CPU.
Acknowledgements This research is part of the "Integrating Research in Interactive Storytelling (IRIS)" project, and we would like to thank the European Commission for funding it. We would also like to extend our thanks to Zaid Abdulla, from the American University of Iraq Sulaimani (AUI-S), for his invaluable help with the technical aspects of the research. Finally, we would like to thank the reviewers of this paper for their comments.
References 1. Banerjee, S., Evans, B.L.: Unsupervised automation of photographic composition rules in digital still cameras. In: Proceedings of SPIE Conference on Sensors, Color, Cameras, and Systems for Digital Photography, vol. 5301, pp. 364–373 (2004) 2. Bares, W.: A photographic composition assistant for intelligent virtual 3d camera systems. In: Butz, A., Fisher, B., Kr¨ uger, A., Olivier, P. (eds.) SG 2006. LNCS, vol. 4073, pp. 172–183. Springer, Heidelberg (2006) 3. Bares, W., McDermott, S., Boudreaux, C., Thainimit, S.: Virtual 3D Camera Composition from Frame Constraints. In: Proceedings of the Eighth ACM International Conference on Multimedia, pp. 177–186. ACM, New York (2000) 4. Burelli, P., Di Gaspero, L., Ermetici, A., Ranon, R.: Virtual camera composition with particle swarm optimization. In: Butz, A., Fisher, B., Kr¨ uger, A., Olivier, P., Christie, M. (eds.) SG 2008. LNCS, vol. 5166, pp. 130–141. Springer, Heidelberg (2008)
5. Byers, Z., Dixon, M., Smart, W.D., Grimm, C.M.: Say Cheese! Experiences with a Robot Photographer. AI Magazine 25(3) 37 (2004) 6. Carlisle, A., Dozier, G.: An off-the-shelf pso. In: Proceedings of the Workshop on Particle Swarm Optimization, vol. 1, pp. 1–6 (2001) 7. Giors, J.: The full spectrum warrior camera system (2004) 8. Gooch, B., Reinhard, E., Moulding, C., Shirley, P.: Artistic composition for image creation. In: Eurographics Workshop on Rendering, pp. 83–88 (2001) 9. Grill, T., Scanlon, M.: Photographic Composition. Amphoto Books (1990) 10. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, pp. 1942–1948 (1995) 11. Lino, C., Christie, M., Lamarche, F., Schofield, G., Olivier, P.: A Real-time Cinematography System for Interactive 3D Environments. In: Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA) (July 2010) 12. Liu, L., Chen, R., Wolf, L., Cohen-Or, D.: Optimizing photo composition. Computer Graphic Forum (Proceedings of Eurographics) 29(2), 469–478 (2010) 13. Lok, S., Feiner, S., Ngai, G.: Evaluation of Visual Balance for Automated Layout. In: Proceedings of the 9th International Conference on Intelligent User Interfaces, pp. 101–108. ACM, New York (2004) 14. Olivier, P., Halper, N., Pickering, J., Luna, P.: Visual composition as optimisation. In: AISB Symposium on AI and Creativity in Entertainment and Visual Art, pp. 22–30 (1999) 15. Pedersen, M.E.H.: Tuning & Simplifying Heuristical Optimization (PhD Thesis). sl: School of Engineering Sciences. Ph.D. thesis, University of Southampton, United Kingdom (2010) 16. Ranon, R., Christie, M., Urli, T.: Accurately measuring the satisfaction of visual properties in virtual camera control. In: Taylor, R., Boulanger, P., Krger, A., Olivier, P. (eds.) Smart Graphics. LNCS, vol. 6133, pp. 91–102. Springer, Heidelberg (2010) 17. Ward, P.: Picture Composition for Film and Television. Focal Press (2003) 18. Weisstein, E.W.: Least squares fitting–perpendicular offsets (2010), http:// mathworld.wolfram.com/LeastSquaresFittingPerpendicularOffsets.html
Towards Adaptive Virtual Camera Control in Computer Games
Paolo Burelli and Georgios N. Yannakakis
Center for Computer Games Research, IT University of Copenhagen, Rued Langgaards Vej 7, 2300 Copenhagen, Denmark
{pabu,yannakakis}@itu.dk
Abstract. Automatic camera control aims to define a framework to control virtual camera movements in dynamic and unpredictable virtual environments while ensuring a set of desired visual properties. We investigate the relationship between camera placement and playing behaviour in games and build a user model of the camera behaviour that can be used to control camera movements based on player preferences. For this purpose, we collect eye gaze, camera and game-play data from subjects playing a 3D platform game, we cluster gaze and camera information to identify camera behaviour profiles and we employ machine learning to build predictive models of the virtual camera behaviour. The performance of the models on unseen data reveals accuracies above 70% for all the player behaviour types identified. The characteristics of the generated models, their limits and their use for creating adaptive automatic camera control in games are discussed.
1 Introduction
In 3D computer games, a virtual camera defines the player's point of view on the virtual environment and mediates her visual perceptions. Therefore, a wide range of aspects of the player experience, such as interaction and storytelling, are heavily affected by the virtual camera's placement and motion [13]. In games and other 3D applications the camera is either manually controlled by the player during her interaction or placed and animated a priori by a designer. However, manual control of the camera often proves challenging for the player, as it increases the complexity of the interaction, whereas statically predefined cameras fail to cope with dynamic virtual environments. These limits have driven research towards the identification of a new camera control paradigm: automatic camera control. Within this framework the camera is controlled using high-level and environment-independent requirements, such as the visibility of a particular object or the size of that object's projected image on the screen. A software module, commonly referred to as a camera controller, dynamically and efficiently infers the ideal camera position and motion from these requirements and the current game state.
The process of finding the virtual camera configuration that best fits a set of requirements has been widely investigated [8]. By contrast, the requirements themselves have received little attention despite their impact on player experience [11]. Virtual camera parameters are commonly hand-crafted by game designers and do not consider player preferences. Including the player in the definition of these parameters requires the construction of a model of the relationship between camera motion and player experience [11]. In this paper we investigate player preferences concerning virtual camera placement and animation, we propose a model of the relationship between camera behaviour, player behaviour and game-play, and we evaluate the performance of this model. For this purpose, data on player gaze and virtual camera motion is collected through a game experiment and used to identify and describe the players' camera behaviours. In the data-collection experiment the participants play a three-dimensional platformer game featuring all the stereotypical aspects of the genre's mechanics.
2 Background
Early studies on camera control focused on the definition of camera properties and investigated the mapping between input devices and camera movement [16]. The main research focus in the field rapidly shifted towards automatic camera control, since direct control of the camera has been shown to be problematic for the user [9]. Several different techniques have been proposed for automatic camera control, based on a variety of mathematical models; however, the vast majority of approaches model the problem as a constrained optimisation problem [8]. These approaches allow the designer to define a set of requirements on the frames that the camera should produce and on the camera motion. Depending on the approach, the controller positions and animates one or more virtual cameras in an attempt to generate a shot or a sequence of shots that satisfy the predefined requirements. Requirements for the camera include constraints on camera motion (such as a speed limit), constraints on camera position (such as a maximum height), or constraints on the rendered frames. The last type of camera requirement, introduced by Bares et al. [2], defines required properties of the frames rendered by the camera, such as subject inclusion or subject position within the frame.
2.1 Camera Profiles
A large volume of research on automatic camera control is dedicated to the analysis of robust and time-efficient techniques to place and move the camera so as to satisfy a set of given requirements. These sets of requirements, usually referred to as camera profiles [5], define the set of constraints and the objective function that need to be optimised by the automatic camera control system. Christianson et al. [7] introduced a language that permits the definition of shot sequences (idioms) with the desired timings and subjects. Other researchers extended Christianson's work by connecting shot plans with camera constraints [1],
or by introducing advanced planning techniques to support interactive storytelling [6,10]. While the aforementioned approaches primarily address issues related to the manual design of camera behaviours for dynamic and interactive environments, other researchers have investigated approaches which do not require the contribution of a designer [3]. Yannakakis et al. [17] studied the impact of camera viewpoints on player experience and built a model to predict this impact. That study demonstrates the existence of a relationship between player emotions, physiological signals and camera parameters. In the light of these results, Picardi et al. [12] investigated the relationship between camera and player behaviour in a 3D platform game and demonstrated the existence of a significant relationship between the player's camera preferences, measured using a gaze tracker, and a set of game-play features such as the number of collected items or the number of jumps performed. In this paper we extend this work by analysing the interplay between past player behaviour and camera control to automate the generation and selection of virtual camera parameters.
2.2 Gaze Interaction in Games
Eye movements can be recognised and categorised according to their speed, duration and direction [18]. In this paper, we focus on fixations, saccades and smooth pursuits. A fixation is an eye movement that occurs when a subject focuses on a static element on the screen; a saccade occurs when a subject rapidly switches her attention from one point to another; and a smooth pursuit takes place when a subject is looking at a dynamic scene and following a moving object. Sundstedt et al. [15] conducted an experimental study to analyse players' gaze behaviour during a maze puzzle solving game. The results of their experiment show that gaze movements, such as fixations, are mainly influenced by the game task. They conclude that the direct use of eye tracking during the design phase of a game can be extremely valuable for understanding where players focus their attention in relation to the goal of the game. Bernhard et al. [4] performed a similar experiment using a three-dimensional first-person shooting game, in which the objects observed by the players were analysed to infer the player's level of attention. We are inspired by the experiment of Bernhard et al. [4]; unlike that study, however, we analyse the player's gaze patterns to model the player's camera movements and we model the relationship between camera behaviour, game-play characteristics and player behaviour.
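The paper does not describe how raw gaze samples are segmented into these movement types; a common way to separate them is by angular velocity thresholds, as in the illustrative (and entirely assumed) sketch below.

```python
def classify_gaze(samples, fixation_thresh=30.0, saccade_thresh=100.0):
    """Label the interval between consecutive gaze samples as fixation,
    smooth pursuit or saccade from its angular velocity in degrees/second.
    samples: list of (time_s, x_deg, y_deg); thresholds are illustrative."""
    labels = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        velocity = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / (t1 - t0)
        if velocity < fixation_thresh:
            labels.append("fixation")
        elif velocity < saccade_thresh:
            labels.append("smooth pursuit")
        else:
            labels.append("saccade")
    return labels
```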
3 The Game
A three-dimensional platform game has been designed and developed as a testbed for this study¹. The game features an alien-like avatar (see Fig. 1a) in a futuristic environment floating in open space. The player controls the avatar and
¹ The game is based on Lerpz, a tutorial game by Unity Technologies — http://www.unity3d.com
(a) Avatar (b) Platform (c) Fuel Cell (d) Copper (e) Re-spawn Point (f) Jump Pad
Fig. 1. Main components of the game
the camera using keyboard and mouse. Avatar movements, defined in camera-relative space, are controlled using the arrow keys, and the jump and hit actions are activated by pressing the left and right mouse buttons respectively. The camera orbits around the avatar at a fixed distance; the player can change the distance using the mouse wheel and can rotate the camera around the avatar by moving the mouse. The environment through which the avatar moves is composed of floating platforms. Each platform can be connected to another platform directly or through a bridge, forming a cluster of platforms, or it can be completely isolated, in which case the avatar is required to jump to move from one platform to the other. Four main object types may appear on the platforms: fuel cells, coppers, re-spawn points and jump pads. Fuel cells (see Fig. 1c) are collectable items increasing the score of the player. Coppers (see Fig. 1d) are opponent non-player characters. Re-spawn points (see Fig. 1e) are small stands placed on some platforms that, when activated, act as the avatar's spawn point after he falls from a platform. Finally, jump pads (see Fig. 1f) are black and yellow striped areas which allow the player to perform a longer jump. The aim of the player is to cross the virtual world up to the last platform while collecting fuel cells and killing coppers to achieve the highest possible score. However, the player also needs to avoid falling from the platforms and losing too much time, as this negatively affects the final score. The game is preceded by a tutorial level that explains the game controls and gives an overview of the contribution of the game actions to the score. Moreover, during the tutorial, the player is walked through all the challenges she will come across in the game. The game is divided into a series of areas classified into three categories according to the game-play experience they offer: jump, fight and collection areas. Figure 2a shows a fight area where the main threat is posed by the opponent copper at the centre of the platform. The jump area depicted in Fig. 2b is composed of several small floating platforms; the player needs to make the avatar jump across all the platforms to complete the area. Figure 2c shows an area where the main task of the player is to collect the fuel cells placed around the platform. In total, the game comprises 34 areas: 14 collection areas, 11 fight areas and 9 jump areas.
(a) Fight area (b) Jump area (c) Collection area
Fig. 2. The three different area types met in the game
4 Experimental Methodology
Our experimental hypothesis is that the way a player controls the virtual camera depends on what actions she performs in the game and on how she performs them. We represent the virtual camera behaviour as the amount of time the player spends framing and observing different objects in the game environment while playing. This representation of behaviour is chosen over a direct model of the camera position and motion because it describes the behaviour in terms of the content visualised by the camera and is therefore independent of the absolute positions of the avatar, the camera and other objects. To get information about the objects observed by the player during the experiment, we used the Eyefollower² gaze tracker, which samples the player's gaze at a frequency of 120 Hz (60 Hz per eye). Twenty-nine subjects participated in the experiment; twenty-four were male and five were female, and the age of the participants ranged between 23 and 34 years. Statistical data about game-play behaviour, virtual camera movement and gaze position was collected for each participant. The collection of the data is synchronised to the Eyefollower sampling rate; therefore, both data from the game and from the gaze tracker are sampled 120 times per second. Each data sample contains information about the game-play, the position and orientation of the camera, the coordinates of the gaze position on the screen, and the number and type of objects around the avatar. The objects are classified into two categories: close and far. All the objects reachable by the avatar within the next action are labelled as close, otherwise as far. The data is logged only while the participants play the game; this phase is preceded, for each player, by the calibration of the gaze tracking system, a tutorial level and a demographics questionnaire.
5 Extracted Features from Data
The data collected for each game is sorted into three datasets according to the three area types described in Section 3. Each time a player completes an area, two types of statistical features are extracted from that area: game-play and camera
² Developed by LC Technologies, Inc. - www.eyegaze.com
related features. The features of the first type are the experiment's independent variables and encapsulate elements of the player's behaviour in the area. The features of the second type describe the camera behaviour for each platform and therefore define the experiment's dependent variables. The player's behaviour is defined by the actions the player performs in each area or, more precisely, by the consequences of these actions. Hence, the extracted features describe the interaction between the player and the game through the avatar's actions, rather than the sequences of pressed keys. For each area the following features have been extracted: the numbers of fuel cells collected, damage given, damage received, enemies killed, re-spawn points visited and jumps. The features are normalised to a range between 0 and 1 using a standard min-max normalisation. To model the camera behaviour, we analyse the content visualised by the camera instead of the camera's absolute position and rotation. The presence of a certain object on the screen, however, does not necessarily imply an intention of the player; e.g. the object might be on the screen only because it is close to an object the player is interested in. The available gaze data makes it possible to overcome this limitation since, using the gaze position, it is possible to determine which object is currently observed among the ones framed by the player. Therefore, we combine camera movements and gaze coordinates to identify the objects observed by the player at each frame, and we extract the following statistical features: the relative camera speed as well as the time spent observing the avatar, the fuel cells close to the avatar, the enemies close to the avatar, the re-spawn points close to the avatar, the jump pads close to the avatar, the platforms close to the avatar and the far objects. The seven features related to the time spent observing objects are calculated as the sum of the durations of the smooth pursuit and fixation movements of the eyes [18] during which the gaze position falls within an object's projected image; these values are normalised to [0, 1] by the completion time of each area. The first feature is the average speed S of the camera relative to the avatar, defined as S = (Dc − Da)/T, where Dc is the distance covered by the camera during the completion of an area, Da is the distance covered by the avatar and T is the time spent to complete the area. Each time the avatar leaves an area, the aforementioned features are logged for that area. The data is then cleaned of experimental noise by removing all records with a duration shorter than 3 seconds and those with no platforms, or no enemies and fuel cells, left. The remaining records are sorted into three separate groups according to the area type and stored in three datasets, containing 239 records for the collection areas, 378 records for the fight areas and 142 records for the jump areas.
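The per-area feature extraction described above reduces to a few divisions once the raw per-area quantities have been accumulated; the dictionary keys in this sketch are invented for illustration.

```python
def camera_behaviour_features(area_log):
    """Gaze-based camera features for one completed area: the relative camera
    speed S = (Dc - Da) / T and, per object category, the observed time
    (fixations + smooth pursuits on the object) divided by completion time T."""
    T = area_log["completion_time"]
    S = (area_log["camera_distance"] - area_log["avatar_distance"]) / T
    features = {"speed": S}
    for category, seconds in area_log["observed_seconds"].items():
        features[category] = seconds / T          # normalised to [0, 1]
    return features
```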
6 Camera Behaviour Modelling
To investigate and create a model of the relationship between camera behaviour and game-play, we analyse the collected data through two steps: identification
Table 1. Average camera behaviour features with the number of records of each cluster. Speed indicates the average camera speed with respect to the avatar. The remaining features express the time spent observing each object of a type in an area divided by the time spent completing the area. Highlighted in dark grey is the feature related to the main task of the area type. The features related to the other objects close to the avatar are highlighted in light grey.

Collection Areas (k = 2)
Records | Avatar | Fuel Cells | Jump Pads | Re-spawn Points | Far Objects | Speed
150 | 0.595 | 0.108 | 0.034 | 0.113 | 0.021 | 3.338
89 | 0.361 | 0.125 | 0.056 | 0.072 | 0.012 | 8.852

Fight Areas (k = 3)
Records | Avatar | Fuel Cells | Coppers | Jump Pads | Re-spawn Points | Far Objects | Speed
137 | 0.674 | 0.042 | 0.095 | 0.049 | 0.034 | 0.036 | 3.283
99 | 0.676 | 0.032 | 0.478 | 0.044 | 0.056 | 0.025 | 5.293
142 | 0.250 | 0.029 | 0.069 | 0.030 | 0.052 | 0.013 | 5.927

Jump Areas (k = 3)
Records | Avatar | Fuel Cells | Platforms | Far Objects | Speed
33 | 0.759 | 0.464 | 0.795 | 0.202 | 2.1293
80 | 0.736 | 0.166 | 0.658 | 0.059 | 2.7593
29 | 0.450 | 0.113 | 0.559 | 0.012 | 5.5854
and prediction. In the first step we use a clustering technique to extract the relevant camera behaviours and analyse their characteristics; then, in the second step, we build a model based on this categorisation that is able to predict the correct behaviour given a set of game-play data.
6.1 Behaviour Identification
The number of distinct camera behaviours, as well as their internal characteristics, can only partly be based on domain knowledge. One can infer camera behaviour profiles inspired by a theoretical framework of virtual cinematography [10] or, alternatively, follow an empirical approach (such as the one suggested here) to derive camera behaviour profiles directly from data. The few existing frameworks focus primarily on story-driven experiences with little or no interaction and are thus not applicable in our context. Therefore, we adopt a data-driven approach and employ the k-means clustering algorithm on the gaze-based extracted features for the purpose of retrieving the number and type of different camera behaviours. Unsupervised learning allows us to isolate the most significant groups of samples from each dataset. However, k-means requires as input the number of clusters k present in the data, for which it minimises the intra-cluster variance. To overcome this limitation, the algorithm is run with progressively higher k values, from 2 to 10, and the clusters generated at each run are evaluated using a set of five cluster validity indexes. The algorithm runs 50 times for each k and the run with the smallest within-cluster sum of squared errors is picked. Each selected
partition is evaluated against five validity indexes: Davies-Bouldin, Krzanowski-Lai, Calinski-Harabasz, Dunn and Hubert-Levin. The smallest k value that optimises at least three of the five validity measures is used for the clustering; the chosen k value is 2 for the collection areas and 3 for the fight and jump areas. As seen in Table 1, the camera behaviour is described with a different feature set for each area type. The features are selected to match the visual stimuli offered by each area; thus, only the features describing the observation of objects which are present in the area type are included in the set. Moreover, for each area type the features are sorted into five categories: camera speed and time spent observing the avatar, the primary task objects, other close objects and far objects. The primary task objects, highlighted in dark grey in Table 1, represent the time spent observing objects related to the main task of each area type; all the other objects are categorised as secondary. According to this feature categorisation it is possible to observe three main behaviour types: task focused, including the clusters spending more time observing the task-related objects; avatar focused, including the clusters mostly observing the avatar; and overview, which includes the clusters evenly observing all the objects in each area. For an in-depth description of the clusters obtained, the reader is referred to [12].
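The model-selection loop described above can be sketched with scikit-learn, whose KMeans already keeps the best of n_init restarts by sum of squared errors. Only two of the five validity indexes used in the paper (Davies-Bouldin and Calinski-Harabasz) ship with the library, so the remaining ones would have to be implemented separately; this snippet is an illustration rather than the authors' code.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

def cluster_candidates(X, k_range=range(2, 11), n_init=50, seed=0):
    """Run k-means for each candidate k (best of n_init restarts by SSE)
    and collect validity indexes to support the choice of k."""
    results = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=n_init, random_state=seed).fit(X)
        results[k] = {
            "labels": km.labels_,
            "sse": km.inertia_,
            "davies_bouldin": davies_bouldin_score(X, km.labels_),       # lower is better
            "calinski_harabasz": calinski_harabasz_score(X, km.labels_), # higher is better
        }
    return results
```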
6.2 Behaviour Prediction
Once the relevant camera behaviour types are identified, we proceed by modelling the relationship between game-play and camera behaviour types. More precisely, since the model is intended to select the camera behaviour that best fits the player's preferences in the game, we attempt to approximate the function that maps the game-play behaviour of the player to the camera behaviour. For this purpose we use Artificial Neural Networks (ANNs), which are chosen for their universal function approximation capacity. In particular, we employ a different ANN for each area type, instead of one for the whole dataset, in order to base each model on the features that best describe the game-play in that area. The three fully connected feed-forward ANNs are trained using resilient backpropagation [14] on the game-play data (ANN input) and the camera behaviour clusters (ANN output), using early stopping to avoid over-fitting. The networks employ the logistic (sigmoid) function at all their neurons. The performance of the ANNs is obtained as the best classification accuracy over 100 independent runs using 3-fold cross-validation. While the inputs of the ANN are selected algorithmically from the set of game-play features, the ANN outputs are a set of binary values corresponding to each cluster of the dataset. The exact ANN input features, the number of hidden neurons and the number of previous areas considered in the ANN input are found empirically through automatic feature selection and exhaustive experimentation. Sequential Forward Selection (SFS) is employed to find the feature subset that yields the most accurate ANN model. SFS is a local-search procedure in which one feature at a time is added to the current feature (ANN input) set as long as the accuracy of the prediction increases. Once the best feature set is selected, the best ANN
Table 2. F-distribution values of the inter-cluster ANOVA test on the game-play features. The threshold for a 5% significance level is F > 3.85 for the collection areas and F > 2.99 for the jump areas. Values above the threshold appear in bold.

Area Type | Fuel Cells | Damage Given | Damage Received | Enemies Killed | Re-spawn Points | Jumps
Collection | 5.02 | - | - | - | 1.23 | 0.76
Fight | 1.63 | 12.42 | 10.89 | 24.03 | 0.49 | 9.64
Jump | 11.98 | - | - | - | - | 0.53
topology is calculated through an exhaustive search of all possible combinations of neurons in two hidden layers, with a maximum of 30 neurons per layer. The combination of automatic feature and topology selection is tested on three different feature sets representing different time horizons into the past: input (game-play features) from the previously visited area (one step), input from the previous two areas visited (two step), and the combination of the previous area with the average features of the rest of the previously visited areas (one step + average).
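The greedy feature-selection wrapper described above can be written in a few lines; train_eval stands for training the area's ANN on a feature subset and returning its 3-fold cross-validation accuracy, and is assumed to be provided.

```python
def sequential_forward_selection(train_eval, all_features):
    """SFS: repeatedly add the single feature that most improves the
    cross-validated accuracy, and stop when no remaining feature helps."""
    selected, best_acc = [], 0.0
    improved = True
    while improved and len(selected) < len(all_features):
        improved, best_candidate = False, None
        for feature in all_features:
            if feature in selected:
                continue
            acc = train_eval(selected + [feature])
            if acc > best_acc:
                best_acc, best_candidate, improved = acc, feature, True
        if best_candidate is not None:
            selected.append(best_candidate)
    return selected, best_acc
```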
7 Results and Analysis
In this section we present and discuss the results in terms of the prediction accuracy of the camera behaviour models created. First, a statistical analysis of the data is performed to check for the existence of a relationship between camera behaviours and game-play features and to identify the significant ones. Then, the prediction accuracy of the models is evaluated with respect to the length of the time series expressing the past which is considered in the ANN input, the selected feature set and the network topology. To isolate the significant features among the ones logged, we perform an inter-cluster one-way ANOVA for each game-play feature to identify the features for which we can reject the null hypothesis (that no statistical difference exists). As can be seen in Table 2, for each area type at least one feature shows a significant difference, revealing the existence of significant linear relationships. In the fight areas dataset there is a significant difference in terms of damage (both given and taken), number of killed enemies and number of jumps. In the other two area datasets the clusters differ significantly in the number of fuel cells collected. The features highlighted by the ANOVA test reveal the existence of a linear relationship between the current camera behaviour and those features. However, variable relationships are in general most likely more complex, given that linearly separable problems are extremely rare. Thus, the aim of the analysis presented below is the construction of non-linear computational models of camera behaviour via the ANNs described in Section 6. Figure 3 depicts the best performance (3-fold cross-validation) over 100 runs for each feature set on the three representations of past events described in Section 6.
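The per-feature inter-cluster ANOVA can be reproduced with SciPy as below; the input layout (one array of feature values per cluster) is an assumption of the sketch.

```python
from scipy.stats import f_oneway

def anova_report(feature_groups, alpha=0.05):
    """One-way ANOVA for each game-play feature across the clusters of one
    area type; feature_groups maps a feature name to a list of arrays, one
    array of values per cluster."""
    report = {}
    for name, groups in feature_groups.items():
        F, p = f_oneway(*groups)
        report[name] = {"F": F, "p": p, "significant": p < alpha}
    return report
```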
[Figure 3: accuracy (%) bars for the 1S, 2S and 1S+A past representations in the Fight, Jump and Collection areas, with bars for the All Features, Significant Features and SFS input sets.]
Fig. 3. Best 3-fold cross-validation performance obtained by the three ANNs across different input feature sets and past representations. The bars labelled 1S refer to the one-step representation of the past trace, the ones labelled 2S refer to the two-step representation, and 1S+A to the representation combining one previous step and the average of the whole past trace.
Each value shown in Figure 3 corresponds to the best topology found. It is noteworthy that all the selected topologies have at least one hidden layer, confirming the non-linear nature of the relationship. This aspect is also highlighted, in Figure 3, by the difference in prediction accuracy between the ANNs that use the subset of significant features identified through the ANOVA test and the ANNs using the subset identified through SFS. The latter type of ANN yields better accuracy regardless of the past representation and game area. The best 3-fold cross-validation performances achieved for the fight, the jump, and the collection areas are, respectively, 70.04%, 76.43% and 82.29%. It is worth noting that, in the collection areas, the first type of ANN, built solely on the features found significant by ANOVA, performs even worse than the one using the full feature set, indicating that the linear analysis does not capture the relationship between game-play and camera behaviour accurately. While, in the collection and jump areas, the ANOVA test indicates the number of fuel cells collected as the only significant feature, SFS selects the number of jumps and the number of re-spawn points activated as additional features for the ANN input. On the other hand, in the collection areas, SFS not only selects features not indicated by ANOVA (the number of re-spawn points activated), but also discards the number of jumps performed. The results shown in Figure 3 also confirm our supposition that a more extensive representation of past events leads to better accuracy. In fact, the best accuracies are achieved when the ANNs use the most extensive information about past game-play events.
8 Conclusion
This article introduced the first step towards adaptive virtual camera control in computer games by proposing a model of camera behaviour and its relationship
to game-play. Camera behaviour is modelled using a combination of information about players' gaze and virtual camera position collected during a game experiment. The collected data is sorted into three sets of areas according to the game-play provided to the player. For each group the data is clustered using k-means to identify relevant behaviours, and the relationship between the clusters and the game-play experience is modelled using three ANNs. The accuracy of the ANNs in predicting camera behaviour is analysed with respect to the number and type of features used as input to the model. The analysis reveals that the best prediction accuracies (i.e. 76.43% for jump, 82.29% for collection and 70.04% for fight areas) are achieved using a representation of past events which includes the previous area and the average features of the other previously visited areas. Moreover, sequential feature selection reduces the input vector size, which results in higher accuracies for all ANNs. While the models constructed from the data show a prediction accuracy well above chance level, the analysis of the collected data also suggests that the game is visually very complex and that a multitude of objects competing for the player's attention generates noise in the data. This could be limited by reducing the visual noise in the game; however, this would require reducing the game's complexity and would thus also reduce the applicability of the results to other games of the same genre. The current version of the game visually incorporates all the standard features of modern 3D platformers and, the aforementioned limitations notwithstanding, the results give strong indications of the link between camera behaviour models and game-play in such games. The same methodology could be used to construct camera behaviour models for different games such as role-playing or action games. Models constructed using this methodology could be used to dynamically select the camera behaviour best suiting a certain player, thereby generating a personalised game experience. For this purpose, it would be necessary to further investigate how to translate the camera behaviours identified through this methodology into camera profiles that could be used by an automatic camera control system. Moreover, it would be interesting to investigate how to connect the proposed models to cognitive and affective states of the player, in order to be able to influence player experience aspects such as fun or attention.
References 1. Amerson, D., Kime, S., Young, R.M.: Real-time cinematic camera control for interactive narratives. In: International Conference on Advances in Computer Entertainment Technology, pp. 369–369. ACM Press, Valencia (2005) 2. Bares, W.H., McDermott, S., Boudreaux, C., Thainimit, S.: Virtual 3D camera composition from frame constraints. In: ACM Multimedia, pp. 177–186. ACM Press, Marina del Rey (2000) 3. Bares, W.H., Zettlemoyer, L.S., Rodriguez, D.W., Lester, J.C.: Task-Sensitive Cinematography Interfaces for Interactive 3D Learning Environments. In: International Conference on Intelligent User Interfaces, pp. 81–88. ACM Press, San Francisco (1998)
4. Bernhard, M., Stavrakis, E., Wimmer, M.: An empirical pipeline to derive gaze prediction heuristics for 3D action games. ACM Transactions on Applied Perception 8(1), 4:1–4:30 (2010) 5. Bourne, O., Sattar, A.: Applying Constraint Weighting to Autonomous Camera Control. In: AAAI Conference On Artificial Intelligence In Interactive Digitale Entertainment Conference, pp. 3–8 (2005) 6. Charles, F., Lugrin, J.-l., Cavazza, M., Mead, S.J.: Real-time camera control for interactive storytelling. In: International Conference for Intelligent Games and Simulations, London, pp. 1–4 (2002) 7. Christianson, D., Anderson, S., He, L.-w., Salesin, D., Weld, D., Cohen, M.: Declarative Camera Control for Automatic Cinematography. In: AAAI, pp. 148–155. AAI, Menlo Park (1996) 8. Christie, M., Olivier, P., Normand, J.-M.: Camera Control in Computer Graphics. In: Computer Graphics Forum, vol. 27, pp. 2197–2218 (2008) 9. Drucker, S.M., Zeltzer, D.: Intelligent camera control in a virtual environment. In: Graphics Interface, pp. 190–199 (1994) 10. Jhala, A., Young, R.M.: A discourse planning approach to cinematic camera control for narratives in virtual environments. In: AAAI, pp. 307–312. AAAI Press, Pittsburgh (2005) 11. Martinez, H.P., Jhala, A., Yannakakis, G.N.: Analyzing the impact of camera viewpoint on player psychophysiology. In: International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 1–6. IEEE, Los Alamitos (2009) 12. Picardi, A., Burelli, P., Yannakakis, G.N.: Modelling Virtual Camera Behaviour Through Player Gaze. In: International Conference On The Foundations Of Digital Games (2011) 13. Pinelle, D., Wong, N.: Heuristic evaluation for games. In: CHI 2008, p. 1453. ACM Press, New York (2008) 14. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: the RPROP algorithm. IEEE, Los Alamitos (1993) 15. Sundstedt, V., Stavrakis, E., Wimmer, M., Reinhard, E.: A psychophysical study of fixation behavior in a computer game. In: Symposium on Applied Perception in Graphics and Visualization, pp. 43–50. ACM, New York (2008) 16. Ware, C., Osborne, S.: Exploration and virtual camera control in virtual three dimensional environments. ACM SIGGRAPH 24(2), 175–183 (1990) 17. Yannakakis, G.N., Martinez, H.P., Jhala, A.: Towards Affective Camera Control in Games. User Modeling and User-Adapted Interaction (2010) 18. Yarbus, A.L.: Eye Movements and Vision. Plenum press, New York (1967)
An Interactive Design System for Sphericon-Based Geometric Toys Using Conical Voxels
Masaki Hirose¹, Jun Mitani¹,², Yoshihiro Kanamori¹, and Yukio Fukui¹
¹ Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan
² JST/ERATO, Frontier Koishikawa Bldg., 7F 1-28-1, Koishikawa, Bunkyo-ku, Tokyo 112-0002, Japan
{hirose,mitani,kanamori,fukui}@npal.cs.tsukuba.ac.jp
Abstract. In this paper, we focus on a unique solid, named a “sphericon”, which has geometric properties that cause it to roll down a slope while swinging from side to side. We propose an interactive system for designing 3D objects with the same geometric characteristics as a sphericon. For high system efficiency, we used a conical voxel representation for defining these objects. The system allows the user to concentrate on the design while itself ensuring that the geometrical constraints of a sphericon are satisfied. The user can also preview the rolling motion of the object. To evaluate the effectiveness of the proposed system, we fabricated the designed models using a 3D printer, and confirmed that they rolled as smoothly as a standard sphericon. Keywords: user interface, geometric modeling, sphericon, conical voxel.
1 Introduction
Of the many kinds of toys available in the world, one group is sometimes referred to as "geometric toys", since they have geometrically interesting shapes or movements [1]. These toys are considered to be useful to evoke children's interest in geometry and physics, and they can be found both in shops and science museums. In this paper, we focus on a unique solid, named a "sphericon", which has interesting geometric characteristics such that it rolls down a slope swinging from side to side. It is difficult to predict how it will roll at first glance and it is entertaining to watch it move. A toy based on the sphericon has been available for some time in the U.S. [2]. The shape of the sphericon is generated by the following steps (see Fig. 1). First, a square is rotated by 180 degrees around its diagonal. As a result, two cones with apex angles of 90 degrees are obtained and are connected at their base circles. One half of this 3D shape is then rotated by 90 degrees about an axis perpendicular to the original rotation axis. A sphericon has the geometrical property that its center of gravity remains a fixed height above the surface over which it rolls, similar to the case for a
Fig. 1. Generation of a sphericon
sphere or a horizontally placed cylinder. When a sphere rolls, only a single point on its surface touches the floor. On the other hand, when a sphericon rolls, a generating line touches the floor, as is the case for a cylinder. This is because the surface of a sphericon consists of two conical sectors. This means that a sphericon will roll smoothly even if parts of its surface are sculpted away, so long as at least two points remain on a given generating line. Based on this fact, we propose an interactive system for designing objects that roll in the same way as a sphericon. The design process starts with a true sphericon, which is then sculpted to obtain new objects. The designed object should satisfy the following three geometric constraints.
Center of gravity constraint: The center of gravity of the object should remain at a fixed height above the floor on which it rolls.
Tangency constraint: The shape should touch the floor along a generating line. Although some parts of the line may be removed, at least two points on any generating line must be retained.
Topology constraint: The object should comprise a single part so that it will not break apart.
In this paper, we propose a conical voxel representation for defining the object, which is well suited to representing sphericon-like objects. We implemented two different user interfaces so that the user can interactively edit an object while satisfying the above three constraints. The first allows the user to directly add and remove cells. In the second, the user paints on a plane map obtained by flattening the surface of a sphericon. In both methods, the system allows the user to concentrate on the design while itself ensuring that the geometrical constraints of a sphericon are satisfied. The user can also preview the rolling motion of the object. Finally, the designed object can be physically constructed using a 3D printer and its rolling behavior can be enjoyed. The main innovations of the present research are the use of conical voxels to represent the designed object and the development of the user interfaces described above. In this paper, we introduce related work in Section 2 and describe the details of our method in Section 3. In Section 4, we describe the results, and finally present our conclusions and ideas for future work in Section 5.
2 Related Work
2.1 Designing a Geometric Shape for Real-World Objects
A recent trend in geometric modeling in the field of computer graphics has been the modeling not only of freeform virtual shapes but also of the shapes of real-world objects. For instance, Mitani and Suzuki [3] and Shatz et al. [4] proposed methods for modifying 3D geometries to produce papercraft toys. Li et al. [5] proposed a method for generating pop-up card geometries from existing 3D data, in which a 3D structure pops up when the card is opened 90 degrees. In these studies, real-world objects are constructed from printed patterns, and the original shapes are modified so that they satisfy geometrical constraints such as "the shape must comprise multiple pieces which can be flattened without distortion" or "the shape must pop up without collisions occurring". Using a computer to ensure that the target shape satisfies the required geometrical constraints is a useful approach, especially when the constraints are severe or the user is a novice. When designing 3D real-world objects, we sometimes have to consider not only geometrical problems but also the physical characteristics of the object itself. Mori and Igarashi [6] proposed a system for designing stuffed toys made with stretchy fabric. The system simulates material stretching based on the pressure generated by the inner cotton, and the user can interactively design new stuffed toys while viewing the results of the simulation. Furuta et al. [7] proposed a system for designing kinetic art which takes account of the motion of objects: the behavior of rigid bodies in the system is simulated, and the predicted scene is displayed so that the user can interactively design kinetic artwork while observing the resulting motion. These studies have shown that combining an object design system with motion simulation represents a powerful tool for designing real-world objects. In this paper, we also propose interfaces with which the user can easily design objects that satisfy the geometrical constraints required for rolling on a floor like a sphericon. Further, we employ a motion simulation for previewing the rolling behavior before actually constructing the objects.
2.2 Interactive Volume Data Editing
Surface models are commonly used in computer graphics to represent 3D models. However, a volumetric data representation, involving both density and weight, is sometimes needed to model solid real-world objects. One typical volumetric representation is based on voxels, small cubes that build up the 3D shape of the object. Recently, in the field of industrial modeling, a voxel-based topology optimization approach has often been used to determine the shape of machine components [8]. In this approach, shape optimization is carried out by minimizing the volume and mass of the component while maintaining the necessary strength. However, since its purpose is not to produce a shape reflecting the intent of the designer, this approach is not appropriate for an interactive editing system. Galyean et al. [9] proposed a voxel-based interactive shape editing
system that combined editing tools with a 3D input device. Perry et al. [10] proposed a digital clay system in which the user can design a shape in a similar way to clay modeling. In a study on the generation of geometrically interesting shapes based on a voxel representation, Mitra et al. [11] proposed a system which automatically generates a cubic object that projects three distinct shadow images, specified by the user, onto plane surfaces when the solid is lit from three different directions. Further, the user can edit the shape of the object while keeping the shadow images unchanged. This system is based on a similar concept to ours, in that while editing operations are carried out by the user, the computer ensures that the required geometric constraints are satisfied. However, the constraints applied are different: while Mitra et al. required that the object project target images as shadows, our requirement is that the object rolls on a floor as smoothly as a sphericon. In addition, the models generated by our system must be single components, which was not a requirement of Mitra et al. Further, while Mitra et al. employed a standard voxel representation, we propose a new conical voxel representation.
Large Objects Based on a Sphericon
Muramatsu constructed large human-sized objects from stainless steel pipes based on the geometrical characteristics of a sphericon [12]. The height of the center of gravity of each object was constant, and they rolled smoothly. The objects were manually designed based on the geometrical constraints. In the system proposed in the present study, such objects could be designed without the need for the user to consider specific geometrical constraints. In addition, Muramatsu could not see how the designed objects would roll in advance, whereas in our system this is made possible by integrating a motion simulation.
3 Proposed Method
3.1 Conical Voxel Representation
A sphericon is composed of four half-cones whose apex angle is 90 degrees, as shown in Fig.2 (center). We henceforth refer to these half-cones as “units”. To represent the position of any point in a unit, we introduce a coordinate system based on θ, r and h, as shown in Fig.2 (left). The origin O of the coordinate system is the center of the base circle of the original cone. O also represents the center of gravity of the sphericon. A point P in a sphericon is then represented by four parameters (i, θ, r, h), where i is an integer with a value from 0 to 3 that specifies the unit in which P is located. θ is the angle around the axis of the cone between the base generating line L and the generating line on which P lies, and takes values from 0 to π. r is the distance between P and the cone axis, divided by the radius of the base circle of the cone, and takes values from 0 to 1. h is the height of P above the base circle, divided by the height of the cone, and also takes values from 0 to 1.
Fig. 2. Structure of a sphericon in a conical voxel representation
In the proposed system, each unit is subdivided based on the coordinates θ, r and h into smaller regions referred to as "cells", as shown in Fig.2 (right). If the numbers of subdivisions along the coordinates θ, r, and h are denoted as Nθ, Nr, and Nh, respectively, the total number of cells in a sphericon is given by 4 × Nθ × Nr × Nh. The position of each cell is then described by a combination of four index values (i, Iθ, Ir, Ih), where i = 0, 1, 2, 3, 0 ≤ Iθ ≤ Nθ − 1, 0 ≤ Ir ≤ Nr − 1, and 0 ≤ Ih ≤ Nh − 1. In addition, each cell is assigned a binary value of 0 or 1. An object is defined as the set of cells whose value is 1. In this paper, we refer to this representation as a conical voxel representation. Unlike the general cubic voxel representation, it has the advantage of being able to smoothly represent the surface of a sphericon.
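As a concrete illustration, the following C++ sketch shows one possible way to store and address such a conical voxel grid. The type and member names are ours and are not taken from the authors' implementation, of which the paper only states that it was written in C++.

#include <vector>

// A possible storage scheme for the conical voxel grid described above:
// 4 units x Ntheta x Nr x Nh binary cells, addressed by (i, It, Ir, Ih).
struct ConicalVoxelGrid {
    int Ntheta, Nr, Nh;
    std::vector<char> cells;   // one 0/1 value per cell, initially all 1 (a full sphericon)

    ConicalVoxelGrid(int nt, int nr, int nh)
        : Ntheta(nt), Nr(nr), Nh(nh), cells(4 * nt * nr * nh, 1) {}

    int index(int i, int It, int Ir, int Ih) const {
        return ((i * Ntheta + It) * Nr + Ir) * Nh + Ih;
    }
    bool filled(int i, int It, int Ir, int Ih) const {
        return cells[index(i, It, Ir, Ih)] != 0;
    }
    void set(int i, int It, int Ir, int Ih, bool v) {
        cells[index(i, It, Ir, Ih)] = v ? 1 : 0;
    }

    // Continuous (theta, r, h) coordinates of a cell centre within its unit:
    // theta in [0, pi], r and h normalised to [0, 1].
    void cellCentre(int It, int Ir, int Ih,
                    double& theta, double& r, double& h) const {
        const double kPi = 3.14159265358979323846;
        theta = (It + 0.5) / Ntheta * kPi;
        r     = (Ir + 0.5) / Nr;
        h     = (Ih + 0.5) / Nh;
    }
};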
3.2 Geometrical Constraints
The designed object must satisfy the three geometrical conditions described in Section 1. We will now show how this can be achieved using the conical voxel representation. Center of gravity constraint. When the center of gravity of the edited object is located at the same position as that of the original sphericon (i.e., θ = r = h = 0), the object rolls smoothly without vibration of the center of gravity. This condition is always satisfied when all four units have the same shape because the units are located symmetrically around the center of the sphericon. Thus, in our system, the condition is imposed that all units must have the same shape. Although it is possible to satisfy the center of gravity constraint even for units with different shapes, we introduced this additional constraint in order to make the interface simpler. When a unit is edited, its shape is duplicated in the other units, leading to overall symmetry. Tangency constraint. Here we define the ruling line, RulingLine(Ri, Rθ ), as a set of cells Cell(i, Iθ , Ir , Ih ) such that i = Ri , Iθ = Rθ , Ir = Nr − 1, and 0 ≤ Ih ≤ Nh − 1. When a sphericon is represented as a set of cells, it touches the floor along one of its ruling lines. Even if some cells in the ruling line are removed, the sphericon still rolls smoothly if one of the following constraints is satisfied.
Fig. 3. (a) Two cells, surrounded by red circles, exist on a ruling line. (b) A single cell, surrounded by a red circle, exists on a ruling line, and one exists at the apex of the unit.
(1) At least two cells exist on a ruling line (Fig.3 (a)). (2) At least one cell exists on a ruling line and one exists at the apex of the unit (Fig.3 (b)). Topology constraint. The object should comprise a single part. This constraint is satisfied when all cells whose value is 1 form a single connected component, i.e., any such cell can be reached from any other without passing through cells whose value is 0.
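A sketch of the tangency check for constraints (1) and (2), reusing the ConicalVoxelGrid type from the earlier sketch, is given below. Which cell indices form "the apex of the unit" is not spelled out in the text; treating cells with Ir = 0 and Ih = Nh − 1 as the apex is our assumption.

// Our reading of the tangency constraint; apexPresent encodes an assumption.
static bool apexPresent(const ConicalVoxelGrid& g, int i) {
    for (int It = 0; It < g.Ntheta; ++It)
        if (g.filled(i, It, 0, g.Nh - 1)) return true;
    return false;
}

static bool rulingLineOk(const ConicalVoxelGrid& g, int Ri, int Rt) {
    int onLine = 0;                                  // cells on RulingLine(Ri, Rt)
    for (int Ih = 0; Ih < g.Nh; ++Ih)
        if (g.filled(Ri, Rt, g.Nr - 1, Ih)) ++onLine;
    return onLine >= 2 || (onLine >= 1 && apexPresent(g, Ri));   // (1) or (2)
}

// The object can roll smoothly only if every ruling line passes the check.
bool tangencyConstraintSatisfied(const ConicalVoxelGrid& g) {
    for (int i = 0; i < 4; ++i)
        for (int Rt = 0; Rt < g.Ntheta; ++Rt)
            if (!rulingLineOk(g, i, Rt)) return false;
    return true;
}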
3.3 Implemented System
We implemented a system which has two types of user interfaces with which the user can interactively design an object that satisfies the geometrical constraints described in the previous subsection. The system also has the ability to display a 3D animation of the designed solid rolling on a floor. Details of the system are described in the following. Direct editing. One of the interfaces we propose is a direct-editing interface with which the user can directly add or remove cells. Adding a cell means setting its value to 1, whereas removing a cell means setting it to 0. The user edits a model starting with a sphericon which is initially prepared by the system. Based on the geometrical constraints, the editable cells are limited by the system. When a cell appears red, it can be removed. When it is green, the user can add a cell at that position. Finally, when it is black, the user cannot add or remove that cell. The user repeats the operations of adding and removing cells under these limitations. Fig.4 shows a screenshot of the direct-editing interface. The resolution of the conical voxels is adjustable through the control panel. The user selects a single cell or multiple cells using the keyboard. Although the topology constraint is enforced by the system for each unit, the connectivity between different units must be confirmed by the user, because it is not easy to evaluate whether two cells located in different units are actually in contact with each other (e.g., the red and green units in Fig.2).
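The per-unit connectivity (topology) check the system performs could be sketched as a breadth-first flood fill over the filled cells of one unit, as below. This is our sketch on top of the ConicalVoxelGrid type introduced earlier; connectivity across units and at the cone apex is deliberately not handled, matching the remark that cross-unit contact is left to the user.

#include <array>
#include <queue>
#include <vector>

// Flood fill with a 6-neighbourhood in (It, Ir, Ih) within one unit.
bool unitIsConnected(const ConicalVoxelGrid& g, int i) {
    int total = 0;
    std::array<int, 3> start = {0, 0, 0};
    for (int It = 0; It < g.Ntheta; ++It)
        for (int Ir = 0; Ir < g.Nr; ++Ir)
            for (int Ih = 0; Ih < g.Nh; ++Ih)
                if (g.filled(i, It, Ir, Ih)) { ++total; start = {It, Ir, Ih}; }
    if (total == 0) return true;                     // empty unit: nothing to check

    std::vector<char> seen(g.Ntheta * g.Nr * g.Nh, 0);
    auto id = [&](int It, int Ir, int Ih) { return (It * g.Nr + Ir) * g.Nh + Ih; };
    std::queue<std::array<int, 3>> frontier;
    frontier.push(start);
    seen[id(start[0], start[1], start[2])] = 1;
    int reached = 1;
    const int step[6][3] = {{1,0,0},{-1,0,0},{0,1,0},{0,-1,0},{0,0,1},{0,0,-1}};
    while (!frontier.empty()) {
        std::array<int, 3> c = frontier.front();
        frontier.pop();
        for (const auto& s : step) {
            int It = c[0] + s[0], Ir = c[1] + s[1], Ih = c[2] + s[2];
            if (It < 0 || It >= g.Ntheta || Ir < 0 || Ir >= g.Nr ||
                Ih < 0 || Ih >= g.Nh) continue;
            if (!g.filled(i, It, Ir, Ih) || seen[id(It, Ir, Ih)]) continue;
            seen[id(It, Ir, Ih)] = 1;
            ++reached;
            frontier.push({It, Ir, Ih});
        }
    }
    return reached == total;                         // single connected component?
}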
Fig. 4. Direct-editing system, consisting of a control panel (left), a window for editing (center) and a window for previewing the motion of the object (right)
Fig. 5. The user paints a pattern on the flattened surface of a sphericon (left). The pattern is then applied to the external cells of the sphericon (right).
Pattern painting on flattened surface of a sphericon. Because the surface of a sphericon is composed of four conical subsurfaces, it is flattened into a plane, as shown in Fig.5. This flattened surface corresponds to the locus produced when the sphericon rolls. In the second interface, a pattern is painted on this flattened surface, and is then applied to the external cells of the sphericon. The cells shown in Fig.5 correspond to external cells Cell(i, Iθ , Ir , Ih ) where Ir = Nr − 1. The cells with Ir < Nr − 1 in the sphericon are initially removed. The pattern is painted using a mouse on a flattened sphericon surface prepared by the system. Based on the center of gravity constraint described earlier, the editable area is limited to one of the four units, and the pattern is automatically duplicated on the remaining units. When the user presses the “Tangency Checking On” button, the system notifies the user if the tangency constraint is not satisfied. The cells which need to be painted in order to make the designed object satisfy the topology constraint are automatically painted by the system. The user interface is shown in Fig.6.
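For instance, the duplication step required by the center of gravity constraint amounts to copying the painted external cells of the edited unit to the other three units. A minimal sketch, reusing the hypothetical ConicalVoxelGrid type from the earlier sketch:

// Copy the pattern painted on unit 0 (external cells, Ir = Nr - 1) to the
// remaining units, keeping the four units identical.
void duplicatePatternToAllUnits(ConicalVoxelGrid& g) {
    for (int It = 0; It < g.Ntheta; ++It)
        for (int Ih = 0; Ih < g.Nh; ++Ih) {
            bool painted = g.filled(0, It, g.Nr - 1, Ih);   // external cell of unit 0
            for (int i = 1; i < 4; ++i)
                g.set(i, It, g.Nr - 1, Ih, painted);
        }
}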
Fig. 6. Pattern-painting system, which consists of a control panel (left), a window for painting (center top), a window for previewing the resulting object (right), and a window for previewing the motion of the object (center bottom)
Fig. 7. Animation of a rolling sphericon
Generating animation of a rolling sphericon. Since the aim of designing sphericon-like objects is to enjoy watching their rolling motion, we included a motion simulation in our system. The motion data is generated by simulating the physical behavior of the original sphericon using the PhysX physics engine [13] and is then applied to the designed object.
4 Results
4.1 Obtained 3D Objects
We implemented the system proposed in this paper in C++ on a PC (CPU: Intel Core i7 2.80 GHz × 4, Memory: 4 GB, GPU: NVIDIA Quadro FX 580), and attempted to design new geometric toys which behaved like sphericons. We then fabricated the models using a 3D printer by exporting the geometries to OBJ-formatted files to evaluate the results.
Fig. 8. CG images of 3D models and photos of real objects printed by a 3D printer. (a, b, c) Designed using the direct-editing interface. (d) Designed using the pattern-painting interface.
Fig.8 shows four examples of designed 3D models and the final printed objects. All were confirmed to roll as well as a sphericon. The voxel resolutions of each unit used during the design of the objects shown in Fig.8 are (a) 41 × 41 × 42, (b) 41 × 41 × 42, (c) 21 × 21 × 21, and (d) 20 × 20 × 10. User study. We carried out a study to evaluate the response of users to the proposed system. The subjects were three university students in the Department of Computer Science. Table 1 shows the 3D models designed by the subjects and the time required to design each. We received positive comments such as "It was fun to interactively design 3D models. The system would be suitable for children", and "I could design a geometrically interesting shape easily, because the system assisted me". On the other hand, we received the following negative comment: "It is difficult to paint the intended patterns on the flattened pattern of a sphericon since the shape of the cells is not square".
5 Conclusion and Future Work
We proposed an interactive system for designing objects which have the same geometrical characteristics as a sphericon. For high system efficiency, we used a conical voxel representation for defining these objects. We developed two different user interfaces to allow the user to interactively edit the object. We also implemented a preview feature to confirm the motion of the object by using a motion simulation. To evaluate our system, we printed out designed models using a 3D printer and verified that they could roll as smoothly as a sphericon, confirming that sphericon-like objects can be successfully designed.
Table 1. 3D models designed by the subjects and the time required to design each

Interface                                                            subject 1   subject 2   subject 3
Direct editing (resolution of each unit: 20 × 20 × 20)                 46 min      13 min      46 min
Painting on a development (resolution of each unit: 20 × 20 × 10)      12 min      27 min      23 min
Fig. 9. Object whose center of gravity constraint is not satisfied (left). Result of adding cells to adjust the center of gravity (right).
However, there are also limitations to the proposed system. A major limitation is that designed objects are limited to symmetrical shapes in order to obey the center of gravity constraint. It is possible to design asymmetrical shapes by adjusting the center of gravity through the addition of extra cells after the object is designed. To evaluate this approach, we temporarily implemented this method in our system. Some cells are automatically added so that the center of gravity of the object moves to the appropriate position. Although the center of gravity constraint is fulfilled by adding some cells, as shown in Fig.9, the results are so far not very satisfactory. A strategy is necessary to determine which cells to add to adjust the center of gravity without causing large changes in the appearance of the object. Even if a designed object satisfies the tangency constraint and has two external cells on a generating line, it will not balance if those cells are located too close to each other. Precise physical simulations are required to identify such problems before producing real objects.
One of the drawbacks of using a conical voxel representation is that cells near the apex of a cone tend to become too small to edit. On the other hand, cells near the bottom are sometimes too large to represent detailed shapes. In addition, it is difficult to represent smooth freeform surfaces. In the future, for the practical application of such a system to designing children's toys, it will be necessary to consider strength and safety. Moreover, adding coloring methods to the simulation would be useful for designing colorful toys. Although some future work still remains, we believe that the results shown in this paper demonstrate the efficiency of an interactive design system in which a computer ensures that certain geometrical constraints are satisfied.
References
1. Sakane, I.: The History of Play. The Asahi Shimbun Company, Japan (1977) (in Japanese)
2. Toys From Times Past, http://www.toysfromtimespast.com/toys/sphericonpins2.htm
3. Mitani, J., Suzuki, H.: Making Papercraft Toys from Meshes using Strip-based Approximate Unfolding. ACM Transactions on Graphics 23(3), 259–263 (2004)
4. Shatz, I., Tal, A., Leifman, G.: Paper Craft Models from Meshes. The Visual Computer: International Journal of Computer Graphics 22(9), 825–834 (2006)
5. Li, X.Y., Shen, C.H., Huang, S.S., Ju, T., Hu, S.M.: Popup: Automatic Paper Architectures from 3D Models. ACM Transactions on Graphics 29(4), Article No. 111 (2010)
6. Mori, Y., Igarashi, T.: Plushie: An Interactive Design System for Plush Toys. ACM Transactions on Graphics 26(3), Article No. 45 (2007)
7. Furuta, Y., Mitani, J., Igarashi, T., Fukui, Y.: Kinetic Art Design System Comprising Rigid Body Simulation. Computer-Aided Design and Applications, CAD in the Arts Special Issue 7(4), 533–546 (2010)
8. Bendsoe, M.P., Sigmund, O.: Topology Optimization. Springer, Heidelberg (2003)
9. Galyean, T.A., Hughes, J.F.: Sculpting: An Interactive Volumetric Modeling Technique. SIGGRAPH Computer Graphics 25(4), 267–274 (1991)
10. Perry, R.N., Frisken, S.F.: Kizamu: A System for Sculpting Digital Characters. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 47–56 (2001)
11. Mitra, N.J., Pauly, M.: Shadow Art. ACM Transactions on Graphics 28(5), Article No. 156 (2009)
12. Muramatsu, T.: Solid Geometric Objects that Maintain a Constant Height when Moved. Journal of Graphic Science of Japan 40(4), 11–16 (2006) (in Japanese)
13. PhysX, http://www.nvidia.com/object/physx_new.html
A Multi-touch System for 3D Modelling and Animation
Benjamin Walther-Franks, Marc Herrlich, and Rainer Malaka
Research Group Digital Media, TZI, University of Bremen, Germany
Abstract. 3D modelling and animation software is typically operated via single-pointer input, imposing a serialised workflow that seems cumbersome in comparison to how humans manipulate objects in the real world. Research has brought forth new interaction techniques for modelling and animation that utilise input with more degrees of freedom or employ both hands to allow more parallel control, yet these are separate efforts across diverse input technologies and have not been applied to a usable system. We developed a 3D modelling and animation system for multi-touch interactive surfaces, as this technology offers parallel input with many degrees of freedom through one or both hands. It implements techniques for one-handed 3D navigation, 3D object manipulation, and time control. This includes mappings for layered or multi-track performance animation that allows the animation of different features across several passes or the modification of previously recorded motion. We show how these unimanual techniques can be combined for efficient bimanual control and propose techniques that specifically support the use of both hands for typical tasks in 3D editing. A study proved that even inexperienced users can successfully use our system for a more parallel and direct modelling or animation process. Keywords: modelling, animation, multi-touch, bimanual interaction.
1 Introduction
The features and rendering power of 3D modelling and animation software have continuously expanded over the last decades to produce ever better results. Yet such systems are still largely operated with one-handed single-pointer input, i.e., mouse or tablet, controlling only two degrees of freedom (DOF) at the same time. This stands in contrast to how humans often employ both hands in real world tasks for a more parallel workflow. It has motivated many researchers to develop new 3D navigation, modelling, and animation strategies that explore high DOF input [3,16] or bimanual control [23,1]. Of these, the most interesting and recent are multi-touch 3D control techniques [8,17]. Multi-touch interactive surfaces offer additional DOF compared to keyboard/mouse interfaces. The direct mapping between movements on the multi-touch display and the virtual scene as well as the possibility to make large fluid movements increase the feeling for proportion, spatial relationships, and timing.
However, these novel interaction techniques are only designed, developed and evaluated in the limited scope of prototypical implementations. They are rarely integrated into working systems, or discussed in combination with other techniques. In this paper, we address the problem of developing a 3D modelling and animation system for multi-touch interactive surfaces that integrates concepts for more parallel, direct and expressive modelling and animation. We have identified and met four challenges:
– Integrating features into a legacy system. Rather than constructing a system from scratch, how can we extend existing software without being restricted in our design space?
– Finding mappings for unimanual control. How can multi-touch input through one hand be used to navigate space, manipulate objects, and control timing?
– Developing strategies for bimanual control. How can unimanual controls be combined for symmetric or asymmetric control of space, space and view, or space and time? How can we leverage the extra freedom of the second hand to address typical modelling or animation problems?
– Bringing 3D layered performance animation to interactive surfaces. The direct coupling of input to output makes interactive surfaces predestined for performative animation approaches. How can mappings for multi-track performance animation be transferred to the surface?
After discussing related work, we describe how we extended the open source 3D package Blender with multi-touch input and parallel event handling. We then describe the system functionality of unimanual multi-finger gestures for feature manipulation, view navigation, and timeline control. We further explain how users can assign each hand to the control of one of these aspects in quick succession or even simultaneously. We explicate three new bimanual techniques, auto-constraints, object pinning, and view pinning, that we developed to tackle specific problems in modelling and performance animation. Finally, we present a user study which illustrates that even inexperienced users can successfully use our system for more parallel and direct modelling and animation.
2 Related Work
2.1 Multi-touch Input for 2D and 3D Manipulation
The general problem of 2D and 3D manipulation on multi-touch interactive surfaces has been approached in many different ways, although not in the context of 3D modelling and animation. Techniques based on the number of fingers used include one finger selection and translation and two finger scale and rotate [22]. Employing least-squares approximation methods to derive stiff 2D affine transformations from finger movements [12] is a technique commonly used for the “ubiquitous” image browser. More recently, physics-based approaches have been explored, incorporating multi-touch input into a physics engine in order to simulate grasping behaviour using virtual forces [21,20].
Work in this area often tries to leverage the benefits of the direct-touch paradigm, yet direct control is not always desirable. Limited reach (input space too small), limited precision (input space too coarse), or object occlusion can all be reasons for indirect control on interactive surfaces [4,13]. Researchers have further investigated multi-touch techniques for 3D manipulation on interactive surfaces. Various multi-finger techniques have been proposed for translation and rotation in 3D space with different DOF and constraints [7,8,10]. The approximation approach for deriving affine transformations in the 2D domain has also partly been extended to 3D using a non-linear least-squares energy optimization scheme to calculate 3D transformations directly from the screen-space coordinates of the touch contacts, taking into account the corresponding object-space positions [17]. While some of these techniques at least theoretically allow unimanual operation, none of them were explicitly designed to be used with one hand.
2.2 Bimanual Interaction
For real-world tasks humans often use both hands in an asymmetric manner, meaning that the non-dominant hand (NDH) is used to provide a frame of reference for the movements of the dominant hand (DH) [6]. Asymmetric bimanual interaction using two mice for controlling the camera with the NDH and object manipulation with the DH was shown to be 20% faster than sequential unimanual control for a 3D selection task [1]. The same study also explored symmetrical bimanual interaction for 3D selection, controlling the camera and the object at the same time, and found symmetrical bimanual interaction to impose a slightly higher cognitive load on inexperienced users than asymmetrical interaction. To what extent this applies to multi-touch interaction has not yet been investigated. Researchers have explored the use of two pointers to perform various operations in 3D desktop applications, finding that the best techniques were those with physical intuition [23]. Comparing direct-touch input to indirect mouse interaction, the former has been found to be better suited for bimanual control, due to the cognitive overload of controlling two individual pointers indirectly [5]. Others have found that although users naturally tend to prefer bimanual interaction in the physical realm, in the virtual domain unimanual interaction is prevalent [19].
2.3 Performance Animation
Performance animation, also known as computer puppetry, has a tradition of complex custom-built input hardware for high DOF input [3,16,18]. Multi-track or layered performance animation can split the DOF across several passes [16]. Dontcheva et al. identify several input-scene mappings and extend this technique to redoing or adjusting existing animation passes [2]. First steps toward more standardised input hardware were made by Neff et al., who explored configurable abstract mappings of high DOF content for layered performance animation that could be controlled with 2 DOF via mouse [15]. So far, the possibility of high DOF input through multi-touch interactive surfaces has only been explored for 2D performance animation [14].
3 System
For our multi-touch modelling and animation system we built on the open source 3D modelling tool Blender. This allowed us to start with a very powerful and feature-complete modelling and animation tool chain. In order to integrate multi-touch input, we had to adapt many operations and parts of the internal event system. We use a version of multi-finger chording [11] to map unimanual multi-finger gestures to existing 2 DOF operators. Blender features a complete environment for 3D modelling and animation. Currently, its architecture does not feature any multi-point processing. For our multi-touch extension we employ the OSC-based TUIO protocol [9]. We require the multi-touch tracker to provide TUIO messages over a UDP port.
For registration of unimanual multi-finger gestures, our system uses temporal and spatial thresholds to form finger clusters. If a new finger input is within 1/5th screen width and height distance of the average position of an existing finger cluster, it is added to the agglomeration and the average position is updated. Clusters can contain up to four fingers. Once a cluster has existed for 120 ms, the system releases a multi-finger event. Each cluster has a unique id assigned to it, which it keeps until destruction. The registration event and every subsequent move or destruction event issues this id. Adding a finger to or removing a finger from the cluster does not change the gesture. This makes continuous gestures resistant to tracking interruptions or touch pressure relaxation. The cluster remains until the last of its fingers is removed. To reduce errors caused by false detections or by users accidentally touching the surface with more fingers than intended, touches are also filtered by a minimum lifetime threshold, currently set to 80 ms.
Significant modification of the Blender event architecture was necessary to enable multi-point input. Operators are the entities responsible for executing changes in the scene. For continuous manipulation such as view control or spatial transformations, Blender normally employs modal operators that listen for new input until canceled. To enable simultaneous use of multiple operators in different screen areas, we had to abandon modality as a concept for operators and dialogs, as exclusive global ownership of input and application focus conflicts with parallel multi-touch interaction. We introduced local input ownership by assigning to each operator the id of the event that called it. An operator will only accept input from events with the id it was assigned. Thus, continuous input is paired to a specific operator and cannot influence prior or subsequently created operators, as they are in turn paired with other input events via their ids. Global application focus had to be disabled accordingly. This also meant that optimisations like restricting ray casts for picking to specific areas are no longer possible. Furthermore, interpretation of touch input generally requires a more global approach, because single touches can only be interpreted correctly with knowledge of all currently available and/or active operations and of other past and current touches, imposing a significant processing overhead compared to single-pointer input.
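The following C++ sketch illustrates the clustering scheme described above, with the thresholds taken from the text (one fifth of the screen extent, 120 ms until a multi-finger event is released; the 80 ms minimum-lifetime filter is assumed to run before addTouch is called). All class and function names are ours and do not correspond to Blender's internal API.

#include <cstdint>
#include <vector>
#include <cmath>

struct Touch { std::uint64_t id; float x, y; };            // x, y normalised to [0, 1]

struct Cluster {
    std::uint64_t id;                    // persists until the last finger lifts
    std::vector<std::uint64_t> touchIds;
    float cx, cy;                        // running average position
    double createdAt;
    bool reported;                       // multi-finger event already emitted?
};

class GestureClusterer {
    std::vector<Cluster> clusters_;
    std::uint64_t nextId_ = 1;
public:
    void addTouch(const Touch& t, double now) {
        for (Cluster& c : clusters_) {
            if (c.touchIds.size() >= 4) continue;          // at most four fingers
            if (std::fabs(t.x - c.cx) < 0.2f && std::fabs(t.y - c.cy) < 0.2f) {
                c.touchIds.push_back(t.id);                // join: gesture unchanged
                float k = static_cast<float>(c.touchIds.size());
                c.cx += (t.x - c.cx) / k;                  // update average position
                c.cy += (t.y - c.cy) / k;
                return;
            }
        }
        clusters_.push_back({nextId_++, {t.id}, t.x, t.y, now, false});
    }

    // Release a multi-finger event for clusters older than 120 ms.
    template <class EmitFn>
    void update(double now, EmitFn emit) {
        for (Cluster& c : clusters_)
            if (!c.reported && now - c.createdAt > 0.12) {
                emit(c.id, static_cast<int>(c.touchIds.size()), c.cx, c.cy);
                c.reported = true;
            }
    }
    // Finger-removal and cluster-destruction handling is omitted for brevity.
};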
Fig. 1. Blending of frames of temporal sequences showing unimanual continuous multi-finger gestures for view control: two-finger pan, three-finger rotate, four-finger dolly move
4 Multi-finger Mappings for Unimanual Control
In order to allow asymmetric and symmetric control for maximum parallelisation, we make all basic controls usable with one hand. In the following, we describe how we map multi-finger gestures to object, camera, and time control.
4.1 Unimanual Object Manipulation
Object selection and translation are among the most common tasks in 3D modelling. We used the simplest gestures for these tasks, namely one-finger tapping for selection and one-finger dragging for translation. Indirect translation is useful in many cases: for the manipulation of very small objects, many objects in close proximity (with possible overlap), or for ergonomic reasons on large interactive surfaces. Our system allows users either to touch an object and immediately move their finger for direct translation, or to just tap the object to select it and then move a single finger anywhere in the 3D view on the multi-touch surface for indirect translation.
4.2 Manipulating Dynamic Objects
Even with input devices supporting the control of many DOF, performance animators often will not be able to animate all features concurrently. Further, animators will sometimes want to re-do or adjust recorded animation. Multi-track or layered animation can help to split DOF across several passes or to adjust animation recorded in previous passes [2,15,16]. We follow Dontcheva et al.'s approach of absolute, additive and trajectory-relative mappings for layered performance animation [2].
4.3 Unimanual Camera Control
As changing the view is a common task in 3D modelling, the employed gestures should be as simple and robust to detect as possible. Furthermore, gestures should be usable with either hand and regardless of handedness, and must not conflict with object manipulation gestures. Additionally, view control should work without using special areas or the like, in order to reduce screen clutter and to facilitate the feeling of direct interaction and control. In our system, camera/view control is implemented via two-, three- and four-finger gestures (see Section 3).
Fig. 2. Three frames of a temporal sequence (left to right) showing direct symmetric bimanual object manipulation (auto-constraints, axis highlighted for better visibility): both objects are moved simultaneously, one acting as the target object for the other
There is no "optimal" mapping between multi-finger gestures and view control functions. We found good arguments for different choices, and they are to a certain extent subject to individual preferences. One important measure is the frequency of use of a certain view control; one could thus argue that the more commonly used functions should be mapped to the gestures requiring less user effort, i.e., fewer fingers. A different measure is how common a certain mapping is in related applications. In our experience, changing the distance of the view camera to the scene is the least used view control. We thus decided to map it to the four-finger gesture. Users can dolly move the camera by moving four fingers vertically up or down. Two fingers are used for panning the view and three fingers for rotation (see figure 1).
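In code, the dispatch from cluster size to view operation reduces to a small mapping like the following sketch; the enum and function are placeholders of ours, not Blender operator names. The timeline reuses the same cluster sizes for its own operations, as described in the next subsection.

// Finger-count-to-view-operation mapping described above.
enum class ViewOp { None, Pan, Rotate, Dolly };

ViewOp viewOpForCluster(int fingerCount) {
    switch (fingerCount) {
        case 2:  return ViewOp::Pan;     // two-finger pan
        case 3:  return ViewOp::Rotate;  // three-finger rotate
        case 4:  return ViewOp::Dolly;   // four-finger dolly (vertical movement)
        default: return ViewOp::None;    // one finger: selection / translation
    }
}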
4.4 Unimanual Time Control
For time control we employ a timeline. We transfer multi-finger gestures into the time domain, reusing gestures from the space domain: One finger allows absolute positioning of the playhead enabling scrubbing along the timeline, two and three finger gestures move the frame window displayed in the window forward or backward in time, and four finger gestures expand or contract the scale of the frame window. We disabled indirect control for the timeline, as this would make absolute jumping to selected frames impossible.
5 Bimanual Control for View, Space, and Time
We now show how unimanual controls for view, space and time can be combined for more parallel, efficient bimanual control. We also present three techniques that exploit bimanual interaction to further support common 3D modelling and performance animation tasks: auto-constraints, object pinning, and view pinning.
5.1 Bimanual Object Manipulation
Our system enables simultaneous translation of several objects. This allows quite natural 2 DOF control of separate features with a finger of each hand. For example, an animator can pose two limbs of a character simultaneously rather than sequentially. While this is beneficial for keyframe animation, it is central to performance animation. Thus, a bimanual approach to layering animation theoretically halves the required animation tracks.
Fig. 3. Three frames of a temporal sequence (left to right) showing bimanual camera and object control (object pinning): the DH indirectly controls the object while the NDH simultaneously rotates the view
Auto-Constraints. To support users in performing docking operations, we implemented a technique we termed auto-constraints, which leverages bimanual interaction. The user can select one object to act as an anchor, which can be moved freely with one hand. When a second object is moved concurrently with the other hand, the movement of this second object is automatically constrained in a way that helps to align the two (figure 2). In our current implementation we use the axis that connects the centres of both objects (or the geometric centres of the two sets of objects), but an extension to a plane-based constraint would be possible. Auto-constraints are currently enabled by default during modelling and disabled during animation, since when animating, moving several objects simultaneously and independently is usually preferred to docking.
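A minimal sketch of the constraint itself, as we read it from the description above: the dragged object's requested translation is projected onto the axis through the two object centres. All names here are ours, not taken from the implementation.

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Restrict the second object's translation to the axis connecting the two
// object centres, guiding the objects towards alignment.
Vec3 constrainTranslation(const Vec3& movingCentre, const Vec3& anchorCentre,
                          const Vec3& requestedDelta) {
    Vec3 axis = {anchorCentre[0] - movingCentre[0],
                 anchorCentre[1] - movingCentre[1],
                 anchorCentre[2] - movingCentre[2]};
    double len = std::sqrt(axis[0]*axis[0] + axis[1]*axis[1] + axis[2]*axis[2]);
    if (len < 1e-9) return requestedDelta;        // centres coincide: no constraint
    for (double& a : axis) a /= len;              // normalise the constraint axis
    double along = requestedDelta[0]*axis[0] + requestedDelta[1]*axis[1]
                 + requestedDelta[2]*axis[2];     // component along the axis
    return {along * axis[0], along * axis[1], along * axis[2]};
}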
5.2 Bimanual Camera and Object Control
Concurrent camera and object control follows Guiard's principle of setting the reference frame in bimanual interaction [6] and has been suggested to improve efficiency as well as to facilitate depth perception via the kinetic depth effect in mouse-based interaction [1]. The requirements of bimanual interaction on a multi-touch display are somewhat different than for bimanual interaction using indirect input, such as two mice or a touchpad, as independent but simultaneous interaction with each hand can break the direct interaction paradigm. For example, changing the view alters an object's position on the screen. If this object is simultaneously being manipulated, it would move away from the location of the controlling touch. For our system, we developed object pinning to resolve this. Furthermore, we developed view pinning to solve the problem of selecting and manipulating dynamic objects, as can occur in layered performance animation. Object Pinning. While bimanual camera and object control facilitates a parallel workflow and smooth interaction, in a completely unconstrained system this might also lead to confusion and conflicts. By independently changing the view and the controlled object's position freely, users might easily lose track of the orientation of the object and/or camera. Furthermore, users have to incrementally change the view and adjust the object position to get a good sense of the relative positions and distances in the virtual 3D space. A third problem often encountered in 3D modelling is to move a group of "background" objects relative to a "foreground" object. Object pinning is our answer to all three of these problems.
Fig. 4. Four frames of a temporal sequence (left to right) showing bimanual asymmetric control of view and space: the NDH fixes a dynamic reference frame relative to the screen for local performance animation by the DH (view pinning)
By using the screen space position of the finger touching the object as a constraint and by keeping the distance to the virtual camera, i.e., screen z, constant, the object will stay at the same relative position in screen space at the user's finger. Additionally, object pinning enables a "homing in" kind of interaction style (figure 3) for object docking and alignment. It also enables another level of indirection for transformation operations: by pinning the object to a certain screen space position the user can indirectly transform the object in world space by changing the camera. We implemented object pinning for panning and rotating the camera. View Pinning. A typical example of layering the animation of a character would be as follows: the creation of the trajectory as the first layer, and animation of legs, arms etc. in subsequent layers [2]. But the more dynamic the trajectory, the harder it becomes to convincingly animate local features from a global reference frame. Thus, the explicit setting of a reference frame by aligning the view to it is desirable for multi-track performance animation. View pinning allows easy control of the spatial reference frame by enabling the user to affix the view to the currently selected feature interactively with a multi-finger gesture. In performance animation mode, the two-finger gesture used for panning the view is replaced by view pinning. When the gesture is registered at time t0, the view stays locked to the feature selected at t0 with the camera-feature offset at t0. When the feature moves, the view will move with it (figure 4). The view will stay aligned in this manner, regardless of what is selected and manipulated subsequently, until the animator ends the gesture. The relation to panning the view is maintained, as a continuous move of the two-finger input will change the view-feature offset accordingly but keep it pinned.
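View pinning can be sketched as follows, assuming a simple world-space camera position; the structure and function names are ours and the actual implementation inside Blender will differ.

#include <array>

using Vec3 = std::array<double, 3>;

// At gesture start the camera-to-feature offset is stored; afterwards the
// camera follows the selected feature, and two-finger movement adjusts the
// stored offset while keeping the view pinned.
struct ViewPin {
    Vec3 offset;   // camera position minus feature position at pin time (t0)
};

ViewPin startViewPin(const Vec3& cameraPos, const Vec3& featurePos) {
    return {{cameraPos[0] - featurePos[0],
             cameraPos[1] - featurePos[1],
             cameraPos[2] - featurePos[2]}};
}

// Called every frame while the gesture is held: keeps the view locked.
Vec3 pinnedCameraPos(const ViewPin& pin, const Vec3& featurePos) {
    return {featurePos[0] + pin.offset[0],
            featurePos[1] + pin.offset[1],
            featurePos[2] + pin.offset[2]};
}

// A continuous two-finger move changes the offset but keeps it pinned.
void panWhilePinned(ViewPin& pin, const Vec3& panDelta) {
    for (int k = 0; k < 3; ++k) pin.offset[k] += panDelta[k];
}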
5.3 Bimanual Object and Time Control
Bimanual interaction can be a great benefit for time control in classical keyframe animation (as opposed to performative approaches) because it allows rapid switching between frames: by assigning one hand to set the time frame while the other controls space, users can rapidly alternate between time and space control without losing orientation in either dimension, as the hands/fingers act as a kind of physical marker. The scrubbing feature can also support a better overview of the animation. These benefits equally apply to performance animation. However, there are further ways in which bimanual input to the time and space domains can be exploited for the performance approach: with one hand moving continuously
through time, the other can act in space simultaneously. This allows fine control of the playhead with one hand, which can propagate non-linearly backwards or forwards at varying speeds as the user wishes, while at the same time acting with the other hand.
6 User Study
We conducted a study to see how people would use our system. Rather than setting tasks, we wanted to observe how users would interact with the system when freely exploring the unimanual and bimanual controls. Thus we designed a session for free 3D modelling and performance animation.
6.1 Setup and Procedure
We tested our system on a tabletop setup. For modelling, we configured the interface to show a 3D view area with button controls on both sides to access legacy functionality. For animation, we configured the interface to show a 3D view area and a horizontal timeline. Six right-handed participants (4 M, 2 F) aged between 23 and 31 years took part in the study. Users had a varying skill level in modelling and animation, which gave us the opportunity to see how skill levels influence the acceptance of the new input modality. A short verbal introduction was followed by a 5-minute moderated warm-up phase to introduce the participants to the basic modelling functionality. Instructions were given on how to operate individual controls, but not on which hand to use for what, or whether and how to use both hands in combination. Then followed a 15-minute free modelling phase for which users were asked to model what they wanted and to use the tools to their own liking. We repeated the procedure for the performance animation aspect: once participants had a grasp of the system, they were asked to freely animate a character rig, for which they had 15 minutes.
6.2 Results
Single-touch control for legacy software features worked without problems. The multi-finger gestures for view and time control were detected throughout, with only one participant encountering problems when fingers were placed too close together for the touch tracking to tell them apart. Participants understood the view controls and basic object manipulation well and used them easily. They soon adopted a workflow of quickly switching between view and space control for manipulation in three dimensions; half of the participants did this with a dedicated hand for each control (see below). Selection of occluded and small objects was a problem for many participants. Most were able to overcome this by moving the view closer or around occluding objects. Indirect translation was successfully used to manipulate very small objects or in cluttered arrangements. Generally, inexperienced users had a much harder time comprehending spatial relationships. For animation, the time controls were effortlessly used to view recorded motion.
Five out of six participants explored bimanual control and increasingly used it. All combinations of unimanual control were employed (view/space, time/space, space/space) without participants receiving any instructions to do so. Three participants operated view or time with a dedicated NDH and space with a dedicated DH. Three used their NDH to pin the view for animating with their DH in character space. One participant even used his NDH to manually operate the playhead whilst concurrently animating a character feature with his DH. Only one participant did not employ any mode of bimanual control. Auto-constraints were not used as much as anticipated, possibly because we did not pose an explicit alignment task. Object pinning was hardly used; again this might be because we did not set up a specific situation where this control would prove helpful. However, view pinning for performance animation was easily understood and used, as the benefit of locking the view to a frame of reference was immediately apparent. Given the short timeframe and their lack of experience in performance animation, participants were able to create surprisingly refined character motion. Multi-track animation was mainly used to animate separate features in multiple passes, less to adjust existing animation. Additive mapping was used after some familiarisation. View pinning was successfully used to enable a more direct mapping in the local character frame of reference, as mentioned above. All participants created fairly complex models and expressive character animations within the short timeframe of 15 minutes each. In general, they stated having enjoyed using the system.
7 Discussion and Future Work
Our goal was to develop a system that integrates concepts for more parallel, direct, and expressive 3D modelling and animation on multi-touch displays. We now discuss to what extent we have found solutions to the four challenges we derived from this goal – extending a legacy system, finding unimanual controls, establishing bimanual controls, and enabling multi-touch performance animation. Integrating features into a legacy system. One of the major issues we encountered is that current software packages are not designed for parallel interaction, as internal event and GUI handling is geared toward single-focus/single-pointer interaction. We presented our modifications of the internal event system to remedy these problems: no strict modality for operators/dialogs, no "shortcuts" like event handling per area or widget, and context-dependent interpretation of touches, which requires suitable touch aggregation and internal ids for touches and software events/operators. Finding mappings for unimanual control. We demonstrated how to employ robust, easy to understand, and conflict-free unimanual mappings for view navigation, object manipulation, and timing control. We showed the benefits of using these mappings both for direct and indirect control. It remains to be seen how multi-finger mappings for higher-DOF feature control [10,8] compare to this. Developing strategies for bimanual control. Our unimanual mappings also enabled both asymmetric and symmetric bimanual interaction for object
manipulation, view and time control. This results in smoother interaction in the asymmetric case of one hand acting after the other, and it enables true simultaneous interaction for more advanced users. Mappings controlling more than 2 DOF most likely require more mental work on the users’ side. The 3D manipulation through alternating 2 DOF control and view change that our system enables potentially provides a good tradeoff between mental load and control. Our user study showed large acceptance of bimanual combinations. Users automatically took a more parallel approach to the view/space and time/space workflow with a dedicated hand for each task. For users with a lifetime of training in mainly unimanual operation of computer systems, this is not at all self-evident [19]. Bringing 3D layered performance animation to interactive surfaces. Our user study clearly demonstrated that direct coupling of input to output is perfect for performance animation. While additive and trajectory-relative control lose some of this directness, view pinning was successfully shown to provide a solution to this problem. With our fully working multi-touch 3D authoring system we have laid the basis for further work in this area.
8 Conclusion
The goal of this work was to bring techniques for more parallel, direct, and expressive modelling and animation into a usable application on interactive surfaces. We addressed this by presenting our working multi-touch system for 3D modelling and animation. We met four challenges that we identified in the course of reaching this goal. We described how we adapted legacy software for parallel multi-touch control. We designed and implemented multi-finger mappings for unimanual manipulation of view, objects, and time. We showed how these can be combined for efficient bimanual control and further presented several new specialised bimanual techniques. Our system also implements real-time performance animation that leverages the directness and expressiveness of multi-touch interaction. Furthermore, we reported on a user study showing that our system is usable even by novice users. Finally, we critically discussed our results and suggested future research directions.
References
1. Balakrishnan, R., Kurtenbach, G.: Exploring bimanual camera control and object manipulation in 3d graphics interfaces. In: Proc. CHI 1999. ACM, New York (1999)
2. Dontcheva, M., Yngve, G., Popović, Z.: Layered acting for character animation. ACM Trans. Graph. 22(3) (2003)
3. Esposito, C., Paley, W.B., Ong, J.C.: Of mice and monkeys: a specialized input device for virtual body animation. In: Proc. SI3D 1995. ACM, New York (1995)
4. Forlines, C., Vogel, D., Balakrishnan, R.: Hybridpointing: fluid switching between absolute and relative pointing with a direct input device. In: Proc. UIST 2006. ACM, New York (2006)
5. Forlines, C., Wigdor, D., Shen, C., Balakrishnan, R.: Direct-touch vs. mouse input for tabletop displays. In: Proc. CHI 2007. ACM, New York (2007)
6. Guiard, Y.: Asymmetric division of labor in human skilled bimanual action: The kinematic chain as a model. Journal of Motor Behaviour 19 (1987)
7. Hancock, M., Carpendale, S., Cockburn, A.: Shallow-depth 3d interaction: design and evaluation of one-, two- and three-touch techniques. In: Proc. CHI 2007. ACM, New York (2007)
8. Hancock, M., Cate, T.T., Carpendale, S.: Sticky tools: Full 6dof force-based interaction for multi-touch tables. In: Proc. ITS 2009. ACM, New York (2009)
9. Kaltenbrunner, M., Bovermann, T., Bencina, R., Costanza, E.: TUIO - a protocol for table based tangible user interfaces. In: Gibet, S., Courty, N., Kamp, J.-F. (eds.) GW 2005. LNCS (LNAI), vol. 3881. Springer, Heidelberg (2006)
10. Martinet, A., Casiez, G., Grisoni, L.: The design and evaluation of 3d positioning techniques for multi-touch displays. In: Proc. 3DUI, pp. 115–118. IEEE, Los Alamitos (2010)
11. Matejka, J., Grossman, T., Lo, J., Fitzmaurice, G.: The design and evaluation of multi-finger mouse emulation techniques. In: Proc. CHI 2009. ACM, New York (2009)
12. Moscovich, T., Hughes, J.F.: Multi-finger cursor techniques. In: Proc. GI 2006. Canadian Information Processing Society (2006)
13. Moscovich, T., Hughes, J.F.: Indirect mappings of multi-touch input using one and two hands. In: Proc. CHI 2008. ACM Press, New York (2008)
14. Moscovich, T., Igarashi, T., Rekimoto, J., Fukuchi, K., Hughes, J.F.: A multi-finger interface for performance animation of deformable drawings. In: Proc. UIST 2005. ACM, New York (2005)
15. Neff, M., Albrecht, I., Seidel, H.P.: Layered performance animation with correlation maps. In: Proc. EUROGRAPHICS 2007 (2007)
16. Oore, S., Terzopoulos, D., Hinton, G.: A desktop input device and interface for interactive 3d character animation. In: Proc. Graphics Interface (2002)
17. Reisman, J.L., Davidson, P.L., Han, J.Y.: A screen-space formulation for 2d and 3d direct manipulation. In: Proc. UIST 2009. ACM, New York (2009)
18. Sturman, D.J.: Computer puppetry. Computer Graphics in Entertainment (1998)
19. Terrenghi, L., Kirk, D., Sellen, A., Izadi, S.: Affordances for manipulation of physical versus digital media on interactive surfaces. In: Proc. CHI 2007. ACM, New York (2007)
20. Wilson, A.D.: Simulating grasping behavior on an imaging interactive surface. In: Proc. ITS 2009. ACM, New York (2009)
21. Wilson, A.D., Izadi, S., Hilliges, O., Mendoza, A.G., Kirk, D.: Bringing physics to the surface. In: Proc. UIST 2008. ACM, New York (2008)
22. Wu, M., Balakrishnan, R.: Multi-finger and whole hand gestural interaction techniques for multi-user tabletop displays. In: Proc. UIST 2003. ACM, New York (2003)
23. Zeleznik, R.C., Forsberg, A.S., Strauss, P.S.: Two pointer input for 3d interaction. In: Proc. SI3D 1997. ACM, New York (1997)
Illustrative Couinaud Segmentation for Ultrasound Liver Examinations
Ola Kristoffer Øye1, Dag Magne Ulvang1, Odd Helge Gilja2, Helwig Hauser3, and Ivan Viola3
1 Christian Michelsen Research, Norway {olak,dmu}@cmr.no
2 National Center for Ultrasound in Gastroenterology, Haukeland University Hospital, Bergen, and Institute of Medicine, University of Bergen, Norway [email protected]
3 University of Bergen, Norway {Helwig.Hauser,ivan.viola}@uib.no
Abstract. Couinaud segmentation is a widely used liver partitioning scheme for describing the spatial relation between diagnostically relevant anatomical and pathological features in the liver. In this paper, we propose a new methodology for effectively conveying these spatial relations during the ultrasound examinations. We visualize the two-dimensional ultrasound slice in the context of a three-dimensional Couinaud partitioning of the liver. The partitioning is described by planes in 3D reflecting the vascular tree anatomy, specified in the patient by the examiner using her natural interaction tool, i.e., the ultrasound transducer with positional tracking. A pre-defined generic liver model is adapted to the specified partitioning in order to provide a representation of the patient’s liver parenchyma. The specified Couinaud partitioning and parenchyma model approximation is then used to enhance the examination by providing visual aids to convey the relationships between the placement of the ultrasound plane and the partitioned liver. The 2D ultrasound slice is augmented with Couinaud partitioning intersection information and dynamic label placement. A linked 3D view shows the ultrasound slice, cutting the liver and displayed using fast exploded view rendering. The described visual augmentation has been characterized by the clinical personnel as very supportive during the examination procedure, and also as a good basis for pre-operative case discussions. Keywords: biomedical and medical visualization, illustrative visualization.
1 Introduction
Ultrasound examination is one of the most widely used diagnostic imaging methods. Its main characteristic is real-time acquisition, enabling the examination of the dynamics of the studied anatomical area with very high temporal resolution [15]. The key advantages over other existing imaging technologies include very good spatial resolution with the possibility to zoom into specific areas to achieve sub-millimeter precision, the absence of dangerous radiation, and broad availability due to its relatively low cost.
Fig. 1. Example of an illustrated Couinaud segmentation during a liver examination
Since ultrasonography essentially is the acoustic measurement of sound reflection, ultrasound imaging also makes it possible to investigate different acoustic characteristics at the same time, including Doppler imaging, elastography, harmonic imaging, and speckle tracking. Despite the high popularity of ultrasound (US) as a diagnostic modality or a modality for guiding interventions, it still has not attracted much attention in the scientific visualization research community. While numerous advanced visualization techniques have been proposed to explore computed tomography (CT) data, ultrasound data visualization still lags behind. One of the reasons for this underdevelopment is certainly the noisy character of the acquired images. The noise of an ultrasound image is influenced by several factors such as acoustic scattering, which is the physical phenomenon behind the speckle artifact, or strong absorption of the acoustic signal by certain tissue types such as bones or fat. Finally, the high dependency on the skills of the examiner is an aspect that differentiates ultrasound from other imaging modalities. High-quality 3D ultrasound is currently becoming available on the latest examination workstations. Although useful in some areas such as volume estimation [11], many diagnostically relevant questions are still better answered with 2D ultrasound. US-guided interventions also prefer 2D ultrasound over 3D rendering. One reason is that algorithms for 3D US rendering still do not provide sufficient image quality, and rendering techniques that are insensitive to substantial amounts of noise are still under development.
Fig. 2. Illustration of the Couinaud liver segmentation from a bottom to top view. Image is courtesy of Kari C. Toverud.
But even if this obstacle were resolved in the near future, 2D slice inspection would still remain attractive for several reasons, e.g., due to
the absence of occlusion among the structures of interest, no need to select a good viewpoint, and no need to place a clipping plane. In many cases, 2D ultrasound will therefore remain the preferred diagnostic US technique. On the other hand, special challenges are associated with 2D US imaging, including the need to stay oriented with respect to 3D space. Even experienced examiners find it difficult to quickly identify anatomical features when complex areas of anatomy are under inspection. To compensate for this, examiners usually slice through the 3D anatomical region multiple times from different directions and under different roll and pitch angles in order to improve their mental model of the inspected anatomy. The process of acquiring a mature mental map of a complex anatomical arrangement prolongs the examination and therefore contributes to the high costs of medical care. One such case of a complex anatomy inspection is the ultrasonic examination of the liver. A healthy liver consists of the hepatocytic parenchyma, the portal vein tree, the artery tree, the hepatic vein tree, and the biliary tree. The arterial tree and the portal vein tree provide blood for detoxification, and the hepatic vein tree passes the filtered blood into the inferior vena cava and back into the central circulatory system. During examinations it is often essential to evaluate the spatial relationship of pathologies with respect to these tree structures. In case an examination is followed by an intervention, the clinician has to design an access path or a surgical cut that avoids invading these anatomical trees. For navigational purposes, the liver is generally partitioned into several spatial regions. A widely used partitioning of the liver is the Couinaud segmentation [9], depicted in Fig. 2. This partitioning is based on the most frequent anatomical arrangement of the vascular trees and it partitions the liver into eight standard segments. When the US examiner prepares a report ahead of a surgical treatment, the Couinaud segmentation is used to define the spatial position of the pathology that is the subject of the surgery. The current state of the art is to refer to navigational posters hanging on the walls of scanning rooms all around the world, illustrating the Couinaud segmentation with respect to a standard liver anatomy. The illustrated model is mentally matched with the patient's anatomical specifics in order to determine in which segment(s) the pathology is located. This liver partitioning is quite complex and difficult to understand from freehand slicing, and these navigational posters are used even by experienced medical personnel. To allow for good orientation in the complex liver anatomy during 2D US examinations, examiners need improved assistive visualization that can replace the static navigational posters with dynamic patient-specific anatomical maps. In this paper, we present a visualization methodology that provides a new navigational aid during liver scanning. Our framework consists of two linked views. In a 2D view, a dynamic overlay over the US slice is shown to convey how the Couinaud segmentation intersects the current ultrasound plane. In a linked 3D view, the spatial placement of the US slice is presented in the context
of the 3D liver organ. An example of such a 2D and 3D guided visualization is depicted in Figure 1. The main contributions of this paper are:
– a novel approach to define an approximative Couinaud partitioning of the liver (Section 4.1)
– an interaction approach to sculpting the liver parenchyma model in order to improve the match with the subject's liver (Section 4.2)
– illustrative overlays and dynamic labeling of the 2D US slice (Section 5.1)
– an application of 3D illustrative visualization techniques to guide the examiner and support her orientation in the complex liver anatomy (Section 5.2)
2 Related Work
Ultrasound is an imaging modality with a lower signal-to-noise ratio than computed tomography or magnetic resonance imaging. The anatomical information can be well interpreted by an experienced examiner; automatic interpretation methods, however, often have difficulties providing sufficiently good results. In our work we therefore aim at assisting the examiner with approximative visual aids that ease the interpretation process, instead of aiming at automatic interpretation. For this purpose we apply illustrative visualization methods that inherently convey the approximative character of the presented visualization. In the following we review the most related work in the fields of ultrasound data visualization, illustrative visualization, object manipulation, and liver segmentation. Ultrasound visualization research has mainly focused on 3D rendering of ultrasonic data. Numerous methods have been proposed to provide a noise-free rendering of interfaces between tissues with different echogenicity. Performing noise reduction filtering prior to the image synthesis was first proposed by Sakas et al. [20]. Smoother images can also be achieved during the ray traversal by evaluating the distribution of echo along the viewing rays and estimating the probable interface position between tissues [14]. As ultrasound has a very high temporal resolution, analysis of redundancy in overlapping spatial regions from consecutive time-steps preserves the temporal coherence of the 3D rendering and also results in smoother rendering [16]. Ultrasound imaging is often used for the guidance of minimally invasive surgical treatments. During an intervention the clinician has to closely follow the screen and frequently switch focus back to the patient and to the surgical tools. An approach that attempts to reduce the focus switching is to employ advanced display technologies and project the ultrasound image directly onto the patient [3]. Advances in graphics hardware technology enable interactive performance of low-frequency global illumination methods. Recent work on volumetric lighting and scattering has demonstrated the superiority of low-frequency lighting over gradient-based local illumination models [18,23] when rendering 3D ultrasound data. Non-photorealistic rendering styles, such as contour rendering, have been applied to reduce the visual clutter in ultrasound volume data. Such reduction has been shown to be efficient for multi-modal visualization of B-mode and Doppler signals [17].
Illustrators employ various methods to reduce visual overload when creating hand-crafted illustrations, and contour depiction is one of them. Other methods include selective modulation of tissue transparency, artificial deformations, or exploded views. Many of these techniques have been adapted for interactive data visualization. Exploded views, for example, have been demonstrated on visualization of CT data [5]. A multimodal visualization of CT and US data for liver tumor ablations can effectively reduce spatial occlusion by employing the cut-away view metaphor [7]. Interactive 3D medical data visualization, when employed for visual communication purposes, can be significantly enhanced by providing textual descriptors. Textual labeling is necessary to provide descriptions of various anatomical regions. Therefore several available visualization systems support internal or external labeling [4,6]. However, label management in interactive visualization raises a number of challenges, such as placement, alignment with structural shape, or dynamic repositioning invoked by a change of viewpoint [13]. In our method we enhance the ultrasound slice with internal labels. Similarly to other approaches, we aim at placing the label in the center of the associated structure. Unlike approaches tailored for 3D visualization, which aim at positioning in 3D space [19], we aim at appropriate positioning on the plane defined by the ultrasound slice. Our approach allows for mesh modifications of the generic liver model. There are many techniques relevant to mesh editing [8], as this is an essential manipulation operation in commercial 3D modeling packages. The main difference between our approach and standard mesh manipulation techniques is the way the examiner employs the transducer as the interaction tool to achieve the desired mesh modifications. Enhancing ultrasound with higher-level semantics such as the Couinaud segmentation [22] in the case of liver examination also employs illustrative visualization methods and is tightly related to our approach. Their multimodal approach aims at translating semantics available for one modality to ultrasound via registration. Semantics such as liver segmentation can be obtained automatically from several contrast-enhanced computed tomography scans. Usually one enhancement phase is used to provide a good contrast between the parenchyma and the portal vein tree, and another enhancement phase provides a good contrast for the hepatic vein tree extraction. These two segmented trees can be used to perform a fully automatic liver segmentation which is nowadays used for establishing the surgery plan [21,2,10].
3 Illustrative Couinaud Segmentation of the Liver
Our visualization technology has resulted from specific needs of the clinical environment. Unlike previous work [22], which addressed the US slice augmentation with Couinaud segmentation information in a multimodal setting, we propose a new approach that does not depend on any additional 3D modality that carries the Couinaud segmentation. Instead of such an additional modality that provides the 3D information about the boundaries of the liver parenchyma, we utilize a standard 3D liver model, available from a generic human anatomy atlas [1]. The model is also partitioned according to the Couinaud segmentation. Naturally, such a generic model does not in general match the specific liver parenchyma of a particular human subject under examination. Therefore, we propose a procedure to establish a non-rigid transformation mapping of this generic model to the anatomical specifics of the human subject. This match is an approximation sufficient for fulfilling the navigational tasks. We suggest two steps to realize this model-to-patient matching (step two in Fig. 3). First, the vascular trees of the portal vein and the hepatic veins are matched so that the Couinaud segmentation, pre-defined in the generic model, is mapped to the patient's liver (Sec. 4.1). This ensures a good Couinaud partitioning; however, it does not guarantee a good match of the parenchyma boundaries. Therefore, we incorporate a second step that allows the examiner to modify the shape of the model to achieve the best possible match between the patient and the model (Sec. 4.2). Both matching steps are designed to be intuitive and to utilize the natural interaction of the examiner. Since the operator is using the US transducer during the examination, we chose a transducer with magnetic positional tracking (Flock of Birds, Ascension Technology, Vermont, USA) as the primary interaction tool. Its positioning in space defines the vascular tree placement in the model and the model transformation. After this matching procedure is finished, a space partitioning of the patient's anatomy is achieved. It is important to realize that our approach of adapting a generic model to a partitioning specified by an examiner is not image registration in the traditional sense, where a best possible match between two datasets is the primary concern. Rather, we build an approximate representation of the vascular structure and the parenchyma from the anatomical information specified by the examiner. The examiner is free to modify this representation throughout the examination, in case he decides that the representation is insufficiently precise for certain parts of the liver. Based on this representation, we enhance the ultrasound examination by including navigational aids and illustrative enhancements that communicate to the examiner which segments of the liver are inspected at a given slice position and orientation (step three in Fig. 3 and Sec. 5). This helps the examiner better conceive the geometry and spatial relations of the examined liver. Here it is important to convey the approximate borders between the segments, the number (ID) of each segment, and the approximate boundary of the liver organ.
Fig. 3. Illustrated Couinaud segmentation and examination workflow
Based on discussions with a certified medical illustrator, we decided to visualize the orientational aids as two linked views. In one view, the 2D US plane is shown with overlays that define the segment partitioning on the slice together with dynamically placed labels. The 3D view shows the US slice embedded in the personalized liver model. We suggest an illustrative exploded view metaphor to help the examiner better understand the liver cut as defined by the US slice, as well as guided viewpoint mechanisms that allow the examiner to see the 3D scene from good viewpoints without the need to manually control the viewpoint transformation. All orientation enhancements are presented in an illustrative style, and smooth surfaces are used to convey the approximative character of the space partitioning. The stages of our method for an illustrative Couinaud segmentation described above are depicted in Fig. 3. The following Sections 4 and 5 discuss the individual steps of our method in more detail.
4 Couinaud Partitioning and Liver Fitting
The Couinaud liver segmentation is a partitioning of the organ into eight segments that is based on the vascular trees in the liver. The portal vein enters approximately at the vertical midpoint of the liver organ. The entering vessel and its first bifurcations define a horizontal plane that partitions the liver into top and bottom compartments. The organ is then further partitioned by planes according to the three hepatic veins (the left, middle, and right hepatic veins), which are approximately perpendicular to the horizontal plane defined by the portal vein. Segment 4 is defined over two (neighboring) top and bottom compartments. An additional central Segment 1 is defined close to the inferior vena cava, enclosed from the bottom by the portal vein and from the top and sides by the hepatic veins [9].
4.1 Vascular Plane Selection and Model Fitting
The Couinaud segments are simple to model as a geometrical space partitioning using planes. Each of the four above-mentioned vascular structures defines a plane, and the set of these four planes is the basis for the Couinaud segmentation. The planes can be conveniently defined on the patient by positioning the tracked US transducer accordingly. When the US image coincides with one of these planes, a longitudinal cross-section through the corresponding vessel is visible, and the position of the probe is stored as a definition of the plane. Segment 1 is approximated by a sphere with its center between the portal vein and the junction point where the hepatic veins join to enter the inferior vena cava. For less experienced examiners, it can be challenging to determine which of the hepatic veins (left, center, right) is visible in the sagittal US image. We therefore provide an optional vein landmarking step, where the user tags each vein with one or more sphere landmarks, one color for each vein (Fig. 4 left). The landmarking procedure is performed using the transverse transducer orientation, where the vessels can be easily identified. Through landmarking, the spatial location of the veins is well defined. The user can use the landmarks for
Fig. 4. Users have the option to put landmarks on the three hepatic veins to help identify them when observed from another probe angle
orientation in the 3D view (Fig. 4 center), and as a means of identifying veins when examining the vein from an angle where not all hepatic veins are visible (Fig. 4 right). The positioning of the four planes defines the internal partitioning of the liver, but not the parenchyma. To approximate the parenchyma, a generic liver model is fitted to the geometry defined by the planes, as illustrated in Fig. 5. The generic parenchyma model (top) carries information of the partitioning planes, and by matching these planes to the patient-specific planes defined by the examiner (mid), we can obtain a transformation from the generic parenchyma model to a patient-specific parenchyma model (bottom). The physical relevance with respect to the Couinaud partitioning lies in the three intersections of the vertically aligned planes (hepatic veins) with the horizontally aligned plane. For each intersection, a coordinate system can be defined in the generic and the patient-specific partitioning. By mapping these onto each other, a transformation can be found. The coordinate systems are defined as follows: The origin p^i lies on the line of horizontal/vertical plane intersection (i indexes the three intersections, i = 1, 2, 3, and is cyclic so that for i = 3, i + 1 = 1). The origin p^i is found by projecting the probe position used for the respective plane definition onto the corresponding line of intersection. In general the non-orthogonal axes of the coordinate system are then defined as follows:

\[ \mathbf{v}_1^i = \mathbf{p}^{i+1} - \mathbf{p}^i, \qquad \mathbf{v}_2^i = \mathbf{l}^i \cdot |\mathbf{v}_1^i|, \qquad \mathbf{v}_3^i = (\mathbf{l}^i \times \mathbf{n}^i) \cdot |\mathbf{v}_1^i| \qquad (1) \]
where l^i is a normalized vector pointing along the i-th line of intersection and n^i is the normal of the i-th vertical plane. The axes v_2^i and v_3^i are multiplied by |v_1^i| so that we have the same local scaling along all three axes, which is necessary in order to have a correct scaling of the parenchyma model in all dimensions. These three coordinate systems are set up for the intersections both in the generic model and in the patient-specific fit, resulting in three pairs of coordinate systems. By matrix inversion, three transformations (T_1, T_2, and T_3) can be found that map each generic system to its patient-specific equivalent. This, however, only specifies the transformation for mesh vertices lying in the vertical planes. In order to also transform vertices between planes correctly, for each vertex we calculate in the vertex shader the distance to the two neighboring vertical planes and
interpolate the transformation matrix based on the respective distances:

\[ d_i = d(v, P_i), \quad d_j = d(v, P_j), \quad \mathbf{T} = \frac{d_j}{d_i + d_j}\,\mathbf{T}_i + \frac{d_i}{d_i + d_j}\,\mathbf{T}_j \qquad (2) \]

where i and j index the two neighboring vertical planes P_i and P_j for the vertex v. If the vertex has only one neighboring vertical plane (leftmost or rightmost segments), the transformation corresponding to this plane intersection is used. This operation results in a smooth transformation of the parenchyma mesh for vertices between planes. Normal vectors are transformed accordingly for proper shading of the model. The final result is shown in the lower part of Fig. 5. By defining the four planes, we also implicitly define a patient coordinate system that is independent of the tracking coordinate system. This is shown together with the fitted liver model in Fig. 5. The patient left vector is given by subtracting the transducer position used when defining the left plane from that of the right plane; the front axis, pointing into the patient's abdomen, is given by the orientation of the US probe while setting the planes. The patient up vector is then given by the cross-product of the left and the front vectors. This information is used for defining canonical viewpoints (Sec. 5.2).
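As an illustration of Eqs. (1) and (2), the following NumPy sketch builds the intersection frames and blends the resulting transformations per vertex. The paper performs this blending in a vertex shader on the GPU; the CPU-side formulation below, including all function and variable names, is only an assumed reconstruction, not the authors' implementation.

import numpy as np

def local_frame(p_i, p_next, l_i, n_i):
    # Non-orthogonal frame at one horizontal/vertical plane intersection, Eq. (1).
    v1 = p_next - p_i
    v2 = l_i * np.linalg.norm(v1)
    v3 = np.cross(l_i, n_i) * np.linalg.norm(v1)
    F = np.eye(4)                       # homogeneous 4x4 frame, origin in last column
    F[:3, 0], F[:3, 1], F[:3, 2], F[:3, 3] = v1, v2, v3, p_i
    return F

def frame_transform(generic_frame, patient_frame):
    # T_i maps the generic frame onto its patient-specific counterpart.
    return patient_frame @ np.linalg.inv(generic_frame)

def blend_vertex(v, T_i, T_j, plane_i, plane_j):
    # Eq. (2): weight the two neighboring plane transformations by distance;
    # each plane is given as a (point, normal) tuple.
    d_i = abs(np.dot(v - plane_i[0], plane_i[1]))
    d_j = abs(np.dot(v - plane_j[0], plane_j[1]))
    w_i = d_j / (d_i + d_j)             # the closer plane receives the larger weight
    T = w_i * T_i + (1.0 - w_i) * T_j
    return (T @ np.append(v, 1.0))[:3]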
4.2 Liver Sculpting
Fig. 5. The alignment of feature planes in the liver. Upper image: generic liver model with generic Couinaud segmentation, with one of the vertical/horizontal intersection coordinate systems depicted. Middle image: three transformations arising from the horizontal/vertical plane intersection coordinate system pairs. Lower image: the patient-specific fit. The segmentation procedure deduced the patient coordinate system shown at the bottom.
The fit described in the previous section is primarily a fit to the Couinaud segmentation, while the intermediate vertices are interpolated. Since the anatomical structure of the vessel tree exhibits considerable variations between individuals, the interpolated parenchyma fit is only an approximation. To allow the examiner to correct for deviations or to further specify the anatomy, we utilize the concept
Fig. 6. Examples of parenchyma sculpting interaction with the US transducer. The local mesh transformation is interpolated between the transformation of the anchor point and the identity transform based on the distance of the mesh point from the anchor point.
of interactive local mesh transformation, guided by the interaction with the US transducer (Fig. 6). The user locks a region of the liver model to a grab point as defined by the US transducer (by pressing a button), shown in Fig. 6 right. The transducer is then used to define a semi-local target displacement of the model, transforming the vertex v to v′ according to the movement of the transducer (Eq. 3) until the model is released by the user. Toggling the lock state places a grab point s on the US plane either near the transducer or towards the end of the US fan (near and far, Fig. 6 right). This influences whether the front wall or the back wall of the liver is primarily influenced by the transformation. Selecting the far end of the US fan ensures that, while operating on the patient's abdomen, the sculpting can be applied to regions that are otherwise not the first-hit intersection between the transducer's view vector and the liver geometry. As the transducer is moved, the transformation T from the anchor orientation to the current transducer orientation is calculated. Each vertex v is then transformed with the transformation T in the vertex shader, weighted with the normalized distance d̄ from the anchor point s. With increasing distance from the US plane, the transformation converges towards the identity map I. This ensures a local transformation that smoothly falls off when moving away from the anchor point. When the user is satisfied with the transformation, the current transformation matrix is pushed onto a stack of matrices, and the procedure can be repeated for another region of the model. The resulting stack of matrices is applied in the vertex shader. This also allows for the implementation of an undo functionality, since matrices can easily be popped from the stack. Fig. 6 shows sculpting examples and illustrates the principle of the transformation specification defined by Equation 3:

\[ T : s \mapsto s', \quad \bar{d}(s, v) = \frac{d(s, v)}{d_{max}}, \quad v' = \left( (1 - \bar{d}(s, v))\,\mathbf{T} + \bar{d}(s, v)\,\mathbf{I} \right) v \qquad (3) \]

where the distance d is normalized (d̄) by the maximal distance d_max, a parameter adjusting the localness of the transformation.
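A minimal sketch of the falloff blending of Eq. (3), again in NumPy for readability; in the prototype this runs per vertex in a shader and the recorded matrices are kept on a stack. Clamping the normalized distance to [0, 1] and all names are our assumptions.

import numpy as np

def sculpt_vertex(v, T, anchor, d_max):
    # Eq. (3): blend the grab transformation T with the identity, falling off
    # with the normalized distance of v from the anchor point s.
    d_bar = min(np.linalg.norm(v - anchor) / d_max, 1.0)
    M = (1.0 - d_bar) * T + d_bar * np.eye(4)
    return (M @ np.append(v, 1.0))[:3]

def apply_sculpt_stack(v, stack, d_max):
    # Apply the recorded grab operations (anchor, T) in order; popping the
    # last entry from the stack implements undo.
    for anchor, T in stack:
        v = sculpt_vertex(v, T, anchor, d_max)
    return v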
5 Illustrative Enhancements of Examination
After the rapid space partitioning and the optional parenchyma sculpting to fine-tune the patient-to-model fit of the liver geometry, all the necessary semantics are in place to be used for the illustrative examination enhancement. In this section we describe the concepts of several different enhancement techniques as well as their efficient implementations, utilizing the flexibility of today's graphics hardware. Both views encode the Couinaud segmentation by coloring each segment with a distinct color. Segment classification is calculated for each rendered fragment based on the distance from the individual segmentation planes in three dimensions.
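The per-fragment classification can be sketched as a point-in-half-space test against the four tracked planes plus the Segment 1 sphere. The sign conventions and the lookup table below are illustrative assumptions; the paper only states that the classification is based on the distances to the segmentation planes.

import numpy as np

def classify(p, horizontal, verticals, seg1_center, seg1_radius):
    # horizontal / verticals: (point, normal) tuples; verticals assumed ordered
    # left, middle, right. Returns an integer Couinaud segment label.
    if np.linalg.norm(p - seg1_center) < seg1_radius:
        return 1
    above = np.dot(p - horizontal[0], horizontal[1]) > 0.0
    # The number of hepatic-vein planes the point lies on the positive side of
    # selects one of four lateral compartments.
    lateral = sum(np.dot(p - q, n) > 0.0 for q, n in verticals)
    # Hypothetical (above?, compartment) -> segment lookup.
    table = {(True, 0): 2, (False, 0): 3, (True, 1): 4, (False, 1): 4,
             (True, 2): 8, (False, 2): 5, (True, 3): 7, (False, 3): 6}
    return table[(above, lateral)]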
5.1 Ultrasound Slice Enhancement
The illustrative enhancement of 2D US slices essentially consists of two approaches – a dynamic overlay of Couinaud segments, which can be represented either by an outline or by a filled area, and a simple and intuitive label placement, denoting the 3D Couinaud segments that are intersected by the 2D US plane. Both enhancements are demonstrated in Fig. 7.
Fig. 7. Illustrative enhancement of the ultrasound slice consisting of parenchyma border, inter-segment borders, and dynamic labels
The intersection of the US plane with the patient-specific model is shown as an outline in the US plane. This is achieved through clipping the mesh in the US plane and orienting the model so that the view vector points into the clip plane. Then, a render-to-texture pass that leaves only back-facing fragments visible is performed, resulting in what we denote as the visible interior map. This is a map which defines the visible part of the interior of each segment, which is part of the personalized liver model, for the given clip plane. The resulting texture from this rendering pass is smoothed with a Gaussian blur of configurable radius and used as a mask when rendering the border outline. The blur provides the smooth border line seen in the image, encoding the approximative character of the segmentation space partitioning. The rendering of the segmentation outline can be further enhanced with labels identifying which segment is represented by a particular outline. The label associated with an outline is depicted with the same color. The positioning of the label is based on the average fragment position in each visible segment inside the sliced liver parenchyma. For every frame, the border and segment information, as confined by the model outline mask, are rendered to a texture. Instead of using the colors from our segmentation color scheme, a given alpha value is written for each fragment, encoding which segment it belongs to (0.1, 0.2, ..., 0.8). This
texture is then passed to OpenCL processing, where a kernel identifies which segment a given pixel belongs to, based on its alpha value. The output is an 8 × 3 element array where the number of samples and the sums of the x and y coordinates of each segment are updated atomically by the kernel. This approach is analogous to an 8-bin histogram which sums up two coordinate values for each bin. The number of pixels contributing to each bin can be used for obtaining the average value of each coordinate. These values are used as the input coordinates for positioning the labels. The rendering and processing are performed on a reduced-resolution version of the US plane for improved performance. The labels are animated with a slightly delayed move towards the target position in order to be less sensitive to tracker noise and abrupt movements of the transducer. This ensures a more robust labeling movement. A drop shadow is added to the labels to enhance separation from the underlying data display.
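The label placement can be sketched with a CPU-side NumPy equivalent of what the OpenCL kernel accumulates atomically: per segment, the pixel count and the coordinate sums, from which the mean position is derived. The function name and the tolerance value are illustrative.

import numpy as np

def label_positions(alpha, tol=0.05):
    # alpha: 2D array of alpha values from the masked segment rendering,
    # where segment k is encoded as 0.1 * k. Returns segment -> (x, y) mean.
    ys, xs = np.indices(alpha.shape)
    positions = {}
    for seg in range(1, 9):
        mask = np.abs(alpha - 0.1 * seg) < tol
        count = mask.sum()
        if count > 0:
            positions[seg] = (xs[mask].sum() / count, ys[mask].sum() / count)
    return positions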
5.2 Illustrative Guidance in 3D Space
The 2D US slice enhancement gives a good overview of the intersected structures, but for conveying how the US slice is embedded with respect to the 3D extents of the liver model, a 3D view is more effective. To alleviate occlusion problems with respect to the US slice (by the liver organ) we employ an illustrative exploded view metaphor as well as configurable transparency of the liver mesh. Furthermore, we ensure that the examiner is provided with good views of the model without explicit manual interaction. This guidance of the focus is achieved by automatic viewpoint steering. An example of illustrative 3D enhancements is depicted in Fig. 8.
Fig. 8. Illustrative 3D enhancements conveying the embedding of ultrasound in the 3D liver model
The exploded view can provide the examiner with a better understanding of the spatial relations of the interior of the liver in a 3D context. In the exploded view, the liver model is cut along the US plane, and the interior becomes visible. The liver segments along the cut are uncovered and additionally textured with the US data. The exploded view implementation has been inspired by the GPU-based clipping approach of Gradinari [12]. First, the visible interior map is again rendered to a texture for the clip plane using the current viewing transformation, similarly to the outline rendering in the 2D view. Then the clipped part of the model is rendered and transformed according to the animated explosion, without offsetting the fragment coordinates for fragment coloring. Following this, a quad covering the exploded part is drawn in the clip plane, using the visible interior map to determine which fragments should be colored with the segments' colors and
textured with the ultrasound. The procedure is repeated for the second half of the exploded model. The explosion is animated with an angle from the US plane, so that the cut plane immediately becomes visible to a user looking along the US plane. During an examination, it is of great value to be able to orient the model according to the segmentation information. In our case, several guided viewpoints, utilizing the segmentation-defined patient-specific coordinate system, have been realized. The front view aligns the model so that the examiner can see the liver model along the horizontal plane, with the liver oriented according to the patient's up vector. Similarly, the examiner can select top-down, down-top, left-right, and right-left views. In addition, the examiner can bring the current position of the US plane into focus by transforming the view direction so that the ultrasound plane is in the center of the view. This is useful in cases when the examination position changes and the transducer moves out of the view. Finally, we provide a lock-to-transducer setting, in which the view is locked to the current US plane position. While scanning the patient and moving the transducer, the model will orient itself so that the US plane stays in the same position it was in when lock-to-transducer was toggled. As the examiner moves the US plane, the plane position is slightly transformed to provide a hint on the direction of the movement in the display. This gives a better understanding of the transducer movement, especially for additional clinical personnel participating in the examination who are not interacting with the transducer themselves. This nuance can provide an effective transducer movement cue when viewing the post-examination video sequence as an integrated part of a medical report.
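The guided viewpoints can be sketched by constructing a look-at matrix from the implicit patient coordinate system (left, front, up) of Sec. 4.1. The camera conventions, distances, and names below are assumptions; the paper does not specify them.

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def look_at(eye, target, up):
    f = normalize(target - eye)             # forward
    r = normalize(np.cross(f, up))          # right
    u = np.cross(r, f)                      # corrected up
    view = np.eye(4)
    view[0, :3], view[1, :3], view[2, :3] = r, u, -f
    view[:3, 3] = -view[:3, :3] @ eye
    return view

def canonical_view(name, center, left, front, up, distance):
    # name in {'front', 'top-down', 'down-top', 'left-right', 'right-left'}.
    directions = {'front': -front, 'top-down': up, 'down-top': -up,
                  'left-right': -left, 'right-left': left}
    eye = center + distance * normalize(directions[name])
    cam_up = front if name in ('top-down', 'down-top') else up
    return look_at(eye, center, cam_up)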
6 Case Study
The presented visualization technology has been implemented as a plugin extension of the VolumeShop rendering framework [4]. To enable live examination, we have attached the Epiphan LR framegrabber to a Vingmed System Five (GE Vingmed AS, Horten, Norway) ultrasound workstation, which streamed US images in real time to our rendering framework. The framerate of the framegrabber is about 30 Hz at 1280 × 1024 resolution and 24-bit RGB depth. No significant latency has been observed during the image transfer. The quality of the transferred images was comparable in terms of sharpness and contrast to those visible on the workstation monitor. The workstation supports positional magnetic tracking using the Bird sensor attached to the transducer. The tracking transformation, however, was difficult to stream into the system from the US workstation. Therefore we used magnetic tracking attached directly to the computer where our prototype was running. The implementation takes advantage of the latest graphics hardware, and most of the computationally intensive tasks are executed on the GPU (NVIDIA GeForce GTX 260). This resulted in interactive framerates between 15 and 30 Hz.
We demonstrate our illustrative enhancement technology with results from a US examination of a healthy male volunteer, aged 31, with a normal vascular tree topology. Fig. 9 shows the examination in the form of a screenshot storyboard.
Fig. 9. Case study screenshot storyboard with individual steps of our workflow
First, the transducer was used to place the four planes related to the vascular trees of the portal vein and the hepatic veins (Fig. 9.1–.4). In the first four 2D views it is possible to see the plane placement along the portal vein bifurcation, the left hepatic vein, and the middle and right hepatic veins. The associated 3D view shows each plane after successful placement, represented by a quad. Segment 1 is defined separately by placing a spherical approximation between the inferior vena cava and the portal vein (Fig. 9.5). The selection procedure took approximately five minutes and resulted in the parenchyma mesh fit shown in the 3D view. The examiner can then enable the illustrative overlays on request for orientational purposes. The 2D view offers various representations of the cross-section with the Couinaud partitioning, such as contour-based (Fig. 9.6) or fill-based overlays (Fig. 9.7) accompanied by dynamic labels. The 3D view offers exploded views (Fig. 9.7) or guided camera viewpoints (Fig. 9.8–.10) for better navigation in 3D without the need for explicit viewpoint selection. While inspecting the liver, the examiner observed a misfit between the model and the parenchyma of the volunteer from a specific US position (Segments 2 and 4). This is visible in the eighth 2D slice view. The geometry was transformed using the transducer to give a better match. The better match is shown in the ninth frame of the slice view; the 3D view shows a slight modification of the 3D shape of the liver model. The evaluation of the fit and the consequent adaptation took approximately two minutes. Afterwards, the liver was examined from different transducer positions, employing all presented functionalities for the 2D and the 3D view. The illustrative enhancements assisted in gaining a clear
understanding of the spatial relations within the liver. The illustratively enhanced examination was shown on an external monitor visible to the doctor as well as to the volunteer. Even the volunteer, as a person without any medical background, was able to easily follow how the Couinaud partitioning relates to the structures in his liver.
7 Discussion
We have tested our technology in the context of US examinations of six healthy volunteers. The transducer was operated by three different US examiners from the Section of Gastroenterology, Department of Medicine, which is our main clinical partner in this project. The testing environment is shown in Fig. 10. The advanced visualization technology has raised interest among the participating examiners; they considered it a practical navigational aid that can assist their work. Moreover, they stated that the Couinaud segmentation enhancement, if made available as functionality in the scanner software, would have a large group of potential users, i.e., abdominal US examiners worldwide. A good level of understanding of the examination procedure was also reported by the non-medical participants, in our case the volunteers themselves. This observation indicates that such an illustrative enhancement of difficult-to-understand examinations can be communicated effectively between the doctor and the patient. At the moment, we do not explicitly handle the movement within the body caused by the subject's respiration. The examiners have developed an accepted procedure, where the patient's respiration is paused on request in a particular phase of the respiration cycle, known as the stop'n'go approach. The navigational aids are spatially valid for this phase of the respiration. Other navigational aids, which are already implemented in US workstation software, are successfully used in the clinical environment exactly with this controlled respiration process. Although our methods do work with the controlled respiration procedure, we plan to integrate patient-specific respiration gating into our visualization technology in the future. We expect that this will provide an additional speedup in performing the Couinaud segmentation procedure, as the patient can then be examined not only during breath-hold. The Couinaud segmentation is a widely accepted liver partitioning, although it does not anatomically fit all human individuals. In some individuals the topology of the hepatic veins might show larger variations. Examiners use the Couinaud partitioning for the localization of pathologies, even if the Couinaud segmentation is, strictly speaking, not compatible with a subject's anatomy. Being a higher level of abstraction, the Couinaud segmentation is still reported to be sufficient for preparing the examination report in such cases. For computer-supported surgery planning, the segmentation of the liver must be precisely aligned with the specifics of the patient's anatomy. In such a case, we can imagine adapting our segmentation procedure to allow for a higher flexibility in terms of the number of planes specified for a non-standard liver case. This
would probably prolong the segmentation procedure, but the extra time would be traded for higher flexibility for the surgery planner. With US examination as the application case in mind, straight planes are sufficient for approximating the geometry of the associated vascular trees. For surgery planning, however, a more precise fit with the anatomy is needed, and curved planes seem to be an improvement. For such a scenario we can imagine offering a semi-automatic solution, where a straight plane specified by the clinician can be curved by adding surface control points for better alignment with the vessel. State-of-the-art liver surgery planning systems require at least one early-enhanced and one late-enhanced CT scan. Providing a good surgery planning procedure based on US examination will strongly reduce the exposure of patients to ionizing radiation. We keep this application area as an aim for our future work.
Fig. 10. Testing the prototype implementation of our illustrative Couinaud segmentation enhancement methods
Our work also has implications for the more theoretical foundations of medical visualization. Traditionally, in medical visualization there has been a relatively strict border between patient-specific semantics and generic models that represent an average human anatomy. Live abdominal US examinations scan features that are constantly moving due to respiration and cardiac cycles. It is extremely difficult in such a setting to perform a pixel-accurate feature matching for an entire organ such as the liver. Often, when a US examination precedes other 3D imaging acquisitions, there is no 3D data available which could enable the extraction of patient-specific features. The discussions with our clinical partners have convinced us that a certain level of approximation can be acceptable, as the precise match between the model and a patient can be finalized mentally. In some anatomical areas the match has to be precise for treatment decision making. Particularly in interventions that are US-guided, it is very important that the clinician is offered functionality for fine-tuning the matching. We have therefore designed an approximative model matching procedure with the possibility to fine-tune the match for specific areas of the parenchyma. It has been recognized as the most appropriate trade-off between time and precision, specifically for the case of a guided US examination. An approximative patient-specific approach therefore has an interesting consequence for the binary categorization between a generic model and patient-specific semantics. It provides a visual abstraction of the anatomical representation, where
the examiner has full control over the level of detail. This concept blurs the strict border between patient-specific semantics and a generic model, and is well aligned with the characteristics of the examination type and the employed modality. In our approach we match the patient anatomy to the Couinaud segmentation, which relates to the most common vascular topology among humans. The matching procedure is optimized to be performed quickly on a regular vascular tree. In future work we will also investigate methods to handle special topologies, for example when the patient has already undergone liver surgery.
8 Conclusions
In this work we have presented a new methodology for assisting ultrasound liver examinations with an approximative illustrative visualization of the patient-specific Couinaud segmentation. The approximative fit is generated based on the positions of important anatomical features in the subject's liver, such as the portal and hepatic vein vessel trees. When a better fit for specific areas is required, the mesh can be additionally manipulated by interacting with the transducer. Ultrasound examiners, who are used to utilizing static navigational posters, consider the new technology highly useful and provided a number of comments to improve the workflow. Specifically, interaction mechanisms for approximative delineation of pathological areas, respiration gating, and export functionality for advanced medical reporting were suggested as high-priority features to include in the next version of the prototype. After incorporating these features, we plan a thorough clinical evaluation including several examiners from different clinical facilities and patients with different liver pathologies. In the long term, we aim at providing visualization technology for assisting the entire medical procedure of liver pathology treatment imaged with ultrasound, from examination, via surgery planning, up to the treatment.
References
1. Anatomium: 3D anatomy model data sets of the entire 3D human anatomy web site (2010), http://www.anatomium.com/
2. Bade, R., Riedel, I., Schmidt, L., Oldhafer, K.J., Preim, B.: Combining training and computer-assisted planning of oncologic liver surgery. In: Proceedings of Bildverarbeitung für die Medizin, pp. 409–413 (2006)
3. Bajura, M., Fuchs, H., Ohbuchi, R.: Merging virtual objects with the real world: Seeing ultrasound imagery within the patient. In: Proceedings of SIGGRAPH 1992, pp. 203–210 (1992)
4. Bruckner, S., Gröller, M.E.: VolumeShop: An interactive system for direct volume illustration. In: Proceedings of IEEE Visualization 2005, pp. 671–678 (2005)
5. Bruckner, S., Gröller, M.E.: Exploded views for volume data. IEEE TVCG 12(5), 1077–1084 (2006)
6. Bürger, K., Krüger, J., Westermann, R.: Direct volume editing. IEEE TVCG 14(6), 1388–1395 (2008)
7. Burns, M., Haidacher, M., Wein, W., Viola, I., Gröller, M.E.: Feature emphasis and contextual cutaways for multimodal medical visualization. In: Proceedings of EuroVis 2007, pp. 275–282 (2007)
8. Chen, M., Correa, C.D., Islam, S., Jones, M.W., Shen, P.Y., Silver, D., Walton, S.J., Willis, P.J.: Manipulating, deforming and animating sampled object representations. Computer Graphics Forum 26(4), 824–852 (2007)
9. Couinaud, C.: Le foie: Études anatomiques et chirurgicales. Masson Edition, France (1957)
10. Erdt, M., Raspe, M., Suehling, M.: Automatic Hepatic Vessel Segmentation Using Graphics Hardware. In: Dohi, T., Sakuma, I., Liao, H. (eds.) MIAR 2008. LNCS, vol. 5128, pp. 403–412. Springer, Heidelberg (2008)
11. Gilja, O.H., Hausken, T., Berstad, A., Ødegaard, S.: Invited review: Volume measurements of organs by ultrasonography. Proceedings of the Institution of Mechanical Engineers 213(3), 247–259 (1999)
12. Gradinari, A.: Bonus Article: Advanced Clipping Techniques. In: More OpenGL Game Programming, Course Technology PTR (2005), http://glbook.gamedev.net/moglgp/advclip.asp
13. Hartmann, K., Götzelmann, T., Ali, K., Strothotte, T.: Metrics for functional and aesthetic label layouts. In: Butz, A., Fisher, B., Krüger, A., Olivier, P. (eds.) SG 2005. LNCS, vol. 3638, Springer, Heidelberg (2005)
14. Hönigmann, D., Ruisz, J., Haider, C.: Adaptive design of a global opacity transfer function for direct volume rendering of ultrasound data. In: Proceedings of IEEE Visualization 2003, pp. 489–496 (2003)
15. Ødegaard, S., Gilja, O.H., Gregersen, H.: Basic and New Aspects of Gastrointestinal Ultrasonography. World Scientific, Singapore (2005)
16. Petersch, B., Hadwiger, M., Hauser, H., Hönigmann, D.: Real time computation and temporal coherence of opacity transfer functions for direct volume rendering of ultrasound data. Computerized Medical Imaging and Graphics 29(1), 53–63 (2005)
17. Petersch, B., Hönigmann, D.: Blood flow in its context: Combining 3D B-mode and color Doppler US data. IEEE TVCG 13(4), 748–757 (2007)
18. Ropinski, T., Döring, C., Rezk-Salama, C.: Interactive volumetric lighting simulating scattering and shadowing. In: Proceedings of PacificVis 2010, pp. 169–176 (2010)
19. Ropinski, T., Praßni, J.S., Roters, J., Hinrichs, K.H.: Internal labels as shape cues for medical illustration. In: Proceedings of Workshop on Vision, Modeling, and Visualization, pp. 203–212 (2007)
20. Sakas, G., Schreyer, L.A., Grimm, M.: Preprocessing and volume rendering of 3D ultrasonic data. IEEE Computer Graphics and Applications 15(4), 47–54 (1995)
21. Soler, L., Delingette, H., Malandain, G., Montagnat, J., Ayache, N., Koehl, C., Dourthe, O., Malassagne, B., Smith, M., Mutter, D., Marescaux, J.: Fully automatic anatomical, pathological, and functional segmentation from CT scans for hepatic surgery. Computer Aided Surgery 6(3), 131–142 (2001)
22. Viola, I., Nylund, K., Øye, O.K., Ulvang, D.M., Gilja, O.H., Hauser, H.: Illustrated ultrasound for multimodal data interpretation of liver examinations. In: Proceedings of VCBM 2008, pp. 125–133 (2008)
23. Šoltészová, V., Patel, D., Bruckner, S., Viola, I.: A multidirectional occlusion shading model for direct volume rendering. Computer Graphics Forum 29(3), 883–891 (2010)
Iconizer: A Framework to Identify and Create Effective Representations for Visual Information Encoding Supriya Garg, Tamara Berg, and Klaus Mueller Computer Science Department, Stony Brook University {sgarg,tlberg,mueller}@cs.stonybrook.edu
Abstract. The majority of visual communication today occurs by way of spatial groupings, plots, graphs, data renderings, photographs and video frames. However, the degree of semantics encoded in these visual representations is still quite limited. The use of icons as a form of information encoding has been explored to a much lesser extent. In this paper we describe a framework that uses a dual-domain approach involving natural language text processing and global image databases to help users identify icons suitable for visually encoding abstract semantic concepts. Keywords: human-computer interaction, non-photorealistic rendering.
propose that these biases in preference will emerge naturally when mining large collections of images taken and posted to the internet by people. Icons (in computing) have been around since the 1970s to make computer interfaces easier to understand for novice users, mapping concepts to standardized visual representations. The majority of these icons are symbolic representations of applications that need to be memorized by the user. Clip art, on the other hand, aims to be more descriptive and is meant for illustration. A very narrow set of clip art is used in practice, often marginally matching the situation at hand. A third option for selecting iconic representations is to use web search to gather images fitting a desired concept. However, for complicated concepts, this results in limited success because multiple queries or lengthy searching must be performed to match a concept exactly. The framework we present provides a computer-aided system that allows users to quickly and effectively design clip art that is well targeted to their concept of interest. We achieve these capabilities by extending and synthesizing techniques rooted in non-photorealistic rendering and computer graphics, image processing, web-scale content-based image retrieval and natural language processing. The ability to design well-targeted expressive clip art in a cohesive illustrative style enables applications at a scale much grander than a singleton. We may use them for the illustrations of documents, books, manuals, and the like, and they are also applicable to visualize taxonomies of objects and even more general concepts. Further, they can replace or complement textual annotations and photographs within node-link diagrams often used in analytical reasoning tasks, making these representations much more expressive. Our paper is structured as follows. Section 2 presents the general philosophy behind our methodology, which is rooted in a joint lexical and visual analysis. Section 3 presents previous work in the area of visual languages and icon generation as well as the background of our approach. Section 4 describes our approach in detail. Section 5 presents results and some discussion on our system, and Section 6 ends the paper with conclusions and future work.
2 Overall Motivation and Philosophy
We aim to find Visual Information Encodings (VIEs) that are intuitive, i.e., are already part of one's visual vocabulary. This avoids the need for memorization of a set of dedicated symbols for iconic communication. VIEs are relatively easy to find for most objects and actions because they can be observed in real life and are already part of one's visual vocabulary – yet their interpretation and aesthetics still leave much room for artistic freedom in determining the best VIE design. However, as with visual languages, the greatest challenge comes from determining good VIEs for abstract concepts. Take, for example, the concept ‘travel’. When asked, people will offer a wide variety of possible VIEs for such concepts, and this variety is also reflected in the query results of image search engines. Hence, we desire a VIE that reaches the broadest consensus among a sufficiently wide population. We propose an indirect approach to find this consensus, circumventing the need for an active solicitation of user responses to candidate VIEs. Instead we exploit existing public lexical databases like WordNet [7] augmented with aggregated statistical
Fig. 1. Specialization from an unknown person to Inspector Blanding
information in the form of lexical triggers [1] and couple these with public image search engines. These triggers are computed by analysing thousands of documents and looking for words that commonly co-occur. Thus we resort to employing methods that statistically analyse data that humans have produced and which in some form represent their view of the world. Examples of lexical triggers are ‘sky → blue’ or ‘travel → passport’. We also require a similar notion on the image side, that of visual triggers, to map concepts to suitable VIEs. The notion of a visual trigger is not readily available from any current image database. We propose a dual-domain approach to (incrementally) build and format these visual triggers. We note that this is an extremely large undertaking and all we can do here is to propose a methodology by which this could be done. Our approach uses the available lexical triggers to allow users to interactively explore the concept space, select suitable representative concepts, and finally create icons. As mentioned, concepts can range from very specific to fairly general. Figure 1 illustrates this via an example that shows a conceptual zoom across many conceptual levels, here from ‘Man’ all the way to a specific person ‘Inspector Blanding’. These conceptual zooms are not just multi-resolution representations obtained by low-level abstraction, i.e., by intensity or gradient domain filtering. Rather, they are semantic zooms, i.e., categorical refinements or generalizations within an object hierarchy.
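As an illustration of how such lexical associations could be gathered, the sketch below queries WordNet through NLTK for synonyms, hypernyms, and hyponyms of a concept. The use of NLTK is our assumption for the example only; the paper's system additionally relies on Lexical FreeNet trigger relations and Basic English, which are not covered here.

from nltk.corpus import wordnet as wn

def lexical_neighbors(term, max_senses=3):
    # Collect synonyms, hypernyms (generalizations), and hyponyms
    # (specializations) for the first few WordNet senses of a term.
    neighbors = {'synonyms': set(), 'hypernyms': set(), 'hyponyms': set()}
    for synset in wn.synsets(term)[:max_senses]:
        neighbors['synonyms'].update(l.name() for l in synset.lemmas())
        for h in synset.hypernyms():
            neighbors['hypernyms'].update(l.name() for l in h.lemmas())
        for h in synset.hyponyms():
            neighbors['hyponyms'].update(l.name() for l in h.lemmas())
    return neighbors

# For a query such as 'travel', the hyponym set provides more specialized
# terms that can seed additional, more coherent image queries.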
3 Related Work and Background
Visual languages range from icon algebra [17] to the encoding of all information into multi-frame artist-developed cartoon-like renditions [3]. The set of icons is typically fixed, developed manually, and can be composited. Semanticons [28] is an innovative way to create new file icons by abstracting terms occurring in the file or file name along with a commercial database of images. None of these applications exploit any semantic analysis, nor do they make use of the large publicly available lexical and image databases to broaden the semantic base for abstraction and enable VIE learning. Other related work includes that of Rother et al. [27], which enables the automatic creation of collages from image collections to compose a single image with blended collection highlights. Here the user has no control over the layout of images inserted into the collage. Alternatively, Photo Clip Art [18] provides an interface that enables the insertion of photo-realistic objects into new images, correctly constrained to be in a natural-looking context within the resulting image. Their goals, however, are different from ours in that their visual objects are meant to enrich or compose graphical scenes and collages, without placing special emphasis on conveying specific
Fig. 2. System block diagram
semantic information. Finally, there is research targeting the automatic illustration of text via application of 3D graphics engines. The systems by Götze et al. and Götzelmann et al. [10,11] annotate their renderings according to the underlying text, while Word-Eye [4] analyses descriptive text according to a set of hand-coded rules to produce 3D scene renderings. These systems require a library of 3D models to which their illustrations are limited. The Story-Picturing-Engine [15] has a greater gamut, indexing image search engines with exact lexical terms occurring in the text, but no further semantic interpretation, integration, or visual abstraction is made. One high-level challenge we face is how to find representative or iconic images for a given query term. For this purpose we can exploit the vast number of images available on the Internet. The sheer number of images presents challenges not experienced in traditional, well-organized, labelled collections of images. Recent work has explored methods for choosing the most representative or canonical photographs for a given location [29] or monument [20] through clustering, where hard geometric constraints can be used since the object is a single instance seen from multiple viewpoints. These methods are less useful here since we apply our system to object categories that vary quite widely in appearance. Other methods work towards finding the most aesthetically pleasing search results, since returning a poor-quality image is probably never aligned with a user's needs [5,26]. We propose a human-computer integrated approach that combines textual analysis with image clustering techniques and human-aided image selection. Another area with a similar high-level goal to our work is the problem of image classification for content-based image retrieval (CBIR). These systems utilize purely image-based information for content analysis and retrieve images by measuring their similarity to a given query image (an in-depth review is presented by Datta et al. [6]). The most successful approaches typically integrate a variety of colour, texture, shape, or region-based cues. However, the problem of content-based retrieval is extremely challenging and far from solved for most object categories, and these systems are often helped by having a human in the loop to guide the search process via relevance feedback [9,12,32]. We take such an approach here.
Fig. 3. The user interface. (a) Query-box (b) Related words (c) Translations to foreign languages (d) History + query builder (e) Utility words (f) Image results
4 Approach
Our overall system is depicted in Figure 2. It consists of two main components, the I-bridge builder and the VIE designer. If our concept has a direct physical representation, then our job is fairly easy. Otherwise, we require an indexical sign (I-sign). I-signs must use a good ontological metaphor by which the abstract concept is represented as something concrete, such as an object, substance, container, person, or some visual action. Good mappings improve distances in conceptual space, moving the concept closer to the visual encoding. The first part of our framework is designed to build this bridge crossing conceptual space – we call it the iconicity-bridge or I-bridge. We use lexical databases and image search engines to derive potential I-bridges. Our second step takes the resulting images to create a VIE that represents the most central visual theme of a concept (the graphics-based VIE-Designer). This involves clustering the images, finding the median in each cluster, and finally abstracting the median so that it captures the common features of all objects in the cluster. In the following sections, we describe each of these components in detail.
4.1 The I-Bridge Builder
A good I-bridge is an association that is deeply rooted in our semantic understanding of the world. Such associations consist of pure lexical classifications, such as synonyms, antonyms, hyponyms (specializations), and hypernyms (generalizations), as well as statistical knowledge about co-occurrence relations between terms in
documents or spoken language (so-called trigger relations). The former is captured in public lexical databases such as WordNet [7], while we use Lexical FreeNet [1] to provide the statistical knowledge component. Basic English [21] gives us the equivalent of a word in restricted English – e.g. ‘bombshell’ translates to ‘great shock’. Utility or helper terms are also provided which, combined with the original query, often help narrow image results to a particular visual metaphor. For example, utilizing the ‘gear’ utility with the query ‘travel’ might direct a user towards images of suitcases or backpacks. Lastly, language translators are used to translate concepts that are polysemous (have multiple meanings), or are brand names in English. In our web-based interface (Figure 3), all of these lexical and statistical associations are exposed to the user as tools for enabling conceptual connections and exploring the space of an input concept. This results in a powerful interactive interface to help the user build an effective I-bridge. Figure 3a shows the query input box. Figure 3b displays semantically related words from WordNet and statistically related words from Lexical FreeNet. Figure 3c provides translations of the query concept into four languages plus Basic English, while Figure 3d maintains a history of the explored query concepts during an I-bridge building session. In Figure 3e helper terms such as ‘equipment’ or ‘tool’ are provided to enable focusing on particular visual senses. Finally, Figure 3f displays the top results from Google image search. After the initial display of results, the user can do several things. In a perfect world, several relevant images would be found on the first go, and the user can store them in the saved-images panel. Otherwise, the user can continue browsing the lexical space until the images reflect his desired concept. For example, the user might want to explore concepts semantically or statistically related to his query (Figure 3b). Alternatively, the user could use the query builder to make complex queries. For example, for a query like ‘art’, the image results returned might be too ‘artsy’ and not represent the high-level concept ‘art’ in a simple, concise manner. An image query of ‘art’ + ‘supplies’ (a utility word from Figure 3e) gives more concrete results such as images depicting coloured pencils, crayons or paint. The image search results themselves can suggest good I-bridges. For example, the image results for travel include an airplane, a map and a compass, suitcases, and a person on the beach. In order to construct a good VIE for the user’s candidate I-sign, we will require a large number of relevant images to mine for the most iconic visual representation. In our experience with the I-bridge builder we have observed that more specialized queries tend to return image sets that are more coherent. For example, the query ‘man’ returns a diverse set of images with many depictions, while the image set for ‘police man’ is more homogeneous. This reveals a powerful strategy for I-sign learning: Join all the image instances obtained with specializations of the target term, obtained through our semantic analysis, and then use this collection to build the VIE. In order to get a diverse and comprehensive set of images for a query, we download the top 200 results from Google Image Search. We ignore the later results as they tend to become much more noisy and unreliable.
Instead, to increase the size of our data set while maintaining high quality, we translate our queries into 4 other languages and collect the top 200 results from each translated term.
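As an illustration of the lexical expansion behind the I-bridge builder, the following sketch collects WordNet relations for a query term and combines it with helper terms into image-search queries. It assumes NLTK's WordNet interface and omits the Lexical FreeNet trigger relations and Basic English translations that the actual system also uses; the helper terms listed are only examples, not the system's actual vocabulary.

# Sketch: expanding a query concept with WordNet relations, in the spirit of the
# I-bridge builder. Assumes NLTK with the WordNet corpus installed; Lexical
# FreeNet triggers and Basic English translations are not reproduced here.
from nltk.corpus import wordnet as wn

def expand_concept(query, max_terms=10):
    """Collect synonyms, hypernyms and hyponyms for a query term."""
    related = {"synonyms": set(), "hypernyms": set(), "hyponyms": set()}
    for synset in wn.synsets(query):
        related["synonyms"].update(l.name().replace("_", " ") for l in synset.lemmas())
        for hyper in synset.hypernyms():
            related["hypernyms"].update(l.name().replace("_", " ") for l in hyper.lemmas())
        for hypo in synset.hyponyms():
            related["hyponyms"].update(l.name().replace("_", " ") for l in hypo.lemmas())
    return {k: sorted(v)[:max_terms] for k, v in related.items()}

def build_image_queries(query, helper_terms=("equipment", "tool", "gear")):
    """Combine the query with helper terms to narrow image-search results."""
    return [query] + ["{} {}".format(query, h) for h in helper_terms]

if __name__ == "__main__":
    print(expand_concept("travel"))
    print(build_image_queries("travel"))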
Fig. 4. Transformation of a median image to the final icon. (a) Images with corresponding edge images; the randomly sampled points are highlighted, and the green points are the ones with no good match. (b) Building the average weighted edge image for the exemplar based on edge matching, followed by image abstraction to design the final "icon".
4.2 The VIE-Designer

Constructing visual equivalents of model-based abstractions requires a semantic abstraction of these I-signs, which, as mentioned above, goes well beyond the image-based abstraction methods available today. More concretely, we seek a picture of the given concept that unifies all of the concept's known facts, but abstracts away the unknown facts. In images, a 'fact' is expressed as a visual feature, or a collection of features. An image set that bears feature 'noise' is a set of images that share some features (facts), but also contain a wide selection of other random features (unknown facts). We can construct an average or exemplar image for a category by looking for features in common across a set of queries. In this section, we present our algorithm to extract the basic icon for a set of images. Given a set of images belonging to the same category, we can find the most common shape features among them, and produce an icon. This part of our system is implemented in MATLAB.

4.3 Exemplar Finding

In our system, since the images we use are results of queries to image search engines, most images will have a simple layout with the object of interest covering a large part of the image and a relatively clean background. Hence, we can use a global scene descriptor like gist [22], which captures the general layout and shape of the image subject, to cluster our images. Gist is an image descriptor commonly used in computer
vision and graphics applications. Gist provides a whole-image feature descriptor encoding a coarse representation of the oriented edges at a range of frequencies present in an image. Since the size of the descriptor depends on the size of the original, we resize all images to a fixed size of 160 × 160 to compute equivalent descriptors. Next, we cluster the images and select cluster exemplars using affinity propagation (AP) [8]. This is a state-of-the-art clustering method that takes as input measures of similarity between data points and 'preference' values indicating the preference for a particular data point to be an exemplar, i.e., a cluster center. The algorithm then selects a set of good exemplars and corresponding clusters through message passing. In our case we measure similarity between images using the Euclidean distance between their gist descriptors, and set the 'preference' values to be the Google Image Search rank of each image (since images appearing earlier in the ranking tend to better reflect the search term). Given the resulting clusters, we order the images within each cluster based on their similarity to the exemplar. Finally, we select the cluster used to build our final VIE. This choice is based on two criteria – the size of the cluster and the average distance from the exemplar.

4.3.1 Image Abstraction

At this stage we have a cluster representing a query, and its exemplar. Instead of presenting the exemplar as the icon, we abstract it so that only the relevant details present in all the images in the cluster are maintained, while removing the details specific to them. This works well only in cases where the whole object can be cleanly separated from the background. In cases where all the images have a dense background (e.g., in the case of animals), we simply use a non-photorealistic (NPR) version of the exemplar as the VIE. The NPR version works better than the original image when multiple VIEs share screen space – the abstraction process modifies images such that they look like they have come from the same source.
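The exemplar-finding step can be sketched as follows with scikit-learn's affinity propagation, assuming the gist descriptors have already been computed elsewhere for images resized to 160 × 160. The way the search rank is scaled into 'preference' values is our assumption; the paper only states that earlier-ranked images are preferred as exemplars.

# Sketch of exemplar finding: cluster gist-style descriptors with affinity
# propagation, using negative squared Euclidean distances as similarities.
# gist_features is assumed to be an (n_images, d) array computed elsewhere;
# the rank-to-preference scaling below is heuristic, not the paper's.
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_images(gist_features, search_ranks):
    diff = gist_features[:, None, :] - gist_features[None, :, :]
    similarity = -np.sum(diff ** 2, axis=-1)          # pairwise similarities
    ranks = np.asarray(search_ranks, dtype=float)     # 0 = top search result
    # Earlier-ranked images get larger (less negative) preference values,
    # making them more likely to be chosen as exemplars.
    preference = similarity.min() * (ranks + 1.0) / len(ranks)
    ap = AffinityPropagation(affinity="precomputed", preference=preference,
                             random_state=0)
    labels = ap.fit_predict(similarity)
    exemplars = ap.cluster_centers_indices_           # indices of cluster exemplars
    return labels, exemplars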
Fig. 5. Icon for average car. It is built by finding the combined median of the 9 sub-categories shown here. The images under each column are the top results for that category. The top row shows edges at the scale at which the cars look most similar.
Fig. 6. Icons showing the taxonomy under the category "Small Appliances". On the right, we zoom in to the icons for the sub-category "Coffee, tea, espresso".
To distil an exemplar down to its VIE we use a shape representation to find edges in the exemplar that are also present in the other images within the cluster – relevant details – and remove those edges that are not present in other cluster images – details specific to only the exemplar. We first randomly sample feature points (locations at which to compute local features) from the edges present in each image. Then we extract shape-context [2] descriptors for each of these feature points. Shape-context is a computer vision feature descriptor that describes object shape with respect to a given feature point by computing a log-polar histogram of the edges surrounding that feature point. Next we perform a shape based alignment between the exemplar and the remaining images using the Hungarian algorithm to find the optimal one-to-one matching between the point sets (Figure 4). The Hungarian algorithm is a well-known combinatorial optimization algorithm which solves the assignment problem in polynomial time. Similarity is measured between two points using the Euclidean distance between their normalized shape-context vectors. We then remove bad feature point matches based on: a) low shape-context feature similarity, and b) large distances in image space (indicating false matches between points in very different parts of the object). This method works well at matching objects which are misaligned due to rotation or translation, but cannot handle objects that are flipped. Since the Hungarian algorithm is cubic in complexity, the running time grows fast with the number of points sampled on the edges. We typically use 100 points to give us a good balance between speed and accuracy. At this complexity level, the shape matching takes a few seconds per image, with the total time dependent on the size of the cluster. At this stage, we are left with point pairs that are highly similar. We assign scores to the exemplar points based on their similarity to the matching points in the other images. Further, the remaining points on the exemplar edges are assigned scores equal to the nearest exemplar point on a connected edge. This gives us a complete weighted edge image for the exemplar. We show an example for blender in Figure 4 displaying the original edge image, matched edge points, and final weighted edge image. In this figure, the colour map goes from white to black via yellow and orange. We can see here that the outer edges are red/orange and black, indicating edges present in both the exemplar and the non-exemplar image. We repeat this step by
Fig. 7. Icons for different concepts discussed in the evaluation section
matching the exemplar to all the other images within the exemplar's cluster. In the end, the final weighted edge image is the mean of the weighted edge images calculated against each non-exemplar blender image. The final weighted edge image helps us design the abstracted version of the object. We use Poisson-blending [25] guided by the noise-free edges to create an abstracted illustration [23,31] by removing the features at higher levels. Finally, we add back the edges calculated in the previous step to give it a more defined and iconic look. The final icon for the blender is almost a silhouette, but retains important details like the base, the jar, and the cap (Figure 4b).
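The matching step described above can be sketched with SciPy's implementation of the assignment problem; the shape-context descriptors and edge points are assumed to be precomputed, and the rejection thresholds are illustrative rather than the paper's actual parameters.

# Sketch of matching exemplar edge points to another image's edge points.
# desc_a/desc_b are precomputed shape-context descriptors (n_points x d),
# pts_a/pts_b the corresponding point coordinates; the thresholds are
# illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_edge_points(desc_a, pts_a, desc_b, pts_b,
                      max_desc_dist=0.5, max_image_dist=40.0):
    # Cost = Euclidean distance between normalized shape-context vectors.
    cost = cdist(desc_a, desc_b)
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
    matches, weights = [], []
    for i, j in zip(rows, cols):
        # Discard matches with poor descriptor similarity or that jump to a
        # very different part of the image (likely false matches).
        if cost[i, j] > max_desc_dist:
            continue
        if np.linalg.norm(np.asarray(pts_a[i]) - np.asarray(pts_b[j])) > max_image_dist:
            continue
        matches.append((i, j))
        weights.append(1.0 - cost[i, j] / max_desc_dist)   # higher = more similar
    return matches, weights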
5 Results and Evaluation

Figure 5 demonstrates the exemplar-finding algorithm. Here we use the problem of finding the average car within a collection of many cars. We find the following nine subcategories for a car, since these represent car categorization by shape: sedan, coupe, convertible, sports car, SUV, van, pickup truck, station wagon, and minivan. For each category we use gist-based affinity clustering to get a set of representative visual triggers. The user selects the best cluster(s) from each category, and finally we cluster together the top-ranked images using our shape-context-based clustering. This gives us the average car and shows the utility of clustering at multiple scales. At the end of clustering, a sedan emerges as the average car. Next, in Figure 6, we demonstrate our entire algorithm (exemplar-finding and abstraction) for the construction of a taxonomy visualization. Here we seek to assign an icon to each level, instead of just to the leaves. We first calculate the icons for the leaf nodes using the algorithm outlined above. For the inner nodes with only leaf nodes as children, we form a collage of at most the top four children under it. This order can simply be calculated based on popularity – for example, on Amazon.com, the subcategories always appear in the order of popularity. For example, when someone selects the category "Small Appliances", they are probably looking for a coffee maker. Further, when forming the collage, the more important categories are allocated more space. As we move further up the taxonomy tree, one node will have many subcategory trees underneath it. In order to keep the icons compact and representative, we just percolate the icon for the top sub-category upwards.
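The percolation of icons up the taxonomy can be sketched as a simple recursion, assuming the children of each node are already ordered by popularity; build_collage stands in for the actual collage layout, which allocates more space to more important sub-categories.

# Sketch of assigning icons to inner taxonomy nodes. Children are assumed to be
# ordered by popularity; build_collage is only a placeholder for the real
# collage layout.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str
    icon: Optional[str] = None                         # e.g., a path to an icon image
    children: List["Node"] = field(default_factory=list)  # ordered by popularity

def build_collage(icons):
    # Placeholder: the real system lays out up to four icons, giving more
    # space to the more popular sub-categories.
    return "collage(" + ", ".join(icons) + ")"

def assign_icon(node: Node) -> str:
    if not node.children:                # leaf: icon comes from the VIE designer
        return node.icon
    child_icons = [assign_icon(c) for c in node.children]
    if all(not c.children for c in node.children):
        node.icon = build_collage(child_icons[:4])   # inner node with only leaves
    else:
        node.icon = child_icons[0]       # percolate the top sub-category's icon upward
    return node.icon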
Our system requires user input and interaction at various stages of the VIE design. To get insight and feedback from multiple people, we had members of our lab interact with the system. The users identified different concepts they were interested in visualizing, and we built the I-Bridge using our system. Some interesting results (as shown in Figure 7) were: • Success: Person climbing a ladder (via Google images) • Oil spill: Images of the duck in the oil spill. This is an example where a current event highly modifies the most relevant icon. • Renaissance: Triggers artists like Shakespeare and Michelangelo. We can represent the concept using the people, or the art created by them. • Gothic: We can represent this concept by using an icon with a person in gothic attire (via Google Images), or by using examples of gothic architecture (via Synonym relation) • Affinity: Its synonym kinship gives us images of family trees • Countries: The associated terms and images give us the political map, flag, and landmarks (Taj Mahal for India, Great Wall for China) of the country. • Thrill: Rollercoaster. This indeed represents a good icon for representing the experience of a thrill.
6 Conclusions and Future Work

In this paper, we presented an approach which can accomplish the goal of finding good visual information encodings for concepts we are interested in. This requires the integration of many fields – linguistics, vision, computer graphics, and user interfaces, with a human in the loop. Our framework has great prospects in the design of clip art for various applications, such as taxonomies, book illustrations, and the expressive augmentation of graphical node/link diagrams to make these much more engaging and informative. In the future, we plan to fully integrate our framework into a graph drawing engine, use abstractions more freely to summarize certain facts and attributes, and use compositions for compact visual storytelling with context and key players. Given the current status of implementation, we believe that we can deploy our interface and backend processes to a wider circle of users, over the web. Such a community-driven effort will likely result in much more robust icons, and give further insight into personal preferences. We plan to evaluate both usability and performance, in a conjoint manner (using the approach in [9]) using three types of experiments – determining I-signs given textual concepts, choosing between two I-signs for a textual concept, and finally, given an I-sign, choosing among two alternative textual concepts. Nevertheless, it is undisputed that not all concepts have good visual representations and encodings. This is particularly true for difficult non-object concepts such as 'worship'. For those concepts that do lend themselves well to visual encodings, we believe that the power of our approach is its ability to communicate possibly quite subtle differences much more efficiently than textual descriptions.

Acknowledgement. This research was supported by NSF grants CCF-0702699 and CNS-0627447.
References 1. Beeferman, D.: Lexical discovery with an enriched semantic network. In: Proceedings of the ACL/COLING Workshop on Applications of WordNet in Natural Language Processing Systems, pp. 358–364 (1998) 2. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002) 3. Chuah, M.C., Eick, S.G.: Glyphs for software visualization. In: 5th International Workshop on Program Comprehension (IWPC 1997) Proceedings, pp. 183–191 (1997) 4. Coyne, B., Sproat, R.: Wordseye: An automatic text-to-scene conversion system. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 487–496 (2001) 5. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Studying aesthetics in photographic images using a computational approach. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3953, pp. 288–301. Springer, Heidelberg (2006) 6. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (CSUR) 40(2), 5:1–5:60 (2008) 7. Fellbaum, C.: others: WordNet: An electronic lexical database. MIT Press, Cambridge, MA (1998) 8. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–977 (2007) 9. Giesen, J., Mueller, K., Schuberth, E., Wang, L., Zolliker, P.: Conjoint analysis to measure the perceived quality in volume rendering. IEEE Transactions on Visualization and Computer Graphics 13(6), 1664–1671 (2007) 10. Götze, M., Neumann, P., Isenberg, T.: User-Supported Interactive Illustration of Text. In: Simulation und Visualisierung, pp. 195–206 (2005) 11. Götzelmann, T., Götze, M., Ali, K., Hartmann, K., Strothotte, T.: Annotating images through adaptation: an integrated text authoring and illustration framework. Journal of WSCG 15(1-3), 115–122 (2007) 12. He, J., Tong, H., Li, M., Zhang, H.J., Zhang, C.: Mean version space: a new active learning method for content-based image retrieval. In: Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 15–22 (2004) 13. Hoffman, D.D.: Visual intelligence: How we create what we see. WW Norton & Company, New York (2000) 14. Horn, R.E.: To Think Bigger Thoughts: Why the Human Cognome Project Requires Visual Language Tools to Address Social Messes. New York Academy Sciences Annals 1013, 212–220 (2004) 15. Joshi, D., Wang, J.Z., Li, J.: The Story Picturing Engine—a system for automatic text illustration. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP) 2(1), 68–89 (2006) 16. Kovalerchuk, B., Brown, J., Kovalerchuk, M.: Bruegel iconic correlation system. Visual and Spatial Analysis, 231–262 (2004) 17. Kovalerchuk, B.: Iconic reasoning architecture for analysis and decision making. In: Visual and Spatial Analysis, pp. 129–152. Springer, Netherlands (2004) 18. Lalonde, J., Hoiem, D., Efros, A.A., Rother, C., Winn, J., Criminisi, A.: Photo clip art. ACM Transactions on Graphics, TOG (2007) 19. Leyton, M.: Symmetry, causality, mind. The MIT Press, Cambridge (1992)
20. Li, X., Wu, C., Zach, C., Lazebnik, S., Frahm, J.M.: Modeling and recognition of landmark image collections using iconic scene graphs. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 427–440. Springer, Heidelberg (2008) 21. Ogden, C.K.: Basic English: a general introduction with rules and grammar. K. Paul, Trench, Trubner (1944) 22. Oliva, A., Torralba, A.: Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision 42(3), 145–175 (2001) 23. Orzan, A., Bousseau, A., Barla, P., Thollot, J.: Structure-preserving manipulation of photographs. In: Proceedings of the 5th International Symposium on Non-Photorealistic Animation and Rendering, pp. 103–110 (2007) 24. Palmer, S., Rosch, E., Chase, P.: Canonical perspective and the perception of objects. In: Attention and Performance IX, pp. 135–151 (1981) 25. Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. ACM Trans. Graph. 22(3), 313–318 (2003) 26. Raguram, R., Lazebnik, S.: Computing iconic summaries of general visual concepts. In: Proc. of IEEE CVPR Workshop on Internet Vision, pp. 1–8 (2008) 27. Rother, C., Bordeaux, L., Hamadi, Y., Blake, A.: AutoCollage. ACM Transactions on Graphics (TOG), 847–852 (2006) 28. Setlur, V., Albrecht-Buehler, C., Gooch, A., Rossoff, S., Gooch, B.: Semanticons: Visual metaphors as file icons. In: Computer Graphics Forum, pp. 647–656 (2005) 29. Simon, I., Snavely, N., Seitz, S.M.: Scene summarization for online image collections. In: Proc. of ICCV, pp. 1–8 (2007) 30. Strothotte, C., Strothotte, T.: Seeing between the pixels: pictures in interactive systems. Springer-Verlag, New York, Inc. (1997) 31. Strothotte, T., Schlechtweg, S.: Non-photorealistic Computer Graphics: Modeling, Rendering, and Animation. Morgan Kaufmann Pub., San Francisco (2002) 32. Zhou, X.S., Huang, T.S.: Relevance feedback in image retrieval: A comprehensive review. Multimedia Systems 8(6), 536–544 (2003)
A Zone-Based Approach for Placing Annotation Labels on Metro Maps

Hsiang-Yun Wu (1), Shigeo Takahashi (1), Chun-Cheng Lin (2), and Hsu-Chun Yen (3)

(1) Dept. of Complexity Science and Engineering, The University of Tokyo, Kashiwa, Chiba, Japan
[email protected], [email protected]
(2) Dept. of Industrial Engineering and Management, National Chiao Tung University, Hsinchu, Taiwan
[email protected]
(3) Dept. of Electrical Engineering, National Taiwan University, Taipei, Taiwan
[email protected]
Abstract. Hand-drawn metro map illustrations often employ both internal and external labels in a way that assigns sufficient information, such as textual and image annotations, to each landmark. Nonetheless, automatically tailoring the aesthetic layout of both textual and image labels together is still a challenging task, due to the complicated shape of the labeling space available around the metro network. In this paper, we present a zone-based approach for placing such annotation labels so that we can fully enhance the aesthetic criteria of the label arrangement. Our algorithm begins by decomposing the map domain into three different zones where we can limit the position of each label according to its type. The optimal positions of labels of each type are evaluated by referring to the zone segmentation over the map. Finally, a new genetic-based approach is introduced to compute the optimal layout of such annotation labels, where the order in which the labels are embedded into the map is improved through an evolutionary computation algorithm. We also provide a semantic zoom functionality, so that we can freely change the position and scale of the metro map.
1 Introduction
Annotating landmarks with texts and images is a popular technique for guiding specific travel routes and places of interest, especially in commercially available guide maps. In such map annotation, both internal and external labeling techniques play an important role. Internal labels are placed close enough to the reference points called sites, and usually used to annotate landmarks with small labels containing textual information. On the other hand, external labels are often used to assign large annotation labels, such as reference thumbnail photographs, and usually spaced sufficiently apart from the corresponding sites,
Fig. 1. Examples of annotation label layouts in Taipei MRT maps: (a) A conventional layout where image labels are placed closer to the corresponding sites. (b) The proposed layout into which commonly used design in hand-drawn maps is incorporated.
whereas line segments called leaders are introduced to connect the sites and labels to clarify their correspondence. Internal labeling instantly allows us to find correspondence between the sites and annotation labels, while it usually suffers from space limitations, especially when we have to fit many labels into a small space around the map content. External labeling overcomes this problem by seeking more labeling space away from the map content, but at the cost of applying leaders that often disturb the visual quality of the annotated map. In hand-drawn metro map illustrations, the mixture of internal and external labels is effectively employed to apply textual and image annotation labels. However, the arrangements of such textual and image labels are quite different in that the textual labels are placed in the vicinity of the corresponding site, while the image labels are more likely to be aligned around the corner of the map domain or along its boundaries, so that we can fully enhance the aesthetic arrangement of such labels on the entire map as shown in Fig. 1(b). Indeed, conventional approaches attach image labels in the same way as the textual ones, resulting in a map that cannot maintain a visually plausible arrangement of image labels, as seen in Fig. 1(a). Nonetheless, formulating this kind of aesthetic map design as a computational algorithm is still a challenging task in the sense that we have to solve a rectangle packing problem, which usually leads to a well-known combinatorial NP-hard problem. Furthermore, the problem becomes more complicated especially for annotating metro maps. This is because the labeling space around the metro network often consists of multiple non-rectangular regions including small ones, where we cannot directly apply conventional boundary labeling techniques either. In this paper, we present a new approach for placing textual and image annotation labels on the metro maps while maintaining the above aesthetic arrangement of such labels in the map domain. This is accomplished by segmenting the entire map into content, internal, and external zones, so that we can arrange annotation labels according to their label types. We compute such zone-based segmentation by applying conventional image processing techniques to dilate the network of metro lines on the map image. This segmentation also allows us to
introduce the potential field over the outermost external zone, where we can aesthetically align the image annotation labels along the map boundaries by referring to the potential values. Note that in our setup, we take a metro network as input in which each vertex of the underlying graph represents a station together with its geographical position and each edge corresponds to the metro route between the corresponding pair of adjacent stations. We also assume that textual and image annotation labels correspond to each station of the metro network, and retain the name of the station and a thumbnail photograph of the view around that station, respectively, as shown in Fig. 1. We formulate the optimal placement of textual and image annotation labels as a combinatorial optimization and search problem, employing genetic algorithms (GA) as a heuristic optimization method to simulate the process of natural evolution. In the process of solving the problem, each chromosome is defined as a value-encoding sequence of label IDs, where each label is embedded into vacant labeling space in a greedy fashion one by one in order. In our approach, in order to limit the number of possible positions of each label over the map, the labeling space around the metro network is decomposed into a set of square grids in each zone, which effectively reduces the computational complexity for the label placement. The final label layout is obtained by optimizing the function that penalizes the number of missing labels and the sum of the normalized leader lengths. The remainder of this paper is structured as follows: Sect. 2 provides a survey on conventional techniques for label placement and metro map visualization. Sect. 3 describes how we can segment the entire map domain into several zones using image processing techniques. Sect. 4 provides our new genetic-based approach to fitting the textual and image annotation labels within the labeling space aesthetically around the metro map content. After having presented several experimental results in Sect. 5, we conclude this paper in Sect. 6.
2 Related Work
Annotating point features has been one of the fundamental techniques especially in the area of cartography. Christensen et al. [5] conducted an empirical study on several heuristic algorithms for this purpose. Applying internal labels to point features has been intensively investigated so far, and several schemes for interactive design of label placement [8], real-time label arrangement [16], and consistent label layout in dynamic environments [1] have been developed. On the other hand, the concept of boundary labeling has been introduced by Bekos et al. [3] as one of the external labeling techniques. They provided mathematical formulations together with efficient algorithms for arranging labels on the boundary margins around the rectangular content area. Lin et al. also performed theoretical studies on several different types of correspondences between the sites and boundary labels including one-to-one correspondence [14], multiple-to-one correspondence [13], and its improved version with hyper-leaders and dummy labels [12].
Employing both internal and external labels provides an effective means of annotating map features while it has still been a challenging research theme. Several functional and aesthetic criteria have been proposed for this purpose by formulating the label layouts in hand-drawn illustrations [10], which were followed by a real-time algorithm for annotating 3D objects [9]. However, these schemes used the external labels as the replacements of internal ones when the labels cannot be placed close enough to the corresponding sites, and always tried to embed both types of labels as close as possible to the sites. (See Fig. 1(a).) Bekos et al. [2] also presented the combined use of internal and external labels while the external labels were expected to stay on the predefined boundary margins only. Our proposed approach differs from the conventional ones in that it also takes advantage of a set of small empty space around the given metro network as the labeling space, while keeping the aforementioned aesthetic layout of textual and image annotation labels on the entire map domain. Combinations of internal and external labels have also been employed to annotate various targets including column charts [17], 3D virtual landscapes [15], surfaces of 3D bumpy objects [6], and 3D illustrations [7]. Metro map visualization itself is an interesting theme that has been intensively researched recently [21]. In this category, the metro map was aesthetically deformed first and the stations were then annotated by textual labels. Hong et al. [11] presented an approach to visualizing metro maps based on the spring models, Böttger et al. [4] developed a scheme for distorting metro maps for annotation purposes, Stott et al. [19] formulated multiple criteria for optimizing metro map layout and placing textual labels, and Nöllenburg and Wolff [18] employed mixed-integer programming to draw and label metro maps in a visually plausible manner. In our approach, we respect the original geometry of the given metro network and focus on the optimal placement of annotation labels only.
3 Zone-Based Segmentation of the Map Domain
Our approach begins by partitioning the metro map domain into three zones: the content, internal, and external zones. The content zone tightly encloses the metro network and we do not place any labels in that zone. The internal zone is next to the content zone and we can place textual labels only there. The external zone is the complement of the previous two zones and we can embed any type of label there. Note that in this zone-based map segmentation, we can lay out textual labels on both the internal and external zones while image labels can be placed within the external zone only, as shown in Fig. 3.

3.1 Dilating Metro Lines on the Map Image
For defining the three different zones over the metro map, we first generate the metro network image by drawing the given metro lines, then apply morphological dilation operations to the image to synthesize dilated metro network images. Note that, in our implementation, we employed the ordinary 3 × 3 rectangular kernel mask, which the OpenCV library provides by default.
Fig. 2. Partitioning a metro map domain into three zones. (a) Original metro network rendered from the input graph data. (b) Binary metro network image. (c) Content zone obtained by a small number of dilation operations. (d) Internal zone obtained by a medium number of dilation operations. (e) Tentative zone obtained by a large number of dilation operations. (f) Potential field obtained by applying Gaussian filtering to (e). Black and white colors correspond to high and low potential regions, respectively.
Although Cipriano and Gleicher [6] applied both the dilation and erosion operations to compute the scaffold surface over the input 3D bumpy objects, our approach just uses the dilation operations since our aim here is to design the zone-based segmentation of the labeling space around the metro network. Fig. 2 shows how an original metro network image is dilated for obtaining the aforementioned three zones of the map domain. Fig. 2(a) represents the original metro network image rendered by taking as input the graph of metro lines. The first task here is to convert this color image into a black-and-white binary image as shown in Fig. 2(b). We are now ready to define the content zone by applying dilation operations nc times to this binary image and extract the black region from the resulting dilated image in Fig. 2(c). The internal zone can be defined in the same way as the black region in Fig. 2(d) by applying the dilation operations ni times, where ni > nc. Finally, the complement of the content and internal zones is defined as the external zone, which corresponds to the white region in Fig. 2(d). Note that, in our implementation, we set nc = 4 and ni = 32 by default in the above image dilation stages.
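A minimal sketch of this zone segmentation, assuming OpenCV's Python bindings and the default nc = 4 and ni = 32, might look as follows; the inversion step simply makes the black metro lines the foreground for the dilation.

# Sketch of the zone segmentation, assuming OpenCV's Python bindings.
# network_img is the binary metro-network image (white background, black lines).
import cv2

def segment_zones(network_img, n_content=4, n_internal=32):
    lines = cv2.bitwise_not(network_img)   # metro lines become the bright foreground
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))   # ordinary 3x3 kernel
    content = cv2.dilate(lines, kernel, iterations=n_content) > 0
    internal_full = cv2.dilate(lines, kernel, iterations=n_internal) > 0
    internal = internal_full & ~content    # ring around the content zone
    external = ~internal_full              # complement of content and internal zones
    return content, internal, external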
3.2 Defining Potential Fields for Image Annotation Labels
We also define a potential field specifically for placing image annotation labels so that we can align them along the boundary of the map domain in a greedy fashion for later use. (See Sect. 4.) The potential field that we are going to formulate here is similar to the distance field that Götzelmann et al. [9] used for annotating 3D objects, while it is different in that our potential field is the
Fig. 3. (a) Zone-based segmentation on Taipei MRT map. White, orange, and gray colors (with gradation) indicate the content, internal, and external zones, respectively. In the external zone, black and white colors indicate high and low potential regions, respectively. (b) Taipei MRT map with textual and image annotation labels.
reversed version of the distance field in order to align image labels along the map boundary rather than in the neighborhood of the corresponding site. We also keep potential values rather uniform in the region away from the boundary to make image labels freely move around in that region to avoid undesirable conflicts with other labels. The potential field has been defined in our approach again by applying dilation operations ne times, where ne > ni, to the binary image in Fig. 2(b), so as to obtain the sufficiently dilated metro network image as shown in Fig. 2(e). This dilated image is then blurred with the Gaussian filter to obtain the potential field as shown in Fig. 2(f), where the black and white colors indicate high and low potential regions, respectively. In our implementation, ne = 64 by default.
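Continuing the sketch above, the potential field can be approximated by a heavier dilation followed by Gaussian blurring; the blur kernel size and the scoring convention for image-label positions are our assumptions, since the paper specifies only ne = 64 and the use of a Gaussian filter.

# Sketch of the potential field for image labels (Sect. 3.2), assuming OpenCV.
# The Gaussian kernel size is an assumption; the paper only states that a
# Gaussian filter is applied after n_e = 64 dilations.
import cv2
import numpy as np

def potential_field(network_img, n_external=64, blur_ksize=51):
    lines = cv2.bitwise_not(network_img)   # metro lines as foreground
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    heavy = cv2.dilate(lines, kernel, iterations=n_external)
    blurred = cv2.GaussianBlur(heavy, (blur_ksize, blur_ksize), 0)
    # High values over the (dilated) map content, falling off towards the map
    # boundary. In our reading, image-label positions are scored by summing this
    # field under the label footprint (Sect. 4.2), with lower sums preferred so
    # that image labels gravitate to the corners and boundaries.
    return blurred.astype(np.float32) / 255.0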
3.3 Discretizing the Map Domain into Grid Square Cells
Basically, we can find good positions for each label by referring to the zone-based map partition and potential field that we have obtained. However, allowing the annotation labels to move over the map domain pixel by pixel leads to an excessive degree of freedom in their positions. In our approach, we discretize the map domain into a set of grid square cells in order to effectively limit the number of available positions for each label, which allows us to reduce the search space for optimized label placements. Note that in our implementation, we fixed the side length of the grid square to be 16 pixels. This discrete representation of the map domain can be easily obtained by dividing the side lengths of the image by the grid square size. We then refer to each pixel value of the resized image for retrieving the zone-based map segmentation and potential values. In our implementation, we use the OpenCV library again to perform the image resizing operations. Fig. 3(a) represents an example of the resulting set of grid square cells together with the zone-based segmentation and potential field described previously.
Fig. 4. A chromosome defined as a value-encoding sequence of label IDs. The annotation labels will be fit into the vacant space around the metro network one by one.
4 Genetic-Based Optimization of Label Placement
In this section, we present a new approach for automatically laying out textual and image annotation labels in the space of arbitrary shape around the metro network. Our idea here is to introduce a genetic-based approach that allows us to fit annotation labels effectively into the limited labeling space.

4.1 Encoding the Order of Label IDs
In our genetic-based formulation, each chromosome is defined as a value-encoding sequence of label IDs as shown in Fig. 4, where the annotation labels will be fit into vacant labeling regions in a greedy fashion one by one in the order the label IDs appear in the sequence. We first initialize the chromosome pool by a set of randomly ordered sequences of label IDs, then improve the quality of the chromosome pool by discarding bad chromosomes and reproducing better children from the selected fine parent chromosomes using crossover and mutation operations. Note here that these crossover and mutation operations are carried out while maintaining the condition that each label ID appears only once in each chromosome sequence. Of course, we only consider sites that are contained in the current map domain when we zoom in/out the map content.
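The actual system builds on GAlib; as an illustration of operators that preserve the one-occurrence-per-label-ID property, the following sketch uses ordered crossover and swap mutation, which are standard permutation-preserving choices rather than the specific operators used in the implementation.

# Sketch of permutation-preserving GA operators for chromosomes of label IDs.
# These stand in for whichever operators GAlib provides in the real system.
import random

def ordered_crossover(parent_a, parent_b):
    n = len(parent_a)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = parent_a[i:j + 1]               # copy a slice from parent A
    fill = [g for g in parent_b if g not in child]   # remaining IDs in B's order
    it = iter(fill)
    for k in range(n):
        if child[k] is None:
            child[k] = next(it)
    return child                                     # each label ID appears exactly once

def swap_mutation(chromosome, rate=0.05):
    chromosome = list(chromosome)
    for k in range(len(chromosome)):
        if random.random() < rate:
            other = random.randrange(len(chromosome))
            chromosome[k], chromosome[other] = chromosome[other], chromosome[k]
    return chromosome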
4.2 Greedy Placement of Textual and Image Annotation Labels
In our GA-based formulation, the position of each label is uniquely determined by the order of the corresponding label IDs in the chromosome. For this purpose, we have to seek the best position of each label while avoiding possible overlaps and crossings with other labels that have already been fixed so far. For placing annotation labels, we try to fit each of them into a vacant region as close as possible to the corresponding site. For example, suppose that we have already placed the first three labels of the sequence in Fig. 4 step by step, and try to find the optimal position of Label #1 next as shown in Fig. 5. We locate the best position for Label #1 that maximizes the closeness to the corresponding site on the condition that the label can avoid any overlaps and crossings with the existing labels in a greedy fashion.
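Decoding a chromosome into a layout can be sketched as the following greedy loop, assuming the per-label candidate positions are precomputed and sorted best-first as described later in this section; the conflict test is left abstract.

# Sketch of decoding a chromosome into a label layout. candidates[label_id] is
# assumed to be a precomputed list of grid positions, sorted best-first (closest
# to the site for textual labels, best potential sum for image labels).
def decode_layout(chromosome, candidates, conflicts):
    """conflicts(pos, placed) -> True if pos overlaps existing labels or the network."""
    placed = {}            # label_id -> chosen position
    missing = []
    for label_id in chromosome:
        for pos in candidates[label_id]:
            if not conflicts(pos, placed):
                placed[label_id] = pos
                break
        else:
            missing.append(label_id)   # no conflict-free position found
    return placed, missing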
Fig. 5. Greedy search for the best positions of labels in the order of closeness to the sites
Fig. 6. Spatial distribution of the fitness values for placing (a) textual and (b) image labels around the transfer station in circles. White color indicates the better positions.
This way, each textual label can be matched with its corresponding site because they are close enough to each other in general. However, if their distance exceeds the predefined threshold, we need to draw a leader between them to fully clarify the correspondence between the station and its name in the metro map visualization. Such a case can be found in Fig. 3(b). For the image annotation labels, on the other hand, we use a different strategy to find the optimal positions. As described earlier, we already synthesized the potential field in the external zone, which allows us to align the image labels around the corner of the map domain or along its boundaries. For finding the best position for an image label, we compute the sum of the potential values on the square cells that will be covered by the label at each possible position, and then employ the position having the optimal sum while avoiding conflicts with other existing labels on the map. In our framework, we place textual and image annotation labels individually using two different chromosome pools. Actually, we place textual labels first and then image labels, where we try to avoid overlap between the textual and image labels while allowing crossings between the textual labels and leaders connected to the image labels. This is because we can considerably alleviate the visual flickers due to such crossings by assigning different colors to the texts and leaders of the image annotation labels, as shown in Fig. 3(b). Moreover, we also design the arrangement of annotation labels so that all the labels are free of conflicts with the metro network itself. In our implementation, we first compute the list of best possible positions for each textual or image
label in the preprocessing stage, and then explore the optimal position of the annotation label by visiting the list of possible positions from the head to the tail. (See Fig. 5 also.) Note that we limit the size of the list for each label to 500 positions in our implementation, and give up placing the label if we cannot find any conflict-free positions in the list. Figs. 6(a) and (b) show the spatial distribution of the position fitness for textual and image annotation labels, respectively, on the grid square cells around the crossing station node in circles. Here, the white color corresponds to the better positions of the labels. We search for the best conflict-free positions of annotation labels in a greedy fashion according to their types, in the order they appear on the corresponding chromosome sequence. As for the leader of each label, we sample the points on the label boundary and employ the one that minimizes the leader length as the joint between the leader and label.
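The leader attachment described above can be sketched as follows; the boundary sampling step of 4 pixels is our assumption, as the paper does not state how densely the label boundary is sampled.

# Sketch of choosing the leader's attachment point: sample points along the
# label rectangle's boundary and keep the one closest to the site.
import math

def leader_joint(label_rect, site, step=4):
    x, y, w, h = label_rect          # top-left corner, width, height (pixels)
    boundary = []
    for dx in range(0, w + 1, step):
        boundary += [(x + dx, y), (x + dx, y + h)]
    for dy in range(0, h + 1, step):
        boundary += [(x, y + dy), (x + w, y + dy)]
    return min(boundary, key=lambda p: math.dist(p, site))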
4.3 Definition of an Objective Function
For evaluating the goodness of each chromosome, we use the same objective function both for textual and image annotation labels. In our framework, we count the number of the annotation labels that are blocked out from the labeling space by other existing labels, then increase the penalty score accordingly in order to penalize missing labels. We also compute the total sum of the normalized leader lengths so that we can enhance the visual quality of the label layout by minimizing the associated distances between the labels and sites. The actual definition of our objective function can be written as λ1 × fpenalty + λ2 × fleader , where fpenalty and fleader represent the penalties of the missing labels and the sum of the normalized leader lengths, respectively, and λ1 and λ2 indicate the corresponding weight values. We set λ1 = 0.5 and λ2 = 0.5 by default in our implementation. Furthermore, we also assign a priority value to each label in order to make important labels more likely to stay in the metro map. This can be accomplished by just multiplying the penalty scores of each label by the corresponding priority value in the above definition. With this strategy, we can retain the important annotation labels within the labeling space on the finalized metro map.
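A sketch of this objective, with the priority-weighted penalty and the default weights λ1 = λ2 = 0.5, is given below; the leader-length normalization constant is our assumption, since the paper does not state how the lengths are normalized.

# Sketch of the chromosome objective (lower is better): weighted sum of the
# priority-weighted missing-label penalty and the normalized leader lengths.
def objective(placed, missing, sites, priorities, max_leader_len,
              lambda1=0.5, lambda2=0.5):
    # Missing labels are penalized, scaled by their priority values.
    f_penalty = sum(priorities.get(label_id, 1.0) for label_id in missing)
    # Sum of normalized leader lengths between each placed label and its site.
    f_leader = 0.0
    for label_id, pos in placed.items():
        sx, sy = sites[label_id]
        px, py = pos
        leader_len = ((px - sx) ** 2 + (py - sy) ** 2) ** 0.5
        f_leader += min(leader_len / max_leader_len, 1.0)
    return lambda1 * f_penalty + lambda2 * f_leader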
5 Results
Our prototype system has been implemented on a laptop PC with an Intel Core i7 CPU (2.67GHz, 4MB cache) and 8GB RAM, and the source code is written in C++ using the OpenGL library, OpenCV library and GLUT toolkit. The implementation of the genetic-based computation in our system has been based on the GAlib package [20]. Note that all the resulting map images were synthesized at the resolution of 1024 × 768 in this paper.
Fig. 7. (a) Uniform and (b) adaptive label size adjustment for Taipei MRT maps. Important stations are emphasized by enlarging the text and image annotation labels.
Fig. 8. Influence of different weight values λ1 and λ2 on the label layouts over Tokyo subway maps. (a) λ1 = 0.8 and λ2 = 0.2. (b) λ1 = 0.2 and λ2 = 0.8. The number of labels is maximized in (a) while the total sum of leader lengths is minimized in (b).
Fig. 3(a) shows the zone-based segmentation of the Taipei MRT map calculated in our system and Fig. 3(b) presents the finalized layout of textual and image annotation labels on the map. The synthesized map clearly shows that we can aesthetically align image annotation labels around the corner or along the boundaries of the map domain, while textual labels are placed close enough to the corresponding stations. Note here that we assign higher priority to transfer stations and terminals by default in our system, and thus annotation labels are more likely to be applied to these stations. We can also emphasize such important stations explicitly by enlarging the corresponding annotation labels as demonstrated in Fig. 7(b), which successfully draws more attention to some specific stations compared to Fig. 7(a). Moreover, the label layout can be controlled by tweaking the weight values in the objective function for our genetic-based optimization. Fig. 8 shows such an example where we can increase the number of embeddable labels (Fig. 8(a)) or minimize the total leader lengths (Fig. 8(b)). Fig. 9 demonstrates that our system also provides a semantic zoom interface for
Fig. 9. Semantic zoom visualization. Tokyo subway maps at (a) the original scale and (b) finer scale.
Tokyo subway maps. This allows us to inspect an interesting region in more detail without missing the global context by freely changing the position and scale of the map. The computation cost for the label placement depends on the number of stations in the window and their distribution, while our interface basically provides interactive responses.
6 Conclusion
This paper has presented an approach for automatically designing the aesthetic layout of textual and image annotation labels to embed supplemental information into the metro map. The metro map domain is first partitioned into three zones so that we can systematically lay out the textual and image labels by referring to the aesthetic criteria induced from hand-drawn guide maps. Our encoding of label placement was implemented using genetic algorithms to find an optimized layout of both types of labels in an interactive environment. Possible future extensions include persistent placement of important labels across multiple scales for visually plausible map visualization. Optimizing the layout of both the annotation labels and metro networks will also be an interesting research theme. A more sophisticated interface for exploring the annotated metro map content with a variety of labeling box and leader styles remains to be implemented.

Acknowledgements. This work has been partially supported by JSPS under Grants-in-Aid for Scientific Research (B) (20300033 and 21300033), NSC 98-2218-E-009-026-MY3, Taiwan, and NSC 97-2221-E-002-094-MY3, Taiwan.
References

1. Been, K., Nöllenburg, M., Poon, S.H., Wolff, A.: Optimizing active ranges for consistent dynamic map labeling. Computational Geometry: Theory and Applications 43(3), 312–328 (2010)
2. Bekos, M.A., Kaufmann, M., Papadopoulos, D., Symvonis, A.: Combining traditional map labeling with boundary labeling. In: Černá, I., Gyimóthy, T., Hromkovič, J., Jefferey, K., Královič, R., Vukolić, M., Wolf, S. (eds.) SOFSEM 2011. LNCS, vol. 6543, pp. 111–122. Springer, Heidelberg (2011) 3. Bekos, M.A., Kaufmann, M., Symvonis, A., Wolff, A.: Boundary labeling: Models and efficient algorithms for rectangular maps. Computational Geometry: Theory and Applications 36, 215–236 (2007) 4. Böttger, J., Brandes, U., Deussen, O., Ziezold, H.: Map warping for the annotation of metro maps. IEEE Computer Graphics and Applications 28(5), 56–65 (2008) 5. Christensen, J., Marks, J., Shieber, S.: An empirical study of algorithms for point-feature label placement. ACM Trans. Graphics 14(3), 203–232 (1995) 6. Cipriano, G., Gleicher, M.: Text scaffolds for effective surface labeling. IEEE Trans. Visualization and Computer Graphics 14(6), 1675–1682 (2008) 7. Čmolík, L., Bittner, J.: Layout-aware optimization for interactive labeling of 3D models. Computers and Graphics 34, 378–387 (2010) 8. do Nascimento, H.A.D., Eades, P.: User hints for map labeling. Journal of Visual Languages and Computing 19, 39–74 (2008) 9. Götzelmann, T., Hartmann, K., Strothotte, T.: Agent-based annotation of interactive 3D visualizations. In: Butz, A., Fisher, B., Krüger, A., Olivier, P. (eds.) SG 2006. LNCS, vol. 4073, pp. 24–35. Springer, Heidelberg (2006) 10. Hartmann, K., Götzelmann, T., Ali, K., Strothotte, T.: Metrics for functional and aesthetic label layouts. In: Butz, A., Fisher, B., Krüger, A., Olivier, P. (eds.) SG 2005. LNCS, vol. 3638, pp. 115–126. Springer, Heidelberg (2005) 11. Hong, S.H., Merrick, D., do Nascimento, H.A.D.: Automatic visualisation of metro maps. Journal of Visual Language and Computing 17, 203–224 (2006) 12. Lin, C.C.: Crossing-free many-to-one boundary labeling with hyperleaders. In: Proc. IEEE Pacific Visualization Symposium 2010 (PacificVis 2010), pp. 185–192 (2010) 13. Lin, C.C., Kao, H.J., Yen, H.C.: Many-to-one boundary labeling. Journal of Graph Algorithms and Applications 12(3), 319–356 (2008) 14. Lin, C.C., Wu, H.Y., Yen, H.C.: Boundary labeling in text annotation. In: Proc. 13th International Conference on Information Visualization (IV 2009), pp. 110–115 (2009) 15. Maass, S., Döllner, J.: Efficient view management for dynamic annotation placement in virtual landscapes. In: Butz, A., Fisher, B., Krüger, A., Olivier, P. (eds.) SG 2006. LNCS, vol. 4073, pp. 1–12. Springer, Heidelberg (2006) 16. Mote, K.: Fast point-feature label placement for dynamic visualizations. Information Visualization 6, 249–260 (2007) 17. Müller, S., Schödl, A.: A smart algorithm for column chart labeling. In: Butz, A., Fisher, B., Krüger, A., Olivier, P. (eds.) SG 2005. LNCS, vol. 3638, pp. 127–137. Springer, Heidelberg (2005) 18. Nöllenburg, M., Wolff, A.: Drawing and labeling high-quality metro maps by mixed-integer programming. IEEE Transactions on Visualization and Computer Graphics 17(5), 626–641 (2011) 19. Stott, J., Rodgers, P., Martínez-Ovando, J.C., Walker, S.G.: Automatic metro map layout using multicriteria optimization. IEEE Transactions on Visualization and Computer Graphics 17(1), 101–114 (2011) 20. Wall, M.: GAlib: A C++ library of genetic algorithm components, http://lancet.mit.edu/ga/ 21. Wolff, A.: Drawing subway maps: A survey. Informatik - Forschung und Entwicklung 22, 23–44 (2007)
Using Mobile Projection to Support Guitar Learning

Markus Löchtefeld, Sven Gehring, Ralf Jung, and Antonio Krüger

German Research Center for Artificial Intelligence (DFKI)
Stuhlsatzenhausweg 3, 66123 Saarbrücken
{markus.loechtefeld,sven.gehring,ralf.jung,kruger}@dfki.de
Abstract. The guitar is one of the most widespread instruments amongst autodidacts, but even though a huge amount of learning material exists, it is still hard to learn, especially without a guitar teacher. In this paper we propose an Augmented Reality concept that assists guitar students in mastering their instrument using a mobile projector. With the projector mounted onto the headstock of the guitar, it is possible to project instructions directly onto the strings of the guitar. With that, the user can easily see where the fingers have to be placed on the fretboard (fingering) to play a certain chord or a tone sequence correctly.

Keywords: Guitar, Mobile Projection, Learning Interfaces, Projector Phone.
to the fretboard of the guitar frequently. Guitar teachers face the same problem, but they are able to react and give instructions that help overcome the student's individual problems. Guitar-focused online communities such as WholeNote or UltimateGuitar allow remotely located users to exchange feedback and instructions. This can be of increased value, especially for novices. However, the feedback in such communities can often be rude, which could lead to even more frustration and thus achieve the complete opposite of the desired effect.
Fig. 1. Guitar out of the bequest of Franz Schubert with notes carved into the fretboard. (Today in possession of the Haus der Musik, Vienna).
With the increasing miniaturization of projection units, the integration of such units into mobile phones has become possible. These so-called projector phones have the ability to project large-scale information onto any surface. Projector phones can enhance the design space for Augmented Reality (AR) applications tremendously. In this paper we propose an AR system for projector phones that can overcome the problems that autodidactic guitar students have to face. With a projector phone mounted onto the headstock of a guitar, it is possible to project instructions directly onto the appropriate position of the fretboard. The projected information includes fingering and phrasing instructions for chords and melody sequences. The remainder of the paper is structured as follows: At first we give an overview on existing work in the area of music learning interfaces and mobile projection interfaces. After that, we describe our concept to ease guitar learning and lay out the initial user feedback. Finally, we conclude and discuss future work.
2 Related Work

2.1 Music Learning Interfaces

Many HCI approaches for learning a musical instrument exist. Especially for the piano a wide variety of commercially available products as well as research projects exist. Piano learning interfaces range from keyboards with keys that can light up to
indicate what should be played (for example manufactured by Yamaha as well as Casio [3, 20]) to the Moog PianoBar, which is an LED bar that can be attached to any standard piano [11]. Yamaha's Disklavier takes this one step further and actuates the keys of the piano that need to be played as well [2]. The possibility to actuate the keys was picked up by the MusicPath project, which allows piano teaching from remote locations through the connection of two Disklaviers [13]. Both the teacher and the student can see what the other is playing through the actuated keys, and with this it also allows them to communicate the strength used in the keystroke. All the above-mentioned interfaces have one drawback, and that is the lack of information about hand gestures. It is not obvious which finger is used to play which note. This problem was addressed by Xiao and Ishii with MirrorFugue [19], which allows visualizing the hand gestures of a remote collaborative piano player. Prior HCI approaches to alleviate the learning of the guitar mainly focused on using AR displays overlaying a camera image with the instructions on how to play a distinct chord or which notes to play next [4, 9]. These approaches, which are based on optical markers, have the same disadvantages as video lessons: the student sees the instructions in an inverted view and has to switch his view permanently between the display and the guitar. Besides this, the student has to manage to keep the optical marker, which is attached to the guitar, inside the video image. This restricts the student further since he is not free to move the guitar. Even though these markers could be replaced by markerless tracking - since guitar necks normally provide a rich amount of features that could be tracked - the area in which the student can move the guitar around would still be limited. The approach presented in this paper lets the students move their guitar freely, and the instructions that are given are presented directly on the fretboard of the guitar in such a way that the student's focus of attention can stay on the guitar the whole time. The usage of stereo cameras to track the fingers of students was presented by Kerdvibulvech [7]. Burns et al. created a system that tracks the fingering with just a normal Webcam using a circular Hough transformation [1]. Both approaches were able to determine the position and check whether a chord was played correctly, but were not able to give any instructions. These techniques could be integrated into our concept as well. With emerging mobile phones that are able to capture 3D content, such as the LG Optimus 3D [8], it would be possible to provide feedback about fingering as it was done by Kerdvibulvech and to get a more personalized guitar teaching application. There are commercially available guitars that are especially made for guitar novices as well, such as the Yamaha EZ EG [21], which is a MIDI guitar without real strings, where a button, which can be lit up, replaces every note on the guitar neck. By using colour patterns, the students can learn chords and songs. However, the Yamaha EZ EG has several disadvantages. First of all, it is a special guitar that does not provide the flexibility and the feeling of a real guitar. Once the student has learned to master this instrument he has to start again getting used to real strings. Secondly, the guitar is expensive compared to cheaper normal beginner instruments.
Besides the Yamaha EZ EG, there is Fretlight [5], a fretboard with an integrated LED for each note that can be controlled via a computer. Fretlight has several disadvantages: it is not applicable to a standard guitar, the guitar needs to be connected to a computer, and the content for Fretlight is not freely available. In contrast to that, the concept presented in this paper can be used with every guitar without changing
or damaging the instrument, even the student's father's 1959 Gibson Les Paul. In utilizing such an approach, the concept of this paper is similar to the Moog PianoBar, which can also be attached to any normal piano [11]. With emerging projector phones in sight, it would also be a cheap alternative, since the only additional equipment one would need is the headstock mount for the projector phone. Besides playing the correct notes, learning a musical instrument also requires gaining continuous expressivity on the instrument. Johnson et al. used other output modalities to ease the learning of a musical instrument [6]. Their prototype – MusicJacket – was able to give vibro-tactile feedback to the arms to indicate to a novice player how to correctly hold the violin and how to bow in a straight manner. With that, they are able to enhance the expressivity of the student on the instrument. A similar approach for fine-tuning body expression was presented by Ng with i-maestro [14]. By using a motion capture system they created a 3D augmented mirror that gives interactive multimodal feedback on the playing and body pose of the student. A drawback of these two approaches is the need for a large amount of instrumentation of the player. Furthermore, an adaptation of these approaches for guitar players would hardly be feasible since sensors and actuators would need to be attached to the student's fingers, and this most certainly would negatively influence the student's playing.

2.2 Mobile Projection Interfaces

Since battery-powered mobile projection units have become available, more and more research focuses on how these projector units can be utilized in AR application scenarios. Basic research on the augmentation of objects using a hand-held projector was conducted by Raskar et al. with the RFIG-lamps [15]. While they used active RFID-tags, Schöning et al. relied on computer vision approaches to augment a paper map using a hand-held projector phone prototype [17]. Mistry et al. showed how a body-worn camera-projector unit could be utilized in different everyday life situations with their SixthSense application [10]. For example, they augmented products in a supermarket with additional projected information. Projected instructions in general were shown by Rosenthal et al. [16] to be very useful for many tasks. They conducted a study in which they focused on everyday activities like folding paper or modelling a specific sculpture with Play Doh. They found evidence that micro-projected instructions can improve speed and reduce errors for a variety of task components present in manual tasks. In this paper we expand this approach to project instructions on how to play the guitar.
3 Concept

Our approach to ease guitar learning is to project information on the fingering of chords or songs directly onto the fretboard of the guitar using a mobile projector. With the advances in miniaturization of projection technology, mobile projectors can be integrated into a wide variety of devices at a low price point. We propose two different possible AR approaches. The first approach utilizes a projector phone mounted at the
Fig. 2. Concept of using mobile projection for guitar learning - using a projector phone mounted on the headstock (right) or a tablet computer with integrated projector (left)
headstock of the guitar (see Fig. 2 right). The second one is based on a tablet computer with an integrated projector (see Fig. 2 left). The first approach would allow the student to move around freely and, with projector phones becoming ubiquitous, would only require the mount. However, the mount could also constrain this approach to using only part of the fretboard, since the angle between the fretboard and the projector is very low and some frets therefore lie outside the field of projection. This would not be the case when using a tablet with an integrated projector, since the instructions would be projected from the front. However, to enable such a projection, a technique that can track the guitar neck with high precision would first have to be established for the tablet. This again would limit the radius of movement (which is what existing approaches suffer from [4, 9]) to make sure that the guitar's neck stays in the field of projection as well as in the camera image if a computer vision based approach is used. An advantage of the tablet approach would be the ability to show the sheet music on the screen, as well as further instructions or notes.

3.1 Visualization

Regardless of whether a tablet or a phone is used, the most important factor for such a system is the visualization. The guitar allows more versatile phrasing of a note than a piano; especially complex playing techniques like string bendings or Hammer-Ons, which are used often, need to be visualized in a distinctive but easily recognizable way. This is why the visualization should not be limited to the projection of finger positions only, but should rather cover all playing techniques to enrich the expressiveness of the student on the instrument.
Fig. 3. Common visualizations: A C-Major chord diagram (left) and an ASCII tab showing a part of a song (right). While the numbers in the chord diagram indicate which finger should be used, the numbers in the ASCII tab only refer to the fret on which a finger should be placed.

To indicate which finger the student should use to fret a certain note, today's chord diagrams often use numbers (compare Fig. 3 left). The correct fingering is essential for fast and clean guitar playing and therefore a factor that should be taken into account when designing a guitar learning application. Using the wrong finger for a certain chord can lead to slower playing or, even worse, can make a chord variation impossible to play because the finger that should play the variation is blocked. The use of numbers is difficult to realize in the projection, since the space on the fretboard is limited and numbers may be cumbersome to read. Thus, we chose different colours to indicate which finger the student should use; these can be combined with more complex visualizations, e.g. for string bendings. In general, detailed symbols or characters are hardly recognizable on the fretboard. Therefore we propose to use only basic shapes like circles or squares that remain distinguishable when projected onto the fretboard. The following visualizations (Fig. 4) have been developed in collaboration with a guitar teacher and a more advanced guitar student: When a single note has to be played, a coloured dot is projected onto the fret. If more than one note has to be played at the same time, different colours indicate which finger the student should use to play which note. Such information is indeed contained in chord diagrams, but normally not in guitar tabs for a whole song. For an open string, a white dot is projected onto the nut of the guitar. Since musicians typically are more concerned about which notes or chords they have to play next than about what they are playing at the moment, upcoming notes have to be visualized as well. We chose to fade out the colour to visualize this, so that the next notes are the brightest and the following ones fade out slowly (compare Fig. 4 [a]). The student can adjust how many notes are shown and how far ahead of the current beat they are, to adapt the technique to his learning progress.
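A minimal sketch in Python of how such a finger-to-colour mapping and the fading of upcoming notes could be computed; the actual prototype is a Qt application, and the colour values and fade curve here are our own illustrative assumptions, not the authors' exact choices.

    # Hypothetical finger-to-colour mapping; index 0 denotes an open string.
    FINGER_COLOURS = {
        0: (255, 255, 255),   # open string: white dot on the nut
        1: (255, 0, 0),       # index finger
        2: (0, 255, 0),       # middle finger
        3: (0, 0, 255),       # ring finger
        4: (255, 255, 0),     # little finger
    }

    def note_colour(finger, beats_ahead, lookahead=4):
        """Colour of the projected dot for a note `beats_ahead` beats in the
        future: the current note is brightest, later notes fade out."""
        r, g, b = FINGER_COLOURS[finger]
        if beats_ahead >= lookahead:          # outside the adjustable window
            return (0, 0, 0)
        fade = 1.0 - beats_ahead / float(lookahead)
        return (int(r * fade), int(g * fade), int(b * fade))

    # The next note (index finger) is shown at full brightness, the note two
    # beats later at half brightness:
    print(note_colour(1, 0), note_colour(1, 2))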
Fig. 4. Visualizations for different playing techniques

To visualize that a string has to be bent, a triangle is projected onto the fret, and its size indicates the pitch to which the string has to be bent. The direction of the triangle's head indicates whether an up- or down-bending should be performed (Fig. 4 [b]). A slide from one note to another - which is performed on only one string - is indicated with an arrow on the fretboard. The origin of the arrow indicates the note on which the slide starts, and the arrowhead indicates the destination note to which the student has to slide his finger (Fig. 4 [c]). To indicate a Hammer-On or Pull-Off, a dotted arrow is projected, again with the arrowhead indicating the destination note (Fig. 4 [d]). The dotted arrow reflects the movement the player makes, in contrast to the solid arrow of a slide: when performing a Hammer-On or Pull-Off, the notes between the start and the end are skipped, which is reflected by the gaps in the arrow. A finger tremolo on a specific note is visualized through a curled line (Fig. 4 [e]), whereby the curliness of the line can indicate how articulated the tremolo needs to be played. The advantage of the described visualizations is their unambiguity. They are well distinguishable even with a distorted projection. Their form originates from the movements of the guitar player's fingers and from the visualizations used in today's guitar tabs. With that, they are easy to learn and to recognize even for people who are not familiar with the system. The proposed visualization method for single notes is similar to the Yamaha EZ-EG or Fretlight, but these systems are not able to
visualize more complex playing techniques, and furthermore they give no information about which finger the student should use, which is important especially for novices.

3.2 Input Techniques

One problem of video lessons is that when a student wants to figure out a certain part, he has to repeat this part of the video over and over again. To control the video, he therefore has to take his hands away from the guitar. The Yamaha EZ-EG and Fretlight suffer from the same problem. We therefore propose to integrate different input techniques to control the projection without the need to remove the hands from the guitar. Three different modalities would be feasible: speech, gesture and sound input. Speech input is a reliable technology on today's mobile phones, but it usually requires pressing a button to trigger the recording. To control the projection, short commands like “rewind” or “play” would be enough, but continuous recording and processing of audio data would be needed. With applications like AmpliTube, which allows using the iPhone as an amplifier for an electric guitar, another possible input technique would be to detect whether the student plays a predefined sequence of notes. But this again would require continuous recording and processing. The third feasible technique would be gesture recognition. Most phones today contain sensors like accelerometers or gyroscopes, with which gesture detection is easily possible. When a projector phone is attached to the headstock of the guitar, swings of the guitar neck could be interpreted as gestures. This approach seems to be the most promising, since the computational overhead is comparably small and playing movements normally are not that fast, so that an acceleration threshold can be used to distinguish the normal small movements from an intentional gesture.
4 Prototype

We created a prototype of the described concept using an AAXA L1.v2 laser projector mounted on the headstock of an Epiphone SG guitar (see Fig. 5). The mount consists of a Joby Gorillapod that was fixed to the neck using cable straps; the laser projector was mounted on top of the Gorillapod. With this mount, all guitar tuners remain accessible and fully functional, while at the same time the projector is easily adjustable. The price for the mount - which is the only additional part that would be needed with projector phones becoming ubiquitous - was under €20 and would be even much cheaper in mass production. The projector weighs 122 g, and including the mount the whole prototype weighs 210 g. Mounted to the headstock, it has no adverse effect on the playability of the guitar, even though it adds a little extra weight. To control the projection, we implemented a Qt application running on an Apple MacBook. The application is capable of projecting 25 different chords and is also able to read tabs for complete songs in ASCII format (see Fig. 3 right) and project the notes onto the fretboard. When projecting a complete song, the tempo in which the notes are shown can be adjusted individually to the learning speed of the student.
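To illustrate this song-projection step, the following Python sketch parses a block of single-digit ASCII tab lines (as in Fig. 3 right) into timed note events. It is a simplified stand-in for the Qt implementation and ignores multi-digit frets and technique annotations.

    def parse_ascii_tab(lines):
        """lines: six strings like 'e|---0---3---|', ordered high e to low E.
        Returns a sorted list of (time_step, string_index, fret) tuples."""
        events = []
        for string_index, line in enumerate(lines):
            body = line.split('|', 1)[1]          # drop the tuning label
            for time_step, ch in enumerate(body):
                if ch.isdigit():                   # single-digit frets only
                    events.append((time_step, string_index, int(ch)))
        return sorted(events)

    tab = ["e|---0---3---|",
           "B|---1---0---|",
           "G|---0---0---|",
           "D|---2---0---|",
           "A|---3---2---|",
           "E|-------3---|"]
    for event in parse_ascii_tab(tab):
        print(event)    # e.g. (3, 0, 0): open high e string at time step 3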
Fig. 5. The prototype consisting of a Gorillapod and an AAXA L1.v2 Laser Projector. On the back of the headstock (not visible in the image) a Phidget Accelerometer was attached.
Fig. 6. The prototype projecting a chord, C-major (left) and G-major (right). Beneath are the corresponding standard chord diagrams.

Unfortunately, ASCII tabs contain no information about the correct fingering; therefore we will use another standard in the next iteration. We chose Qt for our implementation since it allows porting the application to mobile devices running Symbian or Maemo/MeeGo. Up to now, however, no Qt-enabled device is able to show different content on the device's screen and on the TV-out. When the application gets ported to a mobile device, the screen of the projector phone will show additional instructions. The alignment of the projected image to the fretboard was done manually; in future implementations we aim for automatic vision-based recognition of the fretboard and automatic alignment. There is also no correction of the distortion of the projection, which means that at the higher end of the fretboard the projected symbols are more distorted than at the lower frets, which are closer to the projector.
Nevertheless, with the chosen visualization techniques the different symbols are still easily distinguishable when projected onto the fretboard. For the gestural input, we attached a Phidget accelerometer to the headstock of the guitar. For the recognition of the gestures we used the $1 Unistroke Recognizer by Wobbrock et al. [18]. The recording of a gesture starts when the acceleration of the headstock exceeds a certain threshold. Of the three-dimensional data that the accelerometer provides, only two dimensions are used, since movement along the axis of the guitar neck is not feasible when playing the guitar seated. The two remaining axes are mapped to the x- and y-axis of the $1 Unistroke Recognizer.
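A minimal Python sketch of this capture logic follows. The threshold values and the choice of which axis runs along the neck are assumptions, and recognize() is a placeholder for the $1 Unistroke Recognizer rather than its published implementation.

    ACCEL_THRESHOLD = 1.5   # assumed value (in g) that starts gesture capture
    REST_THRESHOLD = 0.2    # assumed value below which the gesture has ended

    class HeadstockGestureInput:
        def __init__(self, recognize):
            self.recognize = recognize   # e.g. a wrapper around a $1 recognizer
            self.recording = False
            self.points = []

        def on_sample(self, ax, ay, az):
            """Called for every accelerometer sample; the axis along the guitar
            neck (assumed to be az here) is discarded, and ax/ay are mapped to
            the recognizer's x-/y-plane."""
            magnitude = (ax * ax + ay * ay) ** 0.5
            if not self.recording and magnitude > ACCEL_THRESHOLD:
                self.recording = True
                self.points = []
            if self.recording:
                self.points.append((ax, ay))
                if magnitude < REST_THRESHOLD and len(self.points) > 10:
                    self.recording = False
                    return self.recognize(self.points)   # e.g. "rewind"
            return None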
5 Initial User Feedback

The visualizations for the different notes and playing techniques were demonstrated to two advanced guitar players and one guitar novice. They rated the visualization as straightforward and easy to learn. First tests with the prototype showed that the different shapes and colours are easily distinguishable and attributable to the strings they belong to (compare Fig. 6). The mount also proved robust and stable enough to keep the projection aligned with the fretboard even when the guitar was moved heavily. The only issue the testers mentioned was that they were not able to determine which note to play when the projection was blocked by their hand, which happened when the note lay behind the hand. With the Yamaha EZ EG and Fretlight the problem is analogous, because the fingers cover the light that shows the position of the note on the fretboard. When using a tablet with an integrated projector and projecting from the front onto the fretboard, the light would only be blocked if the angle between the tablet and the guitar were very steep. Otherwise, the instructions would simply be projected onto the fingers and the student could estimate the exact position. Still, all testers rated this as a minor problem.
6 Conclusion and Future Work

In this paper we presented a concept that eases guitar learning using mobile projection. The advantage over existing approaches lies in the easy-to-understand visualizations, which also give information about the correct fingering of the notes - essential for fast and correct playing. With projector phones becoming ubiquitous, this system could become a cheap and powerful alternative to existing approaches. For future work we want to carry out extensive user studies testing different visualizations for different fingers and strings as well as for upcoming notes. Furthermore, a study comparing the proposed concept to video lessons and Fretlight will be conducted. Additionally, we want to integrate the microphone of the mobile phone to automatically detect whether the student played the right note and thus be able to give immediate feedback. Furthermore, the calibration for the fretboard should be done automatically, using the camera of the mobile device to detect the frets. In addition, we want the concept to be usable for remote guitar teaching in connection with a camera-based finger detection system, e.g. [1]. When rehearsing a song, the synchronization of two or more projectors could be useful, for example when one
guitarist is playing the rhythm part and the other plays the lead part of the song, as is accomplished for the piano with MusicPath [13]. The success of video games like Guitar Hero also indicates that a game with a similar design could be a promising use case for this concept as well.
References

1. Burns, A., Wanderley, M.: Visual methods for the retrieval of guitarist fingering. In: Proceedings of the 2006 Conference on New Interfaces for Musical Expression (NIME 2006). IRCAM, Centre Pompidou, Paris, France (2006)
2. Yamaha Disklavier, http://uk.yamaha.com/en/products/musical-instruments/keyboards/disklaviers/ (last visited March 24, 2011)
3. Casio LK-230F5, http://www.casio.co.uk/Products/Musical_Instruments/Digital_Keyboards/Keylighting/LK-230F5/At_a_Glance/ (last visited March 24, 2011)
4. Cakmakci, O., Berard, F., Coutaz, J.: An Augmented Reality Based Learning Assistant for Electric Bass Guitar. In: Tenth International Conference on Human-Computer Interaction, HCI 2003, Rome, Italy (2003)
5. Fretlight, http://fretlight.com/ (last visited March 24, 2011)
6. Johnson, R., van der Linden, J., Rogers, Y.: MusicJacket: the efficacy of real-time vibrotactile feedback for learning to play the violin. In: Proceedings of the 28th International Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA 2010). ACM, New York (2010)
7. Kerdvibulvech, C., Saito, H.: Vision-based guitarist fingering tracking using a Bayesian classifier and particle filters. In: Mery, D., Rueda, L. (eds.) PSIVT 2007. LNCS, vol. 4872. Springer, Heidelberg (2007)
8. LG Optimus 3D, http://www.lg.com/uk/mobile-phones/all-lg-phones/LGandroid-mobile-phone-P920.jsp (last visited March 24, 2011)
9. Liarokapis, F.: Augmented Reality Scenarios for Guitar Learning. In: Third International Conference on Eurographics UK Theory and Practice of Computer Graphics, Canterbury, UK (2005)
10. Mistry, P., Maes, P., Chang, L.: WUW - wear Ur world: a wearable gestural interface. In: Proceedings of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems (CHI 2009). ACM, New York (2009)
11. Moog PianoBar, http://www.moogmusic.com/newsarch.php?cat_id=24 (last visited March 24, 2011)
12. Motokawa, Y., Saito, H.: Support system for guitar playing using augmented reality display. In: Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2006). IEEE Computer Society, Washington, DC, USA (2006)
13. MusicPath: Networking People and Music, http://musicpath.acadiau.ca/main.htm
14. Ng, K.: Interactive feedbacks with visualisation and sonification for technology-enhanced learning for music performance. In: Proceedings of the 26th Annual ACM International Conference on Design of Communication (SIGDOC 2008). ACM, New York (2008)
15. Raskar, R., Beardsley, P., van Baar, J., Wang, Y., Dietz, P., Lee, J., Leigh, D., Willwacher, T.: RFIG lamps: interacting with a self-describing world via photosensing wireless tags and projectors. In: ACM SIGGRAPH 2005 Courses (SIGGRAPH 2005). ACM, New York (2005)
16. Rosenthal, S., Kane, S.K., Wobbrock, J.O., Avrahami, D.: Augmenting on-screen instructions with micro-projected guides: when it works, and when it fails. In: Proceedings of the 12th ACM International Conference on Ubiquitous Computing (Ubicomp 2010). ACM, New York (2010)
17. Schöning, J., Rohs, M., Kratz, S., Löchtefeld, M., Krüger, A.: Map torchlight: a mobile augmented reality camera projector unit. In: Proceedings of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems (CHI 2009). ACM, New York (2009)
18. Wobbrock, J.O., Wilson, A.D., Li, Y.: Gestures without libraries, toolkits or training: a $1 recognizer for user interface prototypes. In: Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology (UIST 2007). ACM, New York (2007)
19. Xiao, X., Ishii, H.: MirrorFugue: communicating hand gesture in remote piano collaboration. In: Proceedings of the Fifth International Conference on Tangible, Embedded, and Embodied Interaction (TEI 2011). ACM, New York (2011)
20. Yamaha EZ-200, http://uk.yamaha.com/en/products/musicalinstruments/keyboards/digitalkeyboards/portable_keyboards/ez-200/?mode=model (last visited March 24, 2011)
21. Yamaha EZ-EG, http://usa.yamaha.com/products/musicalinstruments/entertainment/lighted_key_fret_instruments/ez_series/ez-eg/?mode=model (last visited March 24, 2011)
Don't Duck Your Head! Notes on Audience Experience in a Participatory Performance

Gesa Friederichs-Büttner

Technologie-Zentrum Informatik und Informationstechnik (TZI), Bibliothekstr. 1, 28359 Bremen, Germany
[email protected]
http://dm.tzi.de/
Abstract. By introducing the transdisciplinary political dance production Parcival XX-XI and exemplifying two participatory scenarios from the play, we discuss the audience's appreciation of interactive digital media usage within the traditional frame of theatre. In this context, we developed a short guided interview to be conducted with members of the audience after each performance, planned as an on-going evaluation. Based on 15 interviews, we present four reasons why the audience tends to (not) duck the head when asked to participate in Parcival XX-XI: fear, fun, frustration and schadenfreude. Keywords: Participatory Performance; Interaction; Gesamtkunstwerk; Theater; Digital Media; Dance; Dramaturgy; Evaluation; Experience; Audience; Design.
1 Introduction
When theatrical performances meet digital media, we can nowadays think of more than merely displaying video sequences as background scenery: performers can combine their actions on stage with digital content as they express themselves with and through technology-based interfaces, capturing e.g. a person's movement [1][2][3]. Simultaneously, audio-visuals can be output according to individual sets of rules. In the transdisciplinary political dance production Parcival XX-XI, we take this opportunity and invite the audience to participate actively in the creation of the play. In this way, we give away part of the authorship to the formerly passive spectators. We, the artists, believe that digital media opens up manifold opportunities on stage. But how does the audience experience theatre when it suddenly has to leave the expected frame of it? In order to get a sense of how the visitors of Parcival XX-XI experienced the use of digital media in the play, we designed a short guided interview to be conducted with members of the audience after each show. This qualitative research is planned as an on-going evaluation. At the point of writing, we base
the results on 15 interviews, which so far suggest the following reasons why the audience tends to (not) duck the head when asked to participate in Parcival XX-XI: fear, fun, frustration and schadenfreude. Before we talk in detail about insights gained from the interviews, we give a short note on the artistic aim of Parcival XX-XI. This is followed by the course of action - from a practical point of view - and the exemplification of interactive media elements in the play.

1.1 Artistic Aim
As described in [7], “Parcival XX-XI is a transdisciplinary political dance production by the company urbanReflects and the University of Bremen. We aim towards a new mode of storytelling on stage, combining aesthetic possibilities of contemporary dance and digital media with a conceptual, non-literarised narrative, which raises socially relevant questions. The myth of the Holy Grail provides the narrative; we are in search of a better world. In the course of human history, homo sapiens buried different promises of salvation – communism with its postulate of equality on earth and fascism with its conception of world supremacy of the superrace. We also decode the new capitalism as ‘wrong grail’ and ask: Where are our visions for a more human world order? By the use of digital media, we take the recipient on our interactive search for a better future. On the interface of video projection and the live dancer, we create images which show facets of a frightening but also hopeful and visionary transformation of the search for the Holy Grail into the 21st century.”

1.2 The Course of Action
After nine months of planning and rehearsing, Parcival XX-XI celebrated its premiere in February 2011. So far it has been shown 7 times in Freiburg and Bremen; further performances will follow throughout the year. After each performance, interviews with members of the audience are conducted to allow an evaluation on a rolling basis. As the findings increase our understanding of the audience's appreciation of the use of digital media in Parcival XX-XI, we will modify the setting. In this paper, however, we will not discuss results concerning the use of non-interactive digital media elements, as we particularly seek design strategies for participative moments in theatre using interactive digital media elements.

1.3 Some Duck The Head. Some Do Not!
Talking about interactive and participatory moments, we briefly describe related research concerning the spectator's experience of user interaction and the differentiation of audience members into people that 'duck the head', i.e. do not want to participate in the creation of the play and stay in the passive position, and people that do not 'duck the head' and participate. Reeves et al. discuss in [5] how a spectator should experience a user's interaction with a computer. They introduce a framework of design strategies for public
interfaces, distinguishing between the performer as the primary user of an interface and spectators as secondary users - those who watch 'the other' interacting. In Parcival XX-XI, only a limited number of audience members could be invited to participate. Thus, many members of the audience were kept in the position of the spectator and watched the formerly passive spectator become an active actor - or user - on stage. As a result, a differentiation between these states is relevant for our research and has to be considered when designing interactive moments within a participatory performance. With Parcival XX-XI, we break out of the very strict theatrical outlines by inviting the audience to take part in the creation of their own experience. By pointing out that there is a difference between an active audience and a passive one, we agree with Sheridan et al., who proposed in [4] the differentiation between performer, participant and spectator by ascribing them different levels of knowledge about the performance frame. According to Sheridan et al., 'participants' step into the performance frame and are willing to learn which actions will have effects on the system. As soon as they are technically aware of this and start to interpret their actions, thus creating meaning not only for the system but also for the dramaturgy, they become 'performers'. As we move within the frame of theatre, we do not include bystanders in our considerations. Furthermore, we will not provide an in-depth discussion of the roles of the six dancers in the play, who all have an established knowledge of the performance frame and can thus be specified as 'performers'. Instead, we focus on the audience, being either a passive spectator or becoming an active participant - and at best a performer - in the creation of the play.

1.4 Interactive Media in Parcival XX-XI
In the following, we explain the technology used for creating participatory moments in Parcival XX-XI, followed by 'the rules of the game' and, with the help of two scenarios, the exemplification of interactive media elements. Why not hide it? In this state of Parcival XX-XI, we decided to use Nintendo Wiimote controllers as the 'tool' to allow interaction between the audience and digital visuals. As we create participatory moments, we give away part of the authorship to the audience. But, dramaturgically, we only allow these acting people to navigate within a restricted set of options. As described in [7], “we draw an analogy to the closed system ’theatre’ here which never leaves the mode of representation and therefore always prepares a performing frame. We furthermore understand society as a closed system, which provides for its citizens a kind of performing frame and thus only limited freedom of action.” In this way, the members of the audience are again only 'passive' parts within the system we created. They consume! And this is what you do when using a Nintendo Wiimote controller. Instead of placing and hiding the controllers within whatever prop they might fit into, we play offensively with their aesthetics, creating figures like a 'Wii fairy' - a 'performer' who appears once in a while on stage, encouraging spectators to step into the performance frame and helping participants to
Fig. 1. Near the bar and entrance area of the theatre, the audience was invited to familiarize themselves with the use of the Nintendo Wiimote controller
understand the effect of their actions. Last but not least, the 'Wii fairy' hands out and collects the Nintendo Wiimote controllers. To be understood dramaturgically, we had to ensure that the interaction with the tool and its effect on the play were clear. Thus, we did not want to leave the audience in the dark on how to use these controllers and designed a 'pre-performance' to give everyone the chance to become a 'participant' before the actual performance started. In addition, together with the entrance ticket, everyone got a sheet of paper with the 'rules of the game'. We left everyone in the entrance area for about 20 minutes to take part in the 'pre-performance'. This event consisted of one 'Wii fairy', one dancer on a diagonal wall, projections and all the visitors. Every 5 minutes, a jingle started, saying 'it's time to intervene'. In these moments, the 'Wii fairy' began inviting everyone to use the controller after her, encouraging and helping in case of doubts, troubles and misunderstandings - all in a non-verbal way. Her goal was to 'teach' the two gestures used for the interactive moments within Parcival XX-XI. This procedure was repeated several times.

The 'rules of the game':
1. To carry a Wiimote controller means to carry the responsibility.
2. Wiimote controllers are only allowed to be put on the Wii fairy's silvery tablet or to be given into someone else's hands.
3. As soon as you hear a jingle saying “it's time to intervene”, you shall go on stage to 'play'.
4. Gesture 1: Holding the controller still and upright in front of your body. Gesture 2: Banging the controller down.
Fig. 2. In this scenario, four audience members are in charge of dressing (and undressing) one person each. The clothing items are projected on the dancers who improvise their movements according to it.
For the 'pre-performance', gesture 1 caused one clothing item to appear on the diagonal wall. The dancer on it improvised her movements according to the items' appearance; the more items appeared, the more ambitious it became for the dancer to make the clothes fit her. Gesture 2 caused the disappearance of all items and thus a new start - and probably a relief for the dancer. These two simple gestures were also used during Parcival XX-XI, mapped to different semantics. In the following, we introduce two scenarios of Parcival XX-XI in which members of the audience could participate in the play, interacting via Nintendo Wiimote controllers with digital visuals, using the introduced gestures. Scenario 1. In this first scenario, four audience members are invited on stage. Each of them is in charge of dressing one person (and undressing someone else). The problem: in total there are only three clothing items but four people. As a result, one person will always stay naked. The two gestures allow the participant either to steal one item (banging the Nintendo Wiimote controller) or to hold on to their own dress and keep it (holding the controller still and upright in front of the body), so that nobody else can steal it. Or one does nothing. We ask: Is there a way to escape from being jointly guilty? Scenario 2. In this second scenario, three audience members are invited on stage. Each of them is in charge of controlling one avatar. All avatars fight against dancers. The two gestures allow the participant either to attack the dancer or to defend him- or herself. As described in [7], the problem is that “although the four participants control their virtual avatar, the general role they have to take is pre-determined by the play: It is not the participants choice, on what side they are fighting. We suggest perspectives for considerations but without imposing a solution.”
Fig. 3. In this scenario, three audience members are in charge of controlling one avatar. Each avatar fights against one dancer. The dancers improvise their movements according to what the avatar does.
After this short overview of what Parcival XX-XI is about, we now go on to describe first results of interviews conducted with members of the audience.
2 Qualitative Research – Interviewing the Audience
All interviews were held in German. They comprise three short questions and are designed as loose guidelines to be asked and recorded after each performance. As these questions were translated by the author for this paper, so were the answers quoted in the following.
1. Which aspects especially caught your eye?
2. How did you perceive the use of digital media?
3. How would you rate the use of the Wiimote controller?
After the first three performances, 15 people (6 females and 9 males) between 26 and 68 years old were interviewed. For statistical purposes, name, sex, age and occupation were also collected. Each interview took between 5 and 20 minutes. As we are still at the beginning of the evaluation, we start here by outlining recurring statements and meanings from the interviews. We draw first conclusions from them in order to better understand the audience's appreciation of participatory moments in Parcival XX-XI using interactive digital media.

2.1 Fear
“I used to look away when she came around to find participants. And it worked out. I did not need to do it. She respected it. (. . . ).”
The person talks about the 'Wii fairy' who appeared on stage as soon as the jingle 'it's time to intervene' started, thus right before a participatory scene. We see here that fear and/or self-effacement is one hurdle in participatory performances. Thinking within the frame of theatre - which often implies a stage and an audience sitting in front of it - many visitors are intimidated by the prospect of going into the centre of action, the stage, and thus becoming a 'participant'. They tend to hide, duck their heads and feel relieved when it does not hit them. Bringing a 'Wii fairy' to life, making the audience learn how to use the tool beforehand, and handing out 'rules of the game' demonstrate our encouragement of active participation in the play; but still, we respected everyone who did not want to join. Issues like being afraid of making a fool of oneself might be one reason for this - a phenomenon I have also often observed in connection with interactive video installations in exposed locations. Further, some interviewees mentioned some sort of fear of technology and the side effect of doing it 'wrong'. “I am very happy that I did not need to act. (. . . ) I would not have known what to do. (. . . ) But then I saw what they do on stage and thought that it's like a child's play. (. . . ) Still, probably I would have pressed the wrong button. But I was not afraid of being on stage.” “I think, everyone is afraid of going on a stage, (. . . ) you have to do it especially good, and everyone is afraid of that.”

2.2 Being the Other – Fun and Frustration
As a 'participant', we strive for immediate understanding. If we cannot relate our actions to the context of the on-going performance - that is, if the interaction is not self-explanatory - it is not a long way to frustration. In this case, the interface designer and the producers of the piece have failed in conveying the motif of why interactive and participatory moments are used in the play. The 'participant' is stuck and will not create meaning for herself or himself and the play. Similarly, the people quoted below notice that in such cases interactivity in a technical sense might still be achieved, as the audience understands how to use the tool. But, waiting for an eye-opener, one loses the connection to the 'narrative' or purpose of the scene, and thus does not become more engaged with the play but rather feels disconnected and frustrated. Still, a few participants liked being on the stage, looking at the whole from a different perspective and enjoying the moment of using the tool as such. Whether this is the purpose of the piece is a different question, but as we look at the audience's appreciation of digital media usage in Parcival XX-XI, we should not ignore the fact that some have fun just because of 'using' digital media or being on a stage. “It is very difficult to immediately understand the effect of my action in this very spur-of-the-moment of moving the controller (. . . ) and the understanding should be immediate, because this is what makes it interactive.” “I knew how to handle the controller and what the gestures were but not what really happened then.” “I had fun being on stage and interacting with these figures but I did not quite understand if I did it right.”
After all, rather hide it? As discussed in 1.4, our team decided on an offensive presentation of the Nintendo Wiimote controller in the play. In this way, the technology becomes a present player, not only in its function but also in its meaning. One interviewee explains that exposing the controller gives an extra hint for understanding its role within the play: the theatre goer as a consumer within a fixed set of rules. “Some people in my circle of friends who all work with media, probably would have said ’again, a Wii-project’, but for amateurs, it is a game-controller, first of all. If you want to highlight this fact, then it is good to use it as it was. In this way, the theatre goer becomes a consumer. He or she consumes as he or she interacts. And to look at the controller helps to understand your role in the play.” Some comments from the audience reveal different opinions. They suggested hiding the technology in some sort of prop, as was also realized in [8]. “I believe that a Wiimote controller is not the right tool. If you connected it to some sort of laser-gun, everyone would have known how to deal with it – without the need to explain it. Everyone knows how to use a gun from watching TV.” “You could have taken a piece of wood and hide the Wiimote controller in it (. . . ).” “If you only wanted to capture the movement of the participants, I would have liked a more decent tool.” The question of whether exposing the Nintendo Wiimote controller is a 'good idea' cannot be answered at this point, but it points to the need for further investigation and tests. For us, it is of interest whether we can increase the appreciation of this 'tool' by increasing the understanding of the relation between the tool and its dramaturgical meaning for the play. To do so, we aim to change the gestures and find ways to better 'tell' the motif of our participatory moments. The general question of whether this tool should be replaced by something else might follow. The opportunities are manifold: capturing movement is only one way to interconnect a person with digital audio-visuals and can, e.g., also be done with 3D camera tracking. Here, a comparison between different techniques (e.g. Nintendo Wiimote controller versus Microsoft Kinect 3D motion controller) should be considered.

2.3 Watching the Other – Schadenfreude and Frustration
As a spectator in a participatory performance, we also aim to understand the participant's/performer's effect on the context of the on-going performance. If we cannot follow what is happening on stage, we again move toward frustration. This points to the fact that interaction and participation within a theatrical performance have to be designed by considering not only the perspective of the user - the participant/performer - but also that of the spectator.
“During this one scene, it felt like being at a tennis match. You watched the audience, or the projections, and missed the action of the dancer or the other way around. You never could follow all nine things. (. . . ) With nine things, I mean 3 dancers, 3 projections and 3 out of the audience”. “I did not know who with who (. . . )”. “There was going on a lot and I could not follow.” These people commented on scenario 2, in which 3 dancers, 3 members of the audience and 3 projections interacted with each other. To a certain degree, we find a logistical problem here, as eyes cannot watch it all at the same time. But we can also assume problems in understanding the overall context of interaction. In this case, it was not necessary to follow all the actions taking place on stage. Each participant/performer is supposed to focus on controlling one avatar, who fights against the dancer on his or her side. The three avatars do not relate to each other but behave independently. If the participant/performer tries to follow everyone on stage, disorientation and loss of context can hardly be avoided. If we find that this is frequently the case, we have to reconsider how to get the participant/performer fully involved in their 'own little world', the specific area on stage between avatar and dancer. In scenario 1, on the contrary, the purpose of participation was a different one, and the projections on stage relate not only to the actions of the participants/performers but also to each other. Here all participants/performers shall understand the entire scenery, including the relationship between all levels of interaction. Taking the next comment into account, which says that being in the position of the spectator opens up a better view on the complete image than participating yourself, we might have asked for too much and have to consider re-designing certain aspects of the scene. “When I only watched the interaction of the others, I better followed the complete image. It was much more thrilling than performing and giving impulses myself. I would not have expected it but after the event, I can say it.” One rather surprising result of the interviews for us is that many people greatly enjoyed observing the new participants and searched their behaviour for discomfort, happiness and so on. This gave us the impression that some people were filled with schadenfreude because they had successfully ducked their heads. “I really liked it to look at the people (. . . ), pulled out of the audience's ease and comfort. Suddenly they had to join in and do something.” “It is interesting to observe how differently one behaves on stage. Some totally liked it, some were more shy and introverted, they seemed to be careful as if they were afraid of doing something wrong. This all tells a lot of each individual person there.” “I found it amusing to look at the participatory scenes.” “(. . . ) what I really liked was to look at the others”.
3 Conclusion and Future Work
We believe that, as long as a participatory performance requires the audience to step away from their chairs up to a stage, we will not get everyone to 'play' with us: not all of the audience is willing to leave its passive spectatorship. Perhaps this is one of the traditional characteristics, amongst many others, of the 'frame of the theatre' and, as such, appreciated. Within this frame and for 'good interactivity', we should always consider the situation, wishes and expectations of the audience. Still, we can try to take away as many burdens as possible from 'potential participants'. At the same time, we have to design the play for the audience who will stay in their chairs. Thus, we have to plan a participatory play from more than one perspective. In our on-going evaluation, we will continue to search for the audience's 'biggest burdens' to participation in our play and try to define overall design methods for overcoming them. Strategies to catch the majority of the audience could be to invite not single people but groups, or even everyone, to the stage. Furthermore, one can create scenarios in which all interact from their chairs, e.g. with some sort of TED in their hands and, at best, anonymously. But in our case, we aimed to create moments of interaction between the audience and not only the digital visuals but also the dancers. In this way, we created a tripartite interaction (cf. [6]), which comes with extra challenges for implementing the gesamtkunstwerk of Parcival XX-XI. As reported in previous research [7], the producers shall not only consider how to make an interface easy to use but also make it better understandable, especially in its motif concerning the dramaturgical meaning for the play. Going through the material of the first interviews and taking into account various discussions with friends, colleagues and so on, we have to admit that we partly failed in communicating this motif and thus did not entirely fulfill our goal of making the participatory moments within Parcival XX-XI as essential to the piece as the dance or the non-participatory digital media elements. Still, to reach our goal of allowing spectators to become not only participants but also performers in the play, we have to adapt and change certain aspects of Parcival XX-XI. Only if we, as the producers, play with 'performers' can we assume that the audience understands the ambiguity of the dramaturgy (cf. also [9]), and then we can settle back. We close the paper with a comment from a person who attended the play twice - once passively and once actively. “It is definitely different to be part of it as an active participant. If you are passive, you sit down. It is more comfortable. If you stand, you are in the center, and to be in the focus, is a bit weird, (. . . ) with your back to the audience. It is slightly strange and embarrassing but also exciting. To do it and to see how the dancers react to your action.”
Acknowledgments. This work was funded by the Klaus Tschira Stiftung. We would further like to acknowledge the support of the Ministry for Science, Research and the Arts and the Federal Cultural Foundation of Germany, the Landesverband Freier Theater Baden-Württemberg, Sparkasse Freiburg, FOND Darstellende Künste e.V., Landesbank Baden-Württemberg LBBW, the Cultural Office Freiburg and the Senate of Bremen.
References

1. Broadhurst, S.: Troika Ranch: Making New Connections. A Deleuzian Approach to Performance and Technology. Performance Research 13(1), 109–117 (2008)
2. Morrison, A., Andersson, G., Brečević, R., Skjulstad, S.: Disquiet in the plasma. Digital Creativity 20(1&2), 3–20 (2009)
3. Wu, Q., Boulanger, P., Lazakevich, M., Taylor, R.: A real-time performance system for virtual theatre. In: Proceedings of the 2010 ACM Workshop on Surreal Media and Virtual Cloning, pp. 3–8 (2010)
4. Sheridan, J., Bryan-Kinns, N., Bayless, A.: Encouraging Witting Participation and Performance in Digital Live Art. In: 21st British HCI Group Annual Conference, pp. 13–23 (2007)
5. Reeves, S., Benford, S., O'Malley, C., Fraser, M.: Designing the Spectator Experience. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 741–750 (2005)
6. Sheridan, J., Dix, A., Lock, S., Bayliss, A.: Understanding Interaction in Ubiquitous Guerrilla Performances in Playful Arenas. In: Fincher, S., Markopoulos, P., Moore, D., Ruddle, R. (eds.) People and Computers XVIII - Design for Life: Proceedings of HCI 2004, pp. 3–18 (2004)
7. Friederichs-Büttner, G., Dangel, J., Walther-Franks, B.: Interaction and Participation – Digital Media and Dance in Interplay. To be published: IMAC 2011, Copenhagen (2011)
8. Taylor, R., Boulanger, P., Olivier, P.: dream.Medusa: A Participatory Performance. In: Proceedings of the 8th International Symposium on Smart Graphics, pp. 200–206 (2008)
9. Gaver, W., Beaver, J., Benford, S.: Ambiguity as a Resource for Design. In: Proceedings of the Conference on Human Factors in Computing Systems, pp. 233–240 (2003)
CorpusExplorer: Supporting a Deeper Understanding of Linguistic Corpora

Andrés Esteban and Roberto Therón

University of Salamanca, Spain
[email protected], [email protected]
Abstract. Word trees are a common way of representing frequency information obtained by analyzing natural language data. This article explores their usage and possibilities, and addresses the development of an application to visualize the relative frequencies of 2-grams and 3-grams in Google’s ”English One Million” corpus using a two-sided word tree and sparklines to show usage trends through time. It also discusses how the raw data was processed and trimmed to speed up access to it. Keywords: visual analysis, linguistic corpora.
1 Introduction
While computers have been applied to the analysis of natural language for years, the increase in the amount of relevant data calls for new or modified visualization techniques and tools that allow the exploration of all this information. There have been numerous attempts at natural language visualization since the mid 1990s, but the field is still constantly developing. One of the problems that we feel has yet to be explored in detail is the representation of n-gram (group of n words) relative frequency information inside a text corpus (a collection of texts). These data sets are made up of records which show the number of times a certain n-gram is found, along with any additional information such as temporal or geographical details. There have been few attempts aimed at visualizing this type of data, with the one notable exception being Word Trees [7], but they have mostly been restricted to exploring suffix information, failing to give the deeper insight that a linguist might need to identify interesting word usage patterns, although recently Double Trees [2] have been used in an attempt to cover this gap.

1.1 Word Trees
As defined by the IBM ManyEyes website (http://www-958.ibm.com/software/data/cognos/manyeyes/page/Word_Tree.html), which is where Word Trees were first made available, a word tree “is a visual search tool for unstructured text, such as a book, article, speech or poem” [7].
Their main disadvantage is possibly that the tree is only developed towards one side, so they can only represent prefix or suffix information, while linguists are usually interested in both at the same time and in many cases may also need simultaneous access to 3-gram information. Multiple-level word trees somewhat address this problem, but in practice they only extend to prefix or suffix information of n-grams for n larger than 2. Two-sided word trees are very similar, but they feature a central word with a word tree on each of its sides. We believe they are better suited to the problem at hand, since they are able to display both prefix and suffix information at the same time, allowing users to navigate the tree in either direction and offering a more complete exploration of the data. There are multiple ways to represent the relative importance of the items in a word tree. One of the most commonly used is varying font sizes, which displays more important items in a larger size. According to several studies [6,4,3], size is one of the features that the human brain is able to detect in a preattentive manner, that is, it belongs to a “limited set of visual properties that are detected very rapidly and accurately by the low-level visual system” [4]; when used correctly, it can draw attention to the most important items without the need for a conscious effort. Other characteristics that can be and have been used include color (hue and brightness) [4,1], opacity [5] and even blur [1].
2 Google Ngram Viewer and Associated Data
An application was developed to make use of the data set released by Google accompanying its Google Ngram Viewer (http://ngrams.googlelabs.com/), a visualization tool which allows plotting an n-gram against others in order to navigate the frequency data produced by analyzing the Google Books corpus. The English One Million corpus was chosen because it contains fewer books than the others and is therefore more manageable. The raw data still totalled several gigabytes and many billions of records, so it was necessary to strip it down as much as possible so that it could be handled with reasonable response times. Some of the decisions that were made are:
1. Case is not relevant. All n-grams are converted to lower case.
2. Punctuation is not relevant. n-grams containing punctuation are ignored.
3. The number of pages and books is not relevant and can be discarded.
4. Individual records for each n-gram and year pair are not necessary, and time information can be pre-processed and condensed into a string to greatly reduce the number of records.
With those constraints in mind, a Perl script was written to download the files, decompress them and process them sequentially, storing the results in a MySQL database. Another script queries the database and returns the results as text through an HTTP server.
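A condensed re-implementation of these steps in Python (the prototype itself used Perl and MySQL, and the "year:count" string format shown here is our own assumption) might look as follows.

    import string
    from collections import defaultdict

    PUNCT = set(string.punctuation)

    def condense(lines):
        """Collapse per-year records of the form
        'ngram TAB year TAB match_count TAB page_count TAB volume_count'
        into one 'year:count;year:count;...' string per lower-cased n-gram,
        dropping page/volume counts and n-grams containing punctuation."""
        per_ngram = defaultdict(dict)
        for line in lines:
            ngram, year, matches, _pages, _volumes = line.rstrip("\n").split("\t")
            if any(ch in PUNCT for ch in ngram):
                continue                           # decision 2: skip punctuation
            counts = per_ngram[ngram.lower()]      # decision 1: lower case
            counts[int(year)] = counts.get(int(year), 0) + int(matches)
        return {ng: ";".join("%d:%d" % (y, c) for y, c in sorted(cnt.items()))
                for ng, cnt in per_ngram.items()}

    sample = ["Western civilization\t1950\t120\t80\t40",
              "western civilization\t1950\t30\t20\t10",
              "western civilization\t1951\t95\t60\t33"]
    print(condense(sample))   # {'western civilization': '1950:150;1951:95'}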
3 CorpusExplorer
Once the data had been processed and stored, an application was written in Processing (http://processing.org) to visualize it interactively. A two-sided word tree is used to represent both prefix and suffix information for 2-grams. Font size was chosen as the primary means of showing the relative importance of a 2-gram, with all the vertical space divided linearly between the occurrences of all the words on each side; brightness is also used to draw some extra attention to the most frequent items. Lines of different width and brightness joining the central word to those on the sides show the relative frequency of 3-grams once a 2-gram is selected (see Figure 1). The items can be arranged either alphabetically or by frequency, and a slider was added above the word tree to control the maximum number of words that are displayed.
Fig. 1. Temporal distribution of the ”western civilization” 2-gram (using lines to mark 3-grams in the data set which start with ”western civilization”)
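The mapping just described can be illustrated with a small Python sketch (the actual prototype is written in Processing, and the constants here are illustrative assumptions): one side's vertical space is divided linearly by occurrence counts, and a brightness value is derived for each word.

    def layout_side(freqs, panel_height=600, min_brightness=90):
        """freqs: word -> occurrence count for one side of the word tree.
        Each word gets a share of the vertical space that is linear in its
        count, plus a brightness value that highlights frequent items."""
        total = float(sum(freqs.values()))
        top = max(freqs.values())
        layout = {}
        for word, count in freqs.items():
            slot_height = panel_height * count / total      # linear division
            brightness = min_brightness + (255 - min_brightness) * count / top
            layout[word] = (round(slot_height, 1), int(brightness))
        return layout

    # Suffixes of a selected 2-gram with invented counts:
    print(layout_side({"which": 40, "what": 25, "for": 10}))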
The use of multiple levels was decided against, to avoid having too much information on the screen. Instead, equivalent functionality is obtained by clicking on one of the words in the trees, which then becomes the central word, and the frequency information for it replaces that at the sides. Assuming that the interesting information is contained in the rising and falling trends in the usage of the 2- and 3-grams, sparklines were chosen for their ability to represent this kind of data in a simple and condensed way, liberating screen real estate for other uses. The application allows users to select up to four different
Fig. 2. Sample sparklines for the ”civilization which”, ”civilization what”, and ”civilization for” 2-grams. A 2-gram is added on each step.
2-grams simultaneously to see how their usage trends compare (see Fig. 2) by looking at a graph which merges all the sparklines.
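A small Python sketch (with invented data values) of how each selected 2-gram's yearly frequencies could be scaled into sparkline coordinates before the curves are merged into one graph:

    def sparkline_points(series, width=200, height=30):
        """series: year -> relative frequency for one 2-gram. Returns (x, y)
        pixel coordinates with y = 0 at the top; each series is scaled to
        its own min/max here (a shared scale would work the same way)."""
        years = sorted(series)
        lo, hi = min(series.values()), max(series.values())
        span = (hi - lo) or 1.0
        step = width / float(len(years) - 1) if len(years) > 1 else 0.0
        return [(round(i * step, 1),
                 round(height * (1.0 - (series[y] - lo) / span), 1))
                for i, y in enumerate(years)]

    civ_which = {1900: 0.8, 1950: 2.4, 2000: 1.6}   # invented example values
    civ_what = {1900: 1.1, 1950: 0.9, 2000: 1.9}
    merged = {name: sparkline_points(s)
              for name, s in (("which", civ_which), ("what", civ_what))}
    print(merged["which"])   # [(0.0, 30.0), (100.0, 0.0), (200.0, 15.0)]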
4 Conclusion
We believe that with the presented tool a linguist would be able to identify not only the most common groupings of words, as when using a concordance, but also to detect which words have substituted others in use, pinpoint the specific moment in time when a trend started reversing, and find other words with similar patterns.
Acknowledgments. This work was supported by the Ministerio de Ciencia e Innovación of Spain under project FI2010-16234.
References

1. Callaghan, T.: Dimensional interaction of hue and brightness in preattentive field segregation. Attention, Perception, and Psychophysics 36, 25–34 (1984), doi 10.3758/BF03206351
2. Culy, C., Lyding, V.: Double tree: an advanced KWIC visualization for expert users. In: 14th International Conference Information Visualisation, pp. 98–103 (2010)
3. Healey, C., Enns, J.: Large datasets at a glance: combining textures and colors in scientific visualization. IEEE Transactions on Visualization and Computer Graphics 5(2), 145–167 (1999)
4. Healey, C.G.: Perception in visualization, http://www4.ncsu.edu/healey/pp/index.html
5. Kosara, R.: Blur and uncertainty visualization, http://eagereyes.org/techniques/blur-and-uncertainty
6. Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cognitive Psychology 12(1), 97–136 (1980)
7. Wattenberg, M., Viegas, F.: The word tree, an interactive visual concordance. IEEE Transactions on Visualization and Computer Graphics 14(6), 1221–1228 (2008)
Glass Onion: Visual Reasoning with Recommendation Systems through 3D Mnemonic Metaphors

Mary-Anne (Zoe) Wallace

Research Group Digital Media, University of Bremen, Bibliothekstr. 1, D-28359 Bremen, Germany
[email protected]
Abstract. The Glass Onion is a project in its infancy. We aim to utilize the recommendation systems model as a solution to the oversaturation of data, and would like to explore the realm of personal relevancy through the implementation of information recommendation systems and information visualization techniques based on 3D-rendered graphic metaphors. The Glass Onion project seeks to shed light on human association pathways and, through our interaction with a visual recommendation system, to develop a personalized search and navigation method which may be used across multiple sets of data. We hope that by interacting with the Glass Onion 3D visualization recommendation system, guests will benefit from their own personal lens or onion, which can then be borrowed, rated, and utilized by others. Keywords: Recommendation Systems, 3D Information Visualization, HCI.
1 Digitized “Word of Mouth”
Recommendation systems are lending significance to the power of “word of mouth” by using the Internet's bidirectional communication capabilities [1], although digital word of mouth still seems to be more of a feature implemented in the realm of e-commerce. Along with these emerging commercial recommendation systems, we also have the public utilizing social networking platforms that help facilitate and support ongoing dialogues between internet social circles [2], [3]. Recognizing the value of digitized word of mouth and our increasing reliance on it in our casual interaction with online media, it is easy to see how our society is moving towards a more active, contributing role with respect to our media. Personal blogs, Last.fm, Amazon.com, Facebook - all are tools that encourage people to generate new ideas, develop new relationships, and create meaning for our media and information: from the realm of the commercial, to the academic, to even government institutions [3]. As the demand for personal recommendations grows, and we listen to others to see what to read or write and to dictate those subjects which are of immediate importance to our lives, can we lean on the model of the
"We envision information in order to reason about, communicate, document, and preserve that knowledge – activities nearly always carried out on two-dimensional paper and computer screen [...] Galileo's first precious observation of the solar disk, to small multiple images, to dimensionality and data compression, and finally to micro/macro displays combining pattern and detail, average and variation. Exactly the same design strategies are found, again and again, in the work of those faced with a flood of data and images, as they scramble to reveal, within the cramped limits of flatland, their detailed and complex information. These design strategies are surprisingly widespread, albeit little appreciated, and occur quite independently of the content of the data." [4]

Currently, our interaction with online media and the delivery of returned results occupy a flat representation of separator columns, text, and graphics using traditional design layouts [4]. But can we do more to make that data transparent? Can we utilize 3D graphics to tackle the problem of micro and macro vision, so important among 2D design elements, so that the audience can discern and perceive through multiple layers of context? [4]

Multiple challenges arise when attempting this kind of data visualization and interaction, and several projects [5], [6], [7] have experimented with 3D information visualization, particularly in trying to address micro/macro relationships in 3D imaging. We understand that, for humans, visual clutter can reduce our ability to make visual sense of, or define context for, the data we are in contact with; by understanding these elements of design we can design for "seeing" [4]. Some projects took on the challenge of mapping network relationships through a visual metaphor of city planning to create a geographic map. This brief paper explores the challenges of attempting to visualize user-defined values and assigned relationships between multiple sets of data (images, text, video, online links, and so on) in a visual metaphor.

As indicated above, the visual language of micro/macro relationships of data has been repeated, redefined, and perfected in terms of flat imaging [4]. In designing an information visualization application that uses 3D graphics together with recommendation systems, a visual metaphor must be chosen that supports this macro/micro visual environment. For this reason, we have turned to older illustrations which depicted multiple layers of data, to see how they applied visual grounding to complex ideas. These illustrations functioned as a teaching tool and "mnemonic" device [9] for conceptual grounding. In many of these alchemical illustrations we see simplified images representing multiple sets of data, both descriptive of the qualities of the data sets and explanatory of their relationships to each other. Constrained by the tools available to them, the artists worked on conveying the interconnectedness of data as clearly as possible to avoid over-saturation [8], [9].
2 Disorientation in Information Visualization and Metaphors for Visual Reasoning
Memory systems and mnemonic devices are among the best examples of illustrations that tackle the problem of disorientation in disembodied data through the act of visualizing the micro and the macro [8], [9]. We would like to take these mnemonic devices into consideration, as the goal of our project is to make data more transparent and to facilitate, through 3D visualization, that very human need for visual reasoning [4], [9]. The Glass Onion seeks to experiment with and identify the best working visual metaphor for studying data relationships and visualization for human-centered computing.

Ultimately the project would rely on the 3D rendering of dynamic models. To provide micro-to-macro visualization and categorization of information (cf. Fig. 1), our visual graphics need to be dynamic and may require nested hollow 3D spheres whose volume changes dynamically depending on the amount of returned data within them. Objects that consist of segments relating to a specific category would be color coded, and as a guest manipulates an object or travels within it to view the nested pieces, a consistent visual language is applied so that, at a glance, the guest can avoid feeling overwhelmed by the data each container returns. We will attempt to set the visual categorization of data according to a library classification scheme. Guests can determine the position of data within the nested objects, and through their actions (manipulating, adding to, or taking away from the environment) their contribution and profile may be recorded for fast retrieval of their workspace. In an empty environment, if a guest starts a search query through manipulation of a visual object, the returned results will at first display data from popular search engines, according to common statistics.
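To make the nested-container idea above more concrete, the following minimal sketch outlines one possible data structure for such "onion" layers. It is only an illustration under assumed names (Container, item_count, nested_layout are not from the paper) and does not describe the actual Glass Onion implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Container:
    """One 'onion' layer: a hollow sphere holding items and nested sub-containers."""
    category: str                      # e.g. a class from a library classification scheme
    color: tuple                       # RGB colour assigned to the category segment
    items: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def item_count(self) -> int:
        return len(self.items) + sum(c.item_count() for c in self.children)

    def radius(self, base: float = 1.0) -> float:
        # The sphere volume is meant to grow with the amount of returned data,
        # so the radius grows with the cube root of the item count.
        return base * (1 + self.item_count()) ** (1.0 / 3.0)

def nested_layout(container: Container, base: float = 1.0):
    """Return (container, radius) pairs, innermost first, so that each child
    sphere can be rendered inside its parent."""
    result = []
    for child in container.children:
        result.extend(nested_layout(child, base))
    result.append((container, container.radius(base)))
    return result
```

A renderer could then draw the spheres from the innermost outwards, using the category colour for each segment; the cube-root scaling is our own assumption about how "volume changes dynamically" might be realised.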
Fig. 1. Robert Fludd (1574–1637): Utriusque cosmi maioris scilicet et minoris metaphysica, physica atque technica historia. Oppenheim, 1619 [10].
Through interaction, the system may reference a "friends' list" or highlight other work nodes of interest, which can then be added to, built upon, and shared. One benefit of representing those results in a 3D visual metaphor is the ability to navigate through the various objects without the threat of losing context. The guest may see visual cues and categories indicating where one result falls in relation to another, and through the use of gestures be engaged in a craft of mind mapping and information gathering. The interaction with and manipulation of this data, in the case of breaking and creating relationships, relies heavily on the nesting and subsection capabilities of these dynamic objects and on their segmentation to uphold the onion metaphor. If users can save their work, this personalized search path may then be shared and utilized by others to influence their own information gathering process.
References

1. Dellarocas, C.: The Digitization of Word of Mouth: Promise and Challenges of Online Feedback Mechanisms. Manage. Sci. 49(10), 1407–1424 (2003), doi:10.1287/mnsc.49.10.1407.17308
2. Farnham, S.D., Brown, P.T., Schwartz, J.L.K.: Leveraging social software for social networking and community development at events. In: Proceedings of the Fourth International Conference on Communities and Technologies (C&T 2009), pp. 235–244. ACM, New York (2009), doi:10.1145/1556460.1556495
3. van Wamelen, J., de Kool, D.: Web 2.0: a basis for the second society? In: Proceedings of the 2nd International Conference on Theory and Practice of Electronic Governance (ICEGOV 2008), pp. 349–354. ACM, New York (2008), doi:10.1145/1509096.1509169
4. Tufte, E.: Envisioning Information. Graphics Press, Cheshire (1990)
5. Cox, K.C., Eick, S.G., He, T.: 3D geographic network displays. SIGMOD Rec. 25(4), 50–54 (1996), doi:10.1145/245882.245901
6. Ag, V., Beck, M.: Real-Time Visualization of big 3D City Models, isprs.org (2003)
7. Russo Dos Santos, C., Gros, P., Abel, P., Loisel, D., Trichaud, N., Paris, J.P.: Experiments in Information Visualization Using 3D Metaphoric Worlds. In: Proceedings of the 9th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE 2000), pp. 51–58. IEEE Computer Society, Los Alamitos (2000)
8. Synaptic for a real world in a virtual world, http://www.synaptic.ch/infoliths/textes/arsmem.htm
9. Yates, F.A.: The Art of Memory. The Guernsey Press Co. Ltd., Guernsey (1966, 2001)
10. Princeton University, History: Robert Fludd's Great Chain of Being, http://www.princeton.edu/~his291/Fludd.html
Visualizing Geospatial Co-authorship Data on a Multitouch Tabletop

Till Nagel1,2, Erik Duval1, and Frank Heidmann2
Abstract. This paper presents Muse, a visualization of institutional co-authorship of publications. The objective is to create an interactive visualization which enables users to visually analyze collaboration between institutions based on publications. The easy-to-use multi-touch interaction and the size of the interactive surface invite users to explore the visualization in semi-public spaces. Keywords: geo-visualization, tabletop interfaces, human computer interaction.
1 Introduction

There has been a vast amount of research in the areas of bibliometrics and scientometrics to extract and specify the metrics of scientific publication and citation networks. Several approaches to visualizing these networks have been reported ([1], [2], [3]). The objective of our visualization is not to study individuals and their personal co-authorship networks, but rather to enable analyzing the connection network of their affiliations. More specifically, our aim is to direct attention to the spatial relations, to allow users the visual exploration of their scientific neighborhood. For this, we created an interactive geo-visualization with an emphasis on the relations between universities and research centers, and their geographical origins. We do not intend to enable the visual analytics of massive amounts of publication data, but rather to support the exploration of scientific collaboration in a domain-specific field. The paper introduces Muse, a working prototype whose main purpose is to ease the exploration of collaborations between institutions. In addition, the use of a large tabletop display, as well as the aimed-for simplicity of visualization and interaction, is intended to invite attendees to participate and engage in discussions at a conference location.
single user system, but operates in a collaboratively created and used information space. The inter-institutional relationships are based on co-author data, as "co-authorship seems to reflect research collaboration between institutions, regions, and countries in an adequate manner" [4]. For this prototype, we harvested the data from the publishers of conference proceedings directly, as other publication services seldom provide address data. After cleaning and aggregating, we geo-coded the affiliation data. A single large world map is displayed, with all institutions and their relations based on co-authorship (Figure 1a). Cartographic information on the map comes from OpenStreetMap. We chose a reduced map style, which displays only a few geographical features. The aim of the prototype is to allow exploring and understanding the visualization of geo-spatial relations. Thus, the objective of the map is to support general recognition, while being discreet enough not to hinder the display of the data and interface layers. The map can be navigated freely, while institutions can be selected to get background information as well as their relations to other institutions. Countries can be selected to retrieve basic statistics on the publications of their authors and affiliations.

Interactions & Visualizations. Users are able to select the region they are interested in by panning and zooming the map through slide and pinch finger gestures (Figure 1b). Even though more complex map manipulations are possible, we chose this simple interaction approach in order to enable the user to concentrate on the map with less effort. Affiliations are represented by markers at their geo-locations. The size of a circle indicates the overall number of papers written by authors from that institution. In the lower left corner a legend explaining the size of the circles is shown (Figure 1a). By tapping on a circle, the name of its institution is displayed atop, and relations to other institutions are shown. Users can select a country by tapping on it; additional information on the selected country is then shown in a data widget in the lower right corner, similar to the first prototype. In detail, the numbers of papers, authors, and institutions over the years are displayed as bar diagrams (Figure 1a). As long as the user has not tapped anywhere, the widget displays a subtle hint to communicate this interaction pattern. When two countries are selected, the prototype displays the diagrams beside each other, allowing the user to compare them.

Connections. Relations between institutions are visualized by connecting lines between the two markers. We adhered to the schema most visualizations of social networks use, in that "points represent social actors and the lines represent connections among the actors" [5]. The visual lines connect two institutions transparently, so as not to obstruct the underlying map or markers. A visual connection is shown if authors from the two institutions published at least one paper together, but there is no indication of the amount of collaboratively written papers.
Fig. 1. (a) Map with one selected institution and its co-authorship connections. (b) A user performing the pinch gesture on the tabletop to zoom in on the map.
Instead, the visual style of the connections varies depending on the overall number of published papers of both the selected institution and the related institutions.
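As an illustration of the aggregation step described above, the following sketch derives marker sizes and institution-to-institution connections from a list of publications with geo-coded affiliations. It is a simplified assumption of how such an aggregation could look, not the authors' implementation; all names and the example data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Each publication lists the (already geo-coded) institutions of its authors.
publications = [
    {"title": "Paper A", "institutions": ["University X", "University Y"]},
    {"title": "Paper B", "institutions": ["University X", "Institute Z", "University Y"]},
]

paper_count = defaultdict(int)   # papers per institution, used for marker size
connections = defaultdict(int)   # unordered institution pairs with joint papers

for pub in publications:
    institutions = set(pub["institutions"])
    for inst in institutions:
        paper_count[inst] += 1
    # at least one joint paper is enough to draw a connection between two markers
    for a, b in combinations(sorted(institutions), 2):
        connections[(a, b)] += 1

def marker_radius(papers: int, scale: float = 4.0) -> float:
    # Scale the circle area, not the radius, with the paper count, so that a
    # marker standing for twice as many papers covers roughly twice the area.
    # (Area encoding is our design assumption, not stated in the paper.)
    return scale * papers ** 0.5
```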
3 Evaluation

We performed a formative user study with a working prototype on an interactive tabletop. The aim was to gather feedback on the intelligibility of the visualization and the usability of the interactions. The study was designed as a pluralistic usability walkthrough, guided by a semi-structured interview. We conducted the user study in a conference setting in order to report on users doing real tasks, while measuring the in-context usefulness of the prototype. We recruited nine male and three female participants, aged 27 to 52 years, from the attendees of the EC-TEL 2010 conference. We asked the participants to execute selected tasks and to answer questions concerning the legibility and understandability of the visualized information. In the post-test, we asked the participants to fill out a questionnaire on their opinions and preferences to determine the perceived usefulness of the visualization. We used a 5-point Likert scale with items ranging from "strongly disagree" (1) to "strongly agree" (5) for seven given statements. This survey was done privately and anonymously, with 11 out of 12 participants responding. The participants had great fun (median: 5) and were strongly satisfied using Muse (median: 5). Most agreed or strongly agreed (median: 4) with the statement that the visualization helped them to better understand research collaboration, but only a few found that it supported them in being more effective in research collaboration (median: 3.5). Overall, the participants strongly agreed that the prototype is useful (median: 5) and easy to use (median: 5).
4 Conclusion

We presented Muse, a working prototype for exploring collaborations between institutions. Our main objectives were to create a tool with simple interaction mechanisms and a comprehensible, aesthetically pleasing geo-visualization, so that interested stakeholders can use it without much effort. Visualization of spatial properties supports users in understanding geographical patterns. The plain display of the locations of institutions on a map helped users to see real-world clusters in a scientific field. The geographical distribution of the institutions, as well as the visualization of the amount of their publications, has been found to be easily understandable. Through interactive filtering, users are able to explore the relations between their affiliations and other institutions, and could gather insights into the collaborations in their research field. We see the Muse prototype with the used data set as a beneficial case study. The results of our usability study and the feedback gathered in expert interviews confirm the real need for such a tool.
References

1. Chen, C.: Trailblazing the Literature of Hypertext: Author Co-Citation Analysis (1989–1998). In: Proc. HT 1999. ACM, New York (1999)
2. Henry, N., Goodell, H., Elmqvist, N., Fekete, J.-D.: 20 Years of Four HCI Conferences: A Visual Exploration. International Journal of HCI 23(3), 239–285 (2007)
3. Ponds, R., Van Oort, F.G., Frenken, K.: The geographical and institutional proximity of research collaboration. Papers in Regional Science 86(3), 423–443 (2007)
4. Glänzel, W., Schubert, A.: Analyzing Scientific Networks through Co-authorship. In: Moed, H.F., et al. (eds.) Handbook of Quantitative Science and Technology Research, pp. 257–276. Kluwer Academic Publishers, Dordrecht (2004)
5. Freeman, L.: Visualizing Social Networks. Journal of Social Structure 1 (2000)
ElasticSteer – Navigating Large 3D Information Spaces via Touch or Mouse

Hidir Aras, Benjamin Walther-Franks, Marc Herrlich, Patrick Rodacker, and Rainer Malaka

Research Group Digital Media TZI, University of Bremen, Germany
Abstract. The representation of 2D data in 3D information spaces is becoming increasingly popular. Many different layout and interaction metaphors are in use, but it is unclear how these perform in comparison to each other and across different input devices. In this paper we present the ElasticSteer technique for navigation in 3D information spaces using relative gestures for mouse and multi-touch input. It realises steering control with visual feedback on direction and speed via a rubber band metaphor. ElasticSteer includes unconstrained and constrained navigation specifically designed for the wall, carousel or corridor visualisation metaphors. A study shows that ElasticSteer can be used successfully by novice users and performs comparably for mouse and multi-touch input. Keywords: 3D, navigation, constraints, information spaces, multitouch.
1 Introduction
With the recent success of applications like Bumptop, Cooliris, SpaceTime 3D, iTunes Cover Flow and many others, 3D presentations of 2D data are becoming mainstream. As ease of use and user experience are often foremost in these applications, touch interfaces present a natural choice as an input device. Yet, while touch is quasi-standard in the mobile world and growing for desktop use, support for 3D information browsers is sparse and rather ad hoc. We present the steering technique ElasticSteer for 3D information browsers on multi-touch and mouse input devices. ElasticSteer is based on a vehicle metaphor and multiplies 2 degrees of freedom (DOF) input with modifier multi-finger gestures or mouse buttons. It provides easy feedback on steering direction and speed with a rubber band metaphor. We designed and implemented two variants: free, unconstrained 6DOF steering and path-constrained 2DOF steering geared toward specific spatial metaphors used in 3D information browsing. An evaluation of our technique shows that constrained 2DOF navigation beats unconstrained navigation regarding task completion time, number of gestures used, and view resets. However, the latter allows the user to develop personal navigation strategies and often feels more immersive. We also show that touch input is almost as fast and efficient to use as the mouse while providing the better user experience.
Fig. 1. Navigating via ElasticSteer on an interactive surface (constrained mode) through corridor, carousel and wall visualisations
2 Related Work
Several research and commercial applications use 3D information spaces to explore collections of 2D data, such as web pages, search results, pictures, or videos. Cooliris is a Firefox plugin that visualises different forms of search results, primarily pictures and videos, horizontally along a 2D plane. Users can interact with the horizontally arranged items by flying along the wall. SpaceTime 3D visualises search results in the form of a stack in 3D. It allows browsing in 3D, zooming to particular areas of a page, following links, etc. The Sphereexplorer provides a stack and wall visualisation of web pages. It allows accessing web pages using translation along one of the axes from the center point and rotation around the x, y and z axes, while constraining translations and rotations. Other similar applications are the Giraffe Semantic Web Browser [3] and knowscape [1]. Research on 3D travel techniques for multi-touch devices has so far focused on direct manipulation [2,4,5]. Yet these techniques do not scale well to large information spaces, since covering large distances requires the user to constantly move fingers/hand/arm back and forth, which is physically straining and imprecise. The steering control metaphor known from Virtual Reality has so far hardly been used for touch-based devices.
3 ElasticSteer
In 3D information spaces the user often has to cover large distances for navigation and search tasks. Thus, a control technique that scales well from small, precise navigation to covering large distances is crucial. While one could minimise this problem using distinct gestures for switching scales, we strive for a more integrated approach using continuous interaction and a steering metaphor. ElasticSteer is based on a rubber band metaphor. This means that speed and direction are based on the relative direction and distance of the current point compared to the starting point of the gesture. This has three advantages: first, it scales well to large distances without requiring much movement, while still allowing precise small movements. Second, it visualises speed and direction. Third, it provides a physically inspired interaction style that is easy to understand. In order to reduce motor problems and keep mouse compatibility, we separated the control of 6 DOF into three 2 DOF input techniques.
Multi-finger gestures or mouse buttons switch between different DOFs: one finger (LMB) controls camera yaw and camera depth. Two fingers (MMB) allow translation within the view plane. Three fingers (RMB) control roll and pitch. For multi-finger gestures the geometric centre is used for calculating speed and direction. In order to determine the impact on navigation efficiency versus user experience and immersion, we further augmented our steering technique with path constraints. These reduce the control DOF to 2, but need to be tailored to the employed spatial visualisation metaphor. We developed path constraints for three state-of-the-art visualisation metaphors: corridor, carousel, and wall. In the corridor, objects are arranged in two rows. Supported navigation constrains movement to two axes: horizontal input movement is mapped to movement to the side while vertical movement is mapped to the depth axis. The carousel arranges objects in a circle. Supported navigation constrains movement along a circular path. The radius of the camera path, i.e. the distance from the information objects, can be adjusted by the user. Thus camera movement is constrained to two dimensions, horizontal input moving the camera on the circle, vertical movement changing the distance to the carousel. The wall arranges objects on a 2D plane. In supported navigation, movements take place in two dimensions within a rectangular area. Horizontal movement is mapped to moving sideways while vertical movement is mapped to moving towards/away from the wall.
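The rubber-band mapping can be summarised in a few lines of code. The following is a hedged sketch of the idea only, with illustrative gain and dead-zone values and hypothetical names; it is not the authors' implementation.

```python
import math

def elastic_steer(start, current, fingers, gain=0.02, dead_zone=10.0):
    """Map a rubber-band drag to per-frame camera velocities.

    start, current: 2D screen positions (gesture anchor and current centroid
                    of all touching fingers)
    fingers:        number of touching fingers, acting as the mode switch
    """
    dx, dy = current[0] - start[0], current[1] - start[1]
    dist = math.hypot(dx, dy)
    if dist < dead_zone:                     # band barely stretched: no movement
        return {}
    speed = gain * (dist - dead_zone)        # speed grows with the stretch
    ux, uy = dx / dist, dy / dist            # direction follows the drag

    if fingers == 1:        # yaw and depth
        return {"yaw": speed * ux, "depth": speed * uy}
    if fingers == 2:        # translation within the view plane
        return {"pan_x": speed * ux, "pan_y": speed * uy}
    return {"roll": speed * ux, "pitch": speed * uy}   # three or more fingers
```

In the constrained variants, the same stretch vector would presumably be fed into only the two DOFs permitted by the active metaphor, e.g. circle angle and radius for the carousel.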
4 User Study
We evaluated ElasticSteer by conducting a task-based user study with our own 3D browser application. Our objective was to compare the input methods mouse vs. touch and unconstrained vs. constrained navigation. Our setup was a 3D space (Figure 1) with 45 2D rectangles with coloured geometric shapes depicted on them. The task was a naive search task: "find the page with the green star". The three independent variables to be examined were input device (mouse, multi-touch), control mapping (constrained, unconstrained) and spatial metaphor (corridor, carousel, wall). For each task we measured completion time (ct), number of gestures used (g), and the number of times the camera position was manually reset (r). Each subject had to perform 36 tasks overall in four steps for each of the three views. The subjects were divided into different groups (mu, mc, tu, tc) depending on the tested input method and navigation metaphor. Subsets of these were then combined into independent groups with regard to the respective independent variable. Each participant filled out a questionnaire after the evaluation. Besides logging the user actions, a video recording was made. Sixteen subjects between 26 and 39 years of age (9 female and 7 male) participated, all right-handed, 7 classified as experts and 9 as novices. We evaluated the three spatial metaphors separately. As navigation paths may differ depending on the used view, we normalised the task completion time based on the distance to the initial camera position. The results of our quantitative analysis (Table 1) show that the input methods mouse and touch were quite comparable for corridor and wall. Significant differences can be observed for the number of gestures used in the carousel, which were far fewer for mouse input.
Table 1. Mean values for tested parameters ct, ng, r: mouse vs. touch (left), free vs. constrained (right)

          mouse            touch                       free              constrained
corridor  5.70/19.38/0     5.71/13.88/0      corridor  5.73/22/0         4.99/10.13/0
carousel  11.23/30.5/1.13  17.51/87/2.88     carousel  14.31/60.38/1.63  5.81/21.5/0.63
wall      4.67/26/0.13     5.59/40.88/0.63   wall      5.1/49.5/1.13     3.82/16.63/0
The results further show that constrained navigation was more efficient for all three tested parameters and each of the views. Giving subjective feedback on the input methods, participants stated that the time to complete a task was comparable, but touch gave them more freedom of movement, more fun, and better feedback on the speed and direction of the navigation. In particular, they liked the control and sensory feedback when using touch. A few users had the impression that, in contrast to mouse input, there is a more direct mapping of gestures to the virtual space. They felt more immersed in the virtual space navigation and the browsing scenario. Concerning the tested navigation alternatives, constrained navigation was regarded as the faster, easier, and most preferred method. Nevertheless, free navigation was credited with the advantages of freedom of movement and more possibilities to develop one's own strategies.
5 Conclusion
We presented the 3D navigation technique ElasticSteer, a steering control with visual feedback on direction and speed via a rubber band metaphor. We show it to be successfully used for browsing large information spaces. Since 3D information spaces often follow certain spatial metaphors, we further reduced the control DOF to 2 via path-constraints geared toward either the wall, carousel or corridor display. A study showed it to work almost equally well for mouse and multi-touch input devices, with a clear benefit for touch regarding user satisfaction and immersion.
References

1. Babski, C., Carion, S., Keller, P., Guignard, C.: knowscape, a 3d multi-user experimental web browser. In: Proc. SIGGRAPH, p. 315. ACM, New York (2002)
2. Hancock, M., Cate, T.T., Carpendale, S.: Sticky tools: Full 6dof force-based interaction for multi-touch tables. In: Proc. ITS (2009)
3. Horner, M.: The giraffe semantic web browser. In: Proceedings of the 12th International Conference on Entertainment and Media in the Ubiquitous Era, MindTrek 2008, pp. 184–188. ACM, New York (2008)
4. Martinet, A., Casiez, G., Grisoni, L.: The design and evaluation of 3d positioning techniques for multi-touch displays. In: Proc. 3DUI, pp. 115–118. IEEE, Los Alamitos (2010)
5. Reisman, J.L., Davidson, P.L., Han, J.Y.: A screen-space formulation for 2d and 3d direct manipulation. In: Proc. UIST, pp. 69–78. ACM, New York (2009)
Proxy-Based Selection for Occluded and Dynamic Objects

Marc Herrlich, Benjamin Walther-Franks, Roland Schröder-Kroll, Jan Holthusen, and Rainer Malaka

Research Group Digital Media TZI, University of Bremen, Bibliothekstr. 1, 28359 Bremen, Germany
Abstract. We present a selection technique for 2D and 3D environments based on proxy objects, designed to improve selection of occluded and dynamic objects. We explore the design space for proxies; of these properties we implemented colour similarity and motion similarity and tested them in a user study. Our technique significantly increases selection precision but is slower than the reference selection technique, suggesting a mix of both to optimise speed versus error rate for real-world applications. Keywords: selection techniques, proxy objects, computer graphics.
1 Introduction
Efficient and precise selection is a fundamental requirement of many graphical 2D and 3D applications. In applications that deal with many occluded and/or moving objects, selection can be a very difficult task. In this paper, we present a selection technique based on proxy objects for fast and precise selection of occluded and moving objects. It is designed to scale across input devices, from mouse input and single-touch to multi-touch. We describe the design space of proxy properties and explore a subset of these, colour similarity and motion similarity, in a user study. While the reference technique is faster, our technique significantly increases selection precision under difficult conditions, suggesting a mix of both techniques for real-world applications.
2 Related Work
The TapTap technique deals with finger occlusion and selection in cluttered situations by utilizing zooming [5]. The Handle Flags technique uses miniature representations for the selection of dense or occluding pen strokes [2]. Splatter temporarily separates occluded objects for direct manipulation; the original objects and the remaining scene are rendered differently [4]. Introducing constraints to reduce the degrees of freedom of the interaction is done by Bier [1]; we apply constraints to the proxy objects.
Fig. 1. Screenshots of proxy-based selection for occluded (left) and dynamic objects (right – object/proxy are moving in the direction indicated by the arrow). In both cases colour similarity is applied; for the dynamic object, motion similarity is also employed.
Interacting with dynamic objects has been investigated less extensively. Gunn et al. do this with the Target Lock technique: the selection sticks to the last object touched by the pointer [3]. Several techniques for the selection of dynamic targets are introduced by Schröder-Kroll et al. [6].
3 Proxy-Based Selection
Our technique is designed for selection of static and dynamic objects, especially in cluttered scenes with occlusion, in mouse, single-touch, and multi-touch as well as single- and multi-user environments. Although we only implemented and tested our method in a 2D application, it is easily generalizable to 3D.

3.1 Proxy Algorithm
The proxy algorithm is as follows:
1. The user touches/clicks near the object she wants to select.
2. If only one object is within the radius, the object is directly selected.
3. If several objects are within the radius, a proxy is generated for every object.
4. The user moves the finger/cursor over a proxy and releases it to select an object.
5. If the user releases the finger/mouse when not over a proxy, the proxies disappear.
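As a hedged illustration of this flow (not the authors' code; the threshold radius value, the circular proxy layout used below, the assumption that objects expose a .position attribute, and all names are ours), the selection step could look roughly as follows:

```python
import math

def proxy_select(touch_pos, objects, radius=60.0):
    """Return the directly selected object, or a list of proxies that the
    user can pick from before releasing the finger/mouse."""
    near = [o for o in objects if math.dist(touch_pos, o.position) <= radius]
    if len(near) == 1:
        return {"selected": near[0], "proxies": []}       # step 2: direct selection
    # step 3: one proxy per nearby object, laid out on a circle around the touch
    proxies = []
    for i, obj in enumerate(near):
        angle = 2 * math.pi * i / max(len(near), 1)
        proxies.append({
            "target": obj,
            "position": (touch_pos[0] + 2 * radius * math.cos(angle),
                         touch_pos[1] + 2 * radius * math.sin(angle)),
        })
    return {"selected": None, "proxies": proxies}         # steps 4/5 happen on release
```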
3.2 Proxy Properties

We identified the following properties for the proxy design:
– Threshold radius is used to determine the set of objects proxies should be generated for.
– Connecting visual cues, such as a line between the proxy and the referenced object, indicate the proxy relation.
– Colour similarity – a similar colour of the proxy object and the reference object can be a very strong visual cue.
– Motion similarity – proxies that reference dynamic objects incorporate the motion of the referenced object at a slower speed.
– Directed vibration instead of constant motion as a motion indicator constrains the movement of the proxy. We excluded this technique from the evaluation as we received unclear feedback in our informal tests.
– Collision detection and physics ensure that proxy objects do not overlap each other, i.e., proxy objects push each other out of their way.
– Spatial layout of the proxies is important to facilitate quick and precise selection. We opted for a circular layout.
– Relative size – the size of a proxy is adjusted to be proportional to the reference object's size (2D) or the reference object's distance (3D). Since normalizing the proxy sizes in a dynamic situation is not trivial, we currently exclude this.

We further exclude the shape dimension in order to get a clearer understanding of the other properties; our proxies are always of circular shape.
4 User Study
Our study pitted proxies against the widely used nearest-target selection for 3D scenes. Eight participants (5 m, 3 f) took part. The independent variables were motion similarity and colour similarity, resulting in four technique variants, plus a control condition. We employed a within-subjects design using a Latin square for counter-balancing. The task consisted of selecting a target object distinguishable by its colour among a number of other objects. We tested 3 different scenarios, with 4 positions of the target object per scenario. In the static scenario the target was neither moving nor occluded by other objects. In the occluded scenario the target had 2/3 of its area occluded by other objects. In the dynamic scenario the target was moving on a trajectory from one side of the screen to the other. Each scenario was repeated 4 times, resulting in 240 trials per participant (5 techniques × 4 repetitions × 3 scenarios × 4 positions). We conducted all tests on a 3M M2256PW multi-touch monitor. We used a 16-second timeout after which a trial was counted as unsuccessful, and counted falsely selected objects within this limit.

ANOVA revealed a significant learning effect between the first run-through and the following repetitions. Therefore we excluded the first run of each session from all further analysis. Paired t-tests on the selection time revealed the reference technique without proxies to be overall significantly faster than all proxy techniques. Within the group of proxy techniques, colour was significantly faster than the other proxy techniques, with colour and motion in combination being the overall fastest of the proxy techniques, if not significantly so. However, all proxy techniques significantly outperformed the reference technique regarding error rate. Colour was clearly the most influential factor for precision, and colour and motion in combination performed best, although not significantly better than colour alone.

Qualitative observation showed five participants to have been occasionally distracted by removing their fingers too soon from the surface, letting the proxy objects vanish immediately. In the dynamic scenario, four participants automatically tried to "catch" the target object directly on several occasions, completely ignoring the proxies.
On the other hand, every participant tended to choose and successfully use proxies for occluded objects. To our surprise, even the slow motion of the proxy objects with the motion technique posed a problem for one participant.
5 Discussion
While the evaluation shows proxy selection to be successfully put to use by novice participants, we have to revise our speed assumption compared to the reference technique. We were able to affirm our initial assumption that proxies are less error-prone, which makes proxy selection very useful for many real-world applications such as 3D modeling and animation software. Errors were also greatly reduced for the selection of dynamic targets. But the visual cue of the moving target object seems so strong as to distract especially novice users from using the proxies, suggesting that some learning is required. In our experiments colour was by far the most important visual cue; however, motion also had a positive effect and so could be used in addition to colour or when colour is not an option. In real-world applications a mix of direct and proxy-based selection might yield the best results, and multi-touch specific layout optimisations are still to be tested.
6 Conclusion
We presented a technique for selection in 2D and 3D environments based on proxy objects. We proposed different properties and design dimensions in order to help users in the selection task and explored a subset of these properties in a user evaluation. Our technique significantly increases selection precision under difficult conditions such as heavy occlusion. The reference automatic selection is still the faster technique, thus suggesting a mix of both for real-world applications to optimise speed versus error rate.
References

1. Bier, E.A.: Snap-dragging in three dimensions. In: ACM SIGGRAPH Computer Graphics, vol. 24, pp. 146–157. ACM, New York (1990)
2. Grossman, T., Baudisch, P., Hinckley, K.: Handle flags: efficient and flexible selections for inking applications. In: Proc. GI 2009, pp. 167–174. Canadian Information Processing Society, Toronto (2009)
3. Gunn, T.J., Irani, P., Anderson, J.: An evaluation of techniques for selecting moving targets. In: Proc. CHI 2009, pp. 3329–3334. ACM, Boston (2009)
4. Ramos, G., Robertson, G., Czerwinski, M., Tan, D., Baudisch, P., Hinckley, K., Agrawala, M.: Tumble! splat! helping users access and manipulate occluded content in 2D drawings. In: Proc. AVI 2006, pp. 428–435. ACM, New York (2006)
5. Roudaut, A., Huot, S., Lecolinet, E.: TapTap and MagStick: improving one-handed target acquisition on small touch-screens. In: Proc. AVI 2008, pp. 146–157. ACM, New York (2008)
6. Schröder-Kroll, R., Blom, K., Beckhaus, S.: Interaction techniques for dynamic virtual environments. In: Schumann, M., Kuhlen, T. (eds.) Virtuelle und Erweiterte Realität, 5. Workshop der GI-Fachgruppe VR/AR, pp. 57–68. Shaker Verlag (2008)
Integrated Rotation and Translation for 3D Manipulation on Multi-Touch Interactive Surfaces

Marc Herrlich, Benjamin Walther-Franks, and Rainer Malaka

Research Group Digital Media TZI, University of Bremen, Bibliothekstr. 1, 28359 Bremen, Germany
Abstract. In the domain of 2D graphical applications multi-touch input is already quite well understood and smoothly integrated translation and rotation of objects widely accepted as a standard interaction technique. However, in 3D VR, modeling, or animation applications, there are no such generally accepted interaction techniques for multi-touch displays featuring the same smooth and fluid interaction style. In this paper we present two novel techniques for integrated 6 degrees of freedom object manipulation on multi-touch displays. They are designed to transfer the smooth 2D interaction properties provided by multi-touch input to the 3D domain. One makes separation of rotation and translation easier, while the other strives for maximum integration of rotation and translation. We present a first user study showing that while both techniques can be used successfully for unimanual and bimanual integrated 3D rotation and translation, the more integrated technique is faster and easier to use. Keywords: 3d, multi-touch, integrated manipulation, gestures.
1 Introduction
In the domain of 2D graphical applications, multi-touch input is often characterized by smoothly integrated translation and rotation of objects, providing powerful and expressive, yet easy to use and understand interaction metaphors. However, for 3D applications no such generally accepted interaction techniques for multi-touch displays exist yet, although multi-touch input potentially offers additional degrees of freedom (DOF) to be exploited by these applications. While interaction with a 2D projection of a 3D space is an entirely different matter from pure 2D interaction in the first place, we think that some of the smoothness of 2D interaction can be transferred to the 3D domain. We present two variant multi-touch techniques for integrated 6DOF object manipulation based on the smooth 2D interaction capabilities provided by multi-touch input. We use approximations of 2D affine transformations mapped to 3D space and integrate them with well-established control techniques like the common "turntable" metaphor used in many 3D modeling tools.
Fig. 1. 3D translation and rotation with our proposed techniques. Two-finger controls also always work with more than two fingers. Although we portray unimanual control here, fingers need not be from the same hand.
The two variants are computationally efficient, allow flexible use of one or two hands, and support simultaneous as well as separate translation and rotation. The PieRotate variant allows a better separation of individual axis rotation and translation, while Turn&Roll strives for maximum integration (see Fig. 1). A first user study comparing the two techniques shows that while the latter outperforms the former in a 3D docking task, there are also arguments for a better separation, such as more precise control. Qualitative analysis also provides further insights into common manipulation strategies for 3D content on large multi-touch displays.
2 Related Work
2D manipulation techniques on multi-touch have been well researched. Kruger et al. [6] integrate 2D rotation and translation using a physically-inspired approach based on virtual friction depending on the relative position of the touch and the object center. In a user study they find their integrated RNT approach outperforms separated manipulation using corner handles. Moscovich and Hughes [9] present multi-finger cursors that integrate 2D rotation, translation, and scaling by calculating the approximated affine transformations determined by the contact points. We use the same approximation scheme as Moscovich and Hughes but map the resulting transformations into 3D space. For a good discussion of simple 2D translation and rotation schemes, we refer to Hancock et al. [2]. Research on 3D manipulation only recently started to receive more attention from the research community. Hancock et al. [1] describe several multi-finger techniques extending Kruger’s RNT, theoretically enabling 5 and 6 DOF manipulation. However, they only explore these in a very limited depth range. They use dedicated areas and a fixed mapping of the first three touches. Furthermore, their approach seems to work best with bimanual control. More recently Hancock et al. [3] extended these techniques, also proposing a pinch gesture for depth translation. However, the mapping of the touches is still fixed on the order and limited to three fingers. Reisman et al. [13] use a non-linear least squares
energy optimization scheme to calculate 3D transformations directly from the screen space coordinates of the touch contacts. While their approach seems powerful, it is computationally very expensive and, more importantly, has intrinsic ambiguities which they resolved by introducing empirically determined biases. Finally, Martinet et al. [7] study techniques that use an additional finger for 3 DOF translation, but do not investigate rotation. In the area of physically-based interaction, Wilson et al. [15] and Wilson [14] present techniques for integrating touch input into a physics simulation and for simulating grasping of 3D objects. However, in their described form both approaches are limited to more or less 2D interaction only. The question of whether integrated rotation and translation is desirable has been extensively discussed. However, to our knowledge it is still undecided, or seems at least highly dependent on the task, as noted by Jacob et al. [5]. Hancock et al. discuss the seemingly contradictory results of research in this area [1], while results of Moscovich and Hughes [10] suggest that unimanual integrated control of position and orientation might be desirable and possible. On the other hand, Nacenta et al. [11] present different techniques for the separation of rotation and translation manipulations.
3 Proposed Manipulation Techniques
The design space for 3D manipulation strategies on multi-touch screens can be roughly divided into three approaches. Simply porting handles (or gizmos), the manipulation tools used in single-point control CAD and animation software, to interactive screens is problematic, as they are designed for high-resolution, precise 2DOF control, whereas touch is characterized by less precise control of more DOFs. Full 3D affine mappings are hard to compute and require empirical biases for solvers in ambiguous situations [13]. Specific multi-finger mappings can use the additional DOFs provided by more than one touch point in different ways: the number of fingers can be used to modally switch between manipulation states/tools, thus multiplying the DOFs [8], which is unsuitable for integrated manipulation. Assigning individual DOF control to each finger, e.g. dependent on order [1], results in interaction that is increasingly complex and hard to control simultaneously. We approach 3D manipulation with 2D affine mappings which we map to 3D space and devise in a way that integrates further degrees of freedom. Our approach has three distinct advantages no existing approach offers:

– Computationally simple and robust. Affine 2D mappings are easy to compute and work very well in 2D space. They are robust to the use of more fingers and do not rely on any special order of touches. The more fingers used, the more stable the interaction and the more fine-grained the control becomes, because of the shared influence of each touch.
– Flexible regarding manuality. Our techniques can be operated with one hand, keeping the other free for additional interactions such as camera manipulation. They should easily incorporate bimanual input where beneficial, such as to achieve more fine-grained or coarse control.
– Simultaneous and independent control. Our claim is that the techniques are integrated, meaning that they allow either rotation, translation, or both in a consistent metaphor. The assumption is that simultaneous control is more efficient, while separate control can also be desired [11].

Our compound techniques are made up of a common translation technique we call Z-Pinch and two different rotation controls which we named Turn&Roll and PieRotate (see Fig. 1). The combination of Z-Pinch with Turn&Roll is designed for an easier integration of rotation and translation, while the combination with PieRotate is geared toward more separate rotation and translation. The controls are activated when touches occur in the object control area, which is the (invisible) projection of the bounding sphere of an object.

3.1 Z-Pinch
Our screen z or depth manipulation is based on existing research, employing an additional finger to determine the change in depth [1,7]. We conceptualize this as the equivalent of a pinch gesture often used to scale objects in a 2D scenario. Thus the change in finger span is mapped to depth: a decrease in finger span moves the object away from the screen, an increase toward it. By conceptualising it as Z-Pinch, we emphasize that it can be easily used unimanually, typically with the span between index finger and thumb. Nevertheless, a different finger or more than two fingers from the same or the other hand can also be used.

3.2 Turn and Roll
The Turn&Roll technique combines the turntable rotation metaphor commonly used in 3D editing with z-axis rotation or roll from affine 2D transforms. The two different modes are realized with a change in the number of fingers touching the object: horizontal movement of one finger rotates around screen y, vertical movement around screen x, and a 2+ finger twist gesture rotates around screen z. This is combined with a 2+ finger pan for screen x/y translation and pinch for z translation (see Fig. 1). Turn&Roll with Z-Pinch thus enables simultaneous 4DOF control of 3D position and z rotation, with easy switching to 2DOF x/y rotation control with only one finger.

3.3 PieRotate
PieRotate uses regions in the object control area to rotate around a specific axis. A twist gesture rotates the object around the x, y, or z axis, depending on where it is executed. The regions are arranged in a three-part pie, giving the technique its name. This is combined with a 1+ finger pan for screen x/y translation and a 2+ finger pinch for z translation (see Fig. 1). Translation ignores the pie areas. PieRotate with Z-Pinch thus enables simultaneous 4DOF control of 3D position and any one rotation axis. The rotation axis can be switched by registering the twist gesture in a different part of the object control area.
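The finger-count-based mode switching of Turn&Roll combined with Z-Pinch can be sketched as follows. This is only an assumed, simplified reading of the mapping described above (all function and parameter names are ours, and the sensitivity value is arbitrary), not the code used in the prototype.

```python
import math

def turn_and_roll(touches_prev, touches_curr, sensitivity=0.01):
    """Derive per-frame 3D manipulation deltas from touch points.

    touches_prev, touches_curr: lists of (x, y) contact points for the same
    fingers in the previous and the current frame.
    """
    n = len(touches_curr)
    cxp = sum(p[0] for p in touches_prev) / n
    cyp = sum(p[1] for p in touches_prev) / n
    cxc = sum(p[0] for p in touches_curr) / n
    cyc = sum(p[1] for p in touches_curr) / n
    dx, dy = cxc - cxp, cyc - cyp

    if n == 1:
        # one finger: turntable rotation around screen y (horizontal drag)
        # and screen x (vertical drag)
        return {"rot_y": sensitivity * dx, "rot_x": sensitivity * dy}

    # two or more fingers: pan of the centroid translates in the view plane
    delta = {"trans_x": dx, "trans_y": dy}
    # Z-Pinch: change in average distance to the centroid maps to depth
    span_p = sum(math.hypot(p[0] - cxp, p[1] - cyp) for p in touches_prev) / n
    span_c = sum(math.hypot(p[0] - cxc, p[1] - cyc) for p in touches_curr) / n
    delta["trans_z"] = span_c - span_p
    # twist: average angular change around the centroid maps to roll (screen z);
    # wrap-around of the angle is ignored in this sketch
    delta["rot_z"] = sum(
        math.atan2(c[1] - cyc, c[0] - cxc) - math.atan2(p[1] - cyp, p[0] - cxp)
        for p, c in zip(touches_prev, touches_curr)) / n
    return delta
```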
4 User Study

4.1 Goals
The two techniques are similar in their inheritance of simple manipulation concepts from 2D affine mappings and in how these are applied to position control, but differ in the rotation metaphor and in how it is combined with translation. We performed a small evaluation to study the impact on performance and user satisfaction in a full 3D docking task. We also wanted to observe docking strategies regarding simultaneous and independent translation and rotation and the use of unimanual vs. bimanual control. We decided against a mouse/keyboard/tablet baseline because our goal is not to replace the mouse/keyboard/tablet interfaces used by expert users in their dedicated workplaces.

4.2 Participants
Five males and three females, most from a computer science background, participated in our study. All were right-handed. Two had limited experience with 3D applications, three advanced, and three highly advanced experience. Five had passing experience with touch interfaces such as mobile phones and interactive displays, three full experience. We welcomed this diversity in order to gain an impression of how prior experience is reflected in the use of touch control for 3D.

4.3 Apparatus
Participants performed the tasks standing at an impressx xdesk, which uses the diffused illumination technique. The table has a height of 90 cm and a screen diagonal of 52 inch. The rear-projected image has a resolution of 1280 × 800 pixels. It uses two cameras with 640 × 480 pixels resolution that overlap to give an approximate virtual camera resolution of 640 × 900. The tracker provides TUIO messages that were fed to our custom-built XNA application.

4.4 Task
The 3D docking task is the one used in previous 3D manipulation experiments [1,4]. A tetrahedron with unique corners/vertices has to be placed in a transparent tetrahedron. The target tetrahedron turns yellow when the distances between corresponding vertices fall below a threshold corresponding to 10% of the tetrahedron edge length. After successful docking the next task begins with a different position and orientation of the transparent object. The setup offers depth cues of perspective, occlusion, and shadow on a ground plane.

4.5 Design
We used a repeated measures within-subjects design. The independent variables were technique (Turn&Roll and PieRotate), position (top left front, bottom left back, top right back, bottom right front in a cubic layout around the target object), and orientation (180° rotation around the x, y, and z axis). The order of techniques was counter-balanced between participants. Each participant performed two repetitions of the same random order of all position/orientation combinations, resulting in 48 trials per participant.
Fig. 2. Mean task completion times for each technique
At the beginning of each technique block, the technique was explained and participants had time to get used to the technique and the task. We made sure not to indicate which or how many hands to use.
5 Results

5.1 Task Completion Times
Participants completed 57% of the tasks with PieRotate and 85% of the tasks with Turn&Roll. We observed slightly lower mean times in the second block for the PieRotate technique, but there was no significant learning effect across blocks. A paired samples t-test showed the mean times for the tasks performed with Turn&Roll to be significantly lower (t(97) = 3.166, p = 0.002). A repeated measures analysis of variance also showed a significant effect for position (F(3,30) = 3.971, p = 0.017) and a significant interaction between position and technique (F(3,24) = 3.815, p = 0.023). A paired samples t-test showed the aggregated times for positions in the upper screen half to be significantly higher than for positions in the lower half with the PieRotate technique (t(25) = 3.403, p = 0.002).

5.2 Manipulation Strategies
The general strategy taken by participants was to start by roughly orienting the target to match the other object, then positioning it on the other object, then doing fine adjustment of rotation and position until the objects matched. Five participants only performed rotation and translation operations serially, while three also considerably mixed serial with parallel control of orientation and position. There was no preference for manuality between subjects: three performed two-finger gestures with the index fingers of each hand, three with thumb and index or middle finger and index of one hand, and two took a mainly unimanual approach, one using the index of the second hand only for Z-Pinch, the other for more fine-grained or coarse control in rotation or translation.
Most participants used the shadow as the main depth cue during z-translation, sometimes not even looking at the objects themselves. In general, the usage of different control zones in the PieRotate technique posed no problem. Some switched their hands for the top left/right areas of the pie. One participant fundamentally did not understand the mapping and did not complete a single task. Only one participant combined translation and rotation with this technique. Turn&Roll encouraged three participants to perform translation and rotation in parallel, two of these even with 4DOF control of 3D position and z rotation. The fast switching between one and two fingers for 3DOF rotation control was widely used by two participants.

5.3 Qualitative Feedback
Participants in general preferred the Turn&Roll variant. The main reason given was that it was easier to use and more intuitive. They also found it easier to correct mistakes, and one explicitly praised the fluid change between rotation and translation. Participants complained about the cognitive load of memorizing the axis mappings in the PieRotate technique. However, many judged PieRotate to offer more precise control. Some even assumed it might take longer to master but in the end might outperform the other technique.
6 Discussion
Task completion rates and times as well as subjective feedback clearly favor the Turn&Roll technique. It was judged more intuitive and was easier to master. In some cases it even encouraged fast input switching between one- and two-finger gestures, which was not observable for the other technique. Nevertheless, we believe that PieRotate has its benefits. As participants stated, it is harder to learn but might provide better control when fully mastered. This raises the question of how far the "walk up and use" assumption [12] holds for experiments on complex tasks such as 3D manipulation, for which a certain specialized expertise is necessary. PieRotate could likely benefit from visual feedback. This would emphasize its widget character; for a comparison of more direct approaches like Turn&Roll with widget-based controls, more research on 3D widgets for touch interaction is needed. While sequential manipulation strategies dominate for the 3D docking task, parallel control does occur, especially with the Turn&Roll method. This possibly also contributed to the improved task completion times. It might even be beneficial to enforce parallelized input as we did (two-finger control allows simultaneous translation and rotation, depending on the gesture), as our study showed that it was used even when not intended. All in all, the results favor integrated rotation and translation. However, it is necessary to further optimize integration to also allow clean rotation or translation, such as with magnitude filtering [11]. The general problem of dissonance between 3D perception and 2D control also holds for our setup, and more depth cues such as parallax should be supported. In the future we intend to add camera controls for a real 3D manipulation environment.
7 Conclusion
We presented techniques for 3D manipulation on multi-touch screens based on 2D affine mappings well known from 2D multi-touch interaction. This approach makes the techniques computationally robust and usable with one or two hands, and it allows integrated translation and rotation. We presented two variant techniques, one of which we assumed to favor sequential, the other more parallel use of translation and rotation. A first experiment favored the latter technique regarding task completion times and perceived intuitiveness. We thus have a simple, robust, and flexible solution for full 3D manipulation on interactive tables, which future research on 3D applications on multi-touch interactive screens can build on.
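To make the underlying mapping concrete, the following sketch (Python/NumPy; the function and variable names are ours and not taken from the paper) shows one way an integrated translation and z rotation can be derived from the frame-to-frame motion of two touch points. It is a minimal illustration of the general idea, not the authors' implementation.

import numpy as np

def two_finger_transform(p1_old, p2_old, p1_new, p2_new):
    """Derive an integrated 2D translation and rotation angle from the motion
    of two touch points between frames (illustrative sketch only)."""
    p1_old, p2_old = np.asarray(p1_old, float), np.asarray(p2_old, float)
    p1_new, p2_new = np.asarray(p1_new, float), np.asarray(p2_new, float)

    # Translation: displacement of the two-finger centroid.
    translation = (p1_new + p2_new) / 2.0 - (p1_old + p2_old) / 2.0

    # Rotation: change of the angle of the vector spanned by the two fingers.
    v_old, v_new = p2_old - p1_old, p2_new - p1_new
    angle = np.arctan2(v_new[1], v_new[0]) - np.arctan2(v_old[1], v_old[0])
    return translation, angle

# Example: the finger pair moves and turns by roughly 90 degrees.
print(two_finger_transform((0, 0), (1, 0), (1, 1), (1, 2)))

Applying the returned translation to the object's x/y position and the angle to its z rotation in every frame yields the kind of integrated control described above.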
Acknowledgements This work was partially funded by the Klaus Tschira Foundation.
References
1. Hancock, M., Carpendale, S., Cockburn, A.: Shallow-depth 3d interaction: design and evaluation of one-, two- and three-touch techniques. In: Proc. CHI, pp. 1147–1156. ACM, New York (2007), http://dx.doi.org/10.1145/1240624.1240798
2. Hancock, M., Carpendale, S., Vernier, F., Wigdor, D., Shen, C.: Rotation and translation mechanisms for tabletop interaction. In: Proc. TABLETOP, pp. 79–88. IEEE, Los Alamitos (2006), http://dx.doi.org/10.1109/TABLETOP.2006.26
3. Hancock, M., Cate, T.T., Carpendale, S.: Sticky tools: Full 6dof force-based interaction for multi-touch tables. In: Proc. ITS (2009)
4. Hancock, M., Hilliges, O., Collins, C., Baur, D., Carpendale, S.: Exploring tangible and direct touch interfaces for manipulating 2d and 3d information on a digital table. In: Proc. ITS (2009)
5. Jacob, R.J.K., Sibert, L.E., McFarlane, D.C., Mullen, M.P.: Integrality and separability of input devices. Comput.-Hum. Interact. 1(1), 3–26 (1994), http://dx.doi.org/10.1145/174630.174631
6. Kruger, R., Carpendale, S., Scott, S.D., Tang, A.: Fluid integration of rotation and translation. In: Proc. CHI, pp. 601–610. ACM, New York (2005), http://dx.doi.org/10.1145/1054972.1055055
7. Martinet, A., Casiez, G., Grisoni, L.: The design and evaluation of 3d positioning techniques for multi-touch displays. In: Proc. 3DUI, pp. 115–118. IEEE Computer Society, Los Alamitos (2010), http://dx.doi.org/10.1109/3DUI.2010.5444709
8. Matejka, J., Grossman, T., Lo, J., Fitzmaurice, G.: The design and evaluation of multi-finger mouse emulation techniques. In: Proc. CHI, pp. 1073–1082. ACM, New York (2009), http://dx.doi.org/10.1145/1518701.1518865
9. Moscovich, T., Hughes, J.F.: Multi-finger cursor techniques. In: Proc. GI, pp. 1–7. Canadian Information Processing Society (2006), http://portal.acm.org/citation.cfm?id=1143081
10. Moscovich, T., Hughes, J.F.: Indirect mappings of multi-touch input using one and two hands. In: Proc. CHI, pp. 1275–1284. ACM, New York (2008), http://dx.doi.org/10.1145/1357054.1357254
11. Nacenta, M.A., Baudisch, P., Benko, H., Wilson, A.: Separability of spatial manipulations in multi-touch interfaces. In: Proc. GI, pp. 175–182. Canadian Information Processing Society (2009), http://portal.acm.org/citation.cfm?id=1555919
12. Olsen, D.R.: Evaluating user interface systems research. In: Proc. UIST, pp. 251–258. ACM, New York (2007), http://dx.doi.org/10.1145/1294211.1294256
13. Reisman, J.L., Davidson, P.L., Han, J.Y.: A screen-space formulation for 2d and 3d direct manipulation. In: Proc. UIST, pp. 69–78. ACM, New York (2009), http://dx.doi.org/10.1145/1622176.1622190
14. Wilson, A.D.: Simulating grasping behavior on an imaging interactive surface. In: ITS (2009)
15. Wilson, A.D., Izadi, S., Hilliges, O., Mendoza, A.G., Kirk, D.: Bringing physics to the surface. In: Proc. UIST, pp. 67–76. ACM, New York (2008), http://dx.doi.org/10.1145/1449715.1449728
Left and Right Hand Distinction for Multi-touch Displays
Benjamin Walther-Franks, Marc Herrlich, Markus Aust, and Rainer Malaka
Research Group Digital Media, TZI, University of Bremen, Germany
Abstract. In the physical world we use both hands in a very distinctive manner. Much research has been dedicated to transferring this principle to the digital realm, including multi-touch interactive surfaces. However, without the possibility to reliably distinguish between hands, interaction design is very limited. We present an approach for enhancing multi-touch systems based on diffuse illumination with left and right hand distinction. Using anatomical properties of the human hand, we derive a simple empirical model and heuristics that, when fed into a decision tree classifier, enable real-time hand distinction for multi-touch applications. Keywords: multi-touch, hand distinction, diffuse illumination.
1 Introduction
In the physical world we quite naturally use both our hands to their full potential. In the digital world, users have long been restricted to quite limited interaction regarding bimanuality. Ever since the work of Guiard on the asymmetric use of hands in bimanual interaction [4], researchers have investigated the use of bimanual interaction in the digital world, for example making use of two mice for navigation and manipulation in 3D environments [1]. The advancement of interactive surfaces such as multi-touch displays provides new opportunities for high-degree-of-freedom interaction with both hands. In order to create interfaces that fully use the potential of bimanual interaction, it is necessary for the system to be able to assign touches to a user's left and right hand. Hardware devices and software frameworks currently in use are not able to provide this kind of information. We investigated heuristics and classification methods for distinguishing between the left and right hand on optical interactive surfaces. The result is an empirical model based on anatomical properties of the hand and lower arm and on observations of typical user behavior that allows us to train a decision-tree-based classifier for hand detection. The main idea is to take the whole arm-hand-chain and its position and orientation into account. Preliminary tests using a small sample data set already yield promising results.
2 Related Work
While some work has been done in the area of user detection and distinction for multi-touch displays and tabletops, e.g., using electrical properties [3] or additional ultrasonic and infrared sensors [6], hand distinction for multi-touch hardware is still mostly an unsolved problem. Closely related to this paper is the work of Dang et al. [2], who investigated simple heuristics for hand detection, i.e., identifying touches belonging to the same hand(s), but not for hand distinction. Other recent works use additional hardware or heavy instrumentation of the user, such as gloves with attached fiducial markers [5]. In contrast, our approach does not require any additional hardware for a standard diffuse illumination setup.
3 Properties of the Human Hand
The human hand is a very versatile tool due to its capability of precise articulation, but it cannot be used beyond the limits of its inner structure. The question is to what extent the hand can be articulated in touch interactions. In the ulnar-radial direction, fingers possess a triangle-shaped interaction range, while the interaction range of the thumb is elliptic. The thumb's touch blob is usually slightly bigger than those of the fingers due to a smaller touch angle. The wrist joint can be bent 30° in the ulnar-radial direction, which rotates all distally following hand and finger parts accordingly. In the palmar-dorsal direction each finger can be flexed and extended while touching or not touching the surface. This means fingers appear as independent phenomena in the input images. When flexed jointly, they can form one single big bright blob in the input image. The wrist joint is responsible for the overall posture of the hand; in the dorsal-palmar direction it articulates up to 60°. When touching, the hand can be held far from the surface, which darkens the palm so that it becomes invisible at almost every illumination level. Alternatively, it can be held parallel to the surface, rendering the palm almost as bright as the touches. To make things worse, the wrist can rest on the surface, producing a huge bright spot itself. All these factors complicate the fine-tuning of the threshold levels.
4 An Empirical Model for Hand Distinction
We make five assumptions for algorithmic hand distinction based on the previously described hand properties. For this we assume a double-thresholded, contrast-normalized input image from an infrared acquisition device (with a bright level for objects closest to the surface and a dark level for those at a certain distance). First, we assume that without a touch there is no interaction. This means the whole processing chain for hand distinction starts only after a first touch has been recognized. When nothing touches the surface there is simply not enough discriminating detail to recognize hand, arm, or even finger parts. Secondly, we state that within double-thresholded images a bright-level touch blob always resides within a bigger dark-level blob of the whole hand-arm structure. This allows us to assign (finger) touches to their hands. Obviously,
whenever a finger touches the surface the respective hand must be at least nearby, so it becomes visible at darker illumination levels too. Thirdly, finger touches do not have a reliable elliptic orientation. Finger tips produce bright, distinctive spots in an infrared image when touching, but the angle between the finger tip and the interaction surface can differ greatly depending on the user and the interaction situation, causing blob shapes to vary between circular and highly elliptic. Precise measurement of touch ellipse angles requires a sufficient acquisition resolution, which is not always guaranteed. Next, we assume that the enclosing big dark blobs provide a reliable orientation. Mainly because of their size, they do not suffer from the disadvantages of finger touch blobs when it comes to retrieving an orientation. Our last assumption concerns the relative position of finger touches: finger blobs always reside at the end of their corresponding dark blob. This allows us to eliminate false touch blobs if they are too far away from the proper hand end.
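As a rough sketch of how the first two assumptions might be operationalized, the following Python fragment double-thresholds an 8-bit infrared image and assigns each bright touch blob to the enclosing dark hand-arm blob. The threshold values and the use of SciPy's connected-component labelling are our own assumptions; the paper does not specify them.

import numpy as np
from scipy import ndimage

def assign_touches_to_hands(ir_image, bright_thresh=200, dark_thresh=90):
    """Double-threshold an 8-bit IR image and map each bright touch blob to the
    dark hand-arm blob that contains it (thresholds are illustrative)."""
    bright = ir_image >= bright_thresh          # fingertips touching the surface
    dark = ir_image >= dark_thresh              # hand/arm hovering near the surface
    touch_labels, n_touches = ndimage.label(bright)
    arm_labels, n_arms = ndimage.label(dark)

    assignment = {}
    for t in range(1, n_touches + 1):
        ys, xs = np.nonzero(touch_labels == t)
        cy, cx = int(ys.mean()), int(xs.mean())  # touch blob centroid
        arm_id = arm_labels[cy, cx]              # dark blob containing the centroid
        if arm_id > 0:
            assignment[t] = arm_id               # touch belongs to this hand-arm blob
        # else: a stray bright blob with no enclosing arm is discarded as noise
    return assignment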
5 Implementation and Results
Based on double-thresholded greyscale images and standard blob detection, the following data is used as input for training/detection in a decision tree: blob sizes, blob positions, and the orientation of the arm blob. All coordinates are arranged and normalized relative to the arm, with the origin at the corner of the hand blob. All data is sorted by touch position (right to left) and composed into input vectors for the decision tree algorithm. Missing data is filled up with zeros. We used a standard implementation of a decision tree from the Weka library. Decision trees provide a good match for the hierarchical nature of the input data, they are fast, and they are able to handle the absence of data. To train the classifier we selected a number of simple yet frequently used finger touch postures for testing our framework. These postures were: touch with the index finger, double touch with index finger and thumb (e.g. used in the pinch-to-zoom gesture), and touch with all five fingertips. Because of the great effect the wrist joint articulation has on the illumination of the hand, all three finger postures were performed with the wrist joint either resting on the surface or held high above the surface. These six combined postures were then repeated for both hands at 12 evenly distributed locations on the interaction plane, resulting in multiple recordings of an overall number of 144 postures and locations. In first tests we achieved around 80% detection rate after relatively short training periods (Fig. 1). This suggests that with a broader evaluation and training period we could achieve even better performance. Of course, an 80% detection rate is only a first step in the right direction. The remaining problems seem to be related mostly to border cases. Uneven illumination at the image borders as well as distorted proportions when only a small part of the arm is visible remain challenges. When only a small part of the arm is visible, the usual elliptic shape is not observable and even the origin may lie outside the view area. The training set also needs to be enlarged in the future to be more representative of different users. Currently, no consistency checks
are performed on the data over time, but such checks could potentially increase detection performance considerably. For example, if the system has already identified the hands successfully and only one hand is lifted from the surface, there should be no unnecessary (and possibly erroneous) redetection.

Fig. 1. Detection rate versus training samples for three test runs
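The paper trains a decision tree with the Weka library; purely as an illustrative analogue, the sketch below assembles feature vectors as described above (arm orientation, blob sizes, right-to-left sorted touch positions, zero-padding for missing touches) and trains scikit-learn's DecisionTreeClassifier on toy data. The feature layout and the constant MAX_TOUCHES are assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

MAX_TOUCHES = 5  # assumption: up to five fingers per hand

def make_feature_vector(touches, arm_angle, arm_blob_size):
    """touches: list of (x, y, blob_size) in arm-relative coordinates.
    Sorted right to left and zero-padded, as described in the text."""
    touches = sorted(touches, key=lambda t: -t[0])[:MAX_TOUCHES]
    flat = [v for t in touches for v in t]
    flat += [0.0] * (3 * MAX_TOUCHES - len(flat))  # missing data filled with zeros
    return [arm_angle, arm_blob_size] + flat

# Toy stand-in for the recorded postures (real data would come from the DI camera).
left = make_feature_vector([(0.2, 0.8, 30), (0.5, 0.9, 45)], arm_angle=-0.6, arm_blob_size=9000)
right = make_feature_vector([(0.3, 0.7, 32), (0.6, 0.85, 40)], arm_angle=0.7, arm_blob_size=8800)
X, y = np.array([left, right]), np.array([0, 1])   # 0 = left hand, 1 = right hand

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([right]))  # -> [1], i.e. classified as the right hand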
6 Conclusion and Future Work
In this paper we presented heuristics and a classification approach for hand distinction on state-of-the-art optical interactive surfaces. Our approach integrates easily into existing image processing pipelines. We take into account not only the touches but the complete arm-hand-chain above the surface. We presented preliminary tests suggesting great potential for our method.
References
1. Balakrishnan, R., Kurtenbach, G.: Exploring bimanual camera control and object manipulation in 3d graphics interfaces. In: Proc. CHI, pp. 56–62. ACM, New York (1999)
2. Dang, C.T., Straub, M., André, E.: Hand distinction for multi-touch tabletop interaction. In: Proc. ITS (2009)
3. Dietz, P., Leigh, D.: Diamondtouch: a multi-user touch technology. In: Proc. UIST, pp. 219–226. ACM, New York (2001)
4. Guiard, Y.: Asymmetric division of labor in human skilled bimanual action: The kinematic chain as a model. Journal of Motor Behaviour 19, 486–517 (1987)
5. Marquardt, N., Kiemer, J., Greenberg, S.: What caused that touch?: expressive interaction with a surface through fiduciary-tagged gloves. In: Proc. ITS, pp. 139–142. ACM, New York (2010)
6. Walther-Franks, B., Schwarten, L., Teichert, J., Krause, M., Herrlich, M.: User detection for a multi-touch table via proximity sensors. In: IEEE Tabletops and Interactive Surfaces 2008. IEEE Computer Society, Los Alamitos (2008)
Visual Communication in Interactive Multimedia
René Bühling, Michael Wißner, and Elisabeth André
Human Centered Multimedia, Augsburg University, Germany
{buehling,wissner,andre}@informatik.uni-augsburg.de
Abstract. A careful choice of graphical design can strengthen the narrative of graphical projects by aligning the visual statements with the statements made by the content. While many computer science projects lack a consistent implementation of artistic principles, graphic designers tend to neglect user interaction and evaluation. In a recent project we therefore successfully combined both sides. In future work we plan to research the further integration of visual narration into interactive storytelling. Keywords: Virtual Characters, Graphical Design, Cinematography.
1 Motivation
It is a well-known principle of practice for professional visual designers to appreciate the psychological impact and the subtle transport of information through graphical elements. Artists carefully consider the consequences of their decisions about shape, color, and the relations between visual elements in order to emphasize the story they intend to tell through their artwork. Since this awareness of visual impact and expressiveness is an essential idea behind the work of artists, it is described in a wide range of art literature. In particular, authors from fields such as the cartoon animation business, where narration is tightly connected to graphics, stress the need to think carefully about object appearance in order to transport an element's meaning within the told story [1,2]. The application of these concepts is demonstrated by many examples of practice in illustration, movie making, and interactive media. Lively genres like fantasy and adventure lend themselves to noticeable visual adjustments of this kind in particular. While movie makers use this as a tool to subtly adjust the audience's mood to the mood of the characters on screen, other examples such as the famous role-playing game "World of Warcraft" utilize visual narration in an even more obvious and content-related way. An artbook about the making of the game [3] shows how similar environmental objects differ in appearance according to the mood and theme of the location they are placed in. Figure 1 resembles a tree layout used in the game's artistic concept: the trees look corpulent and healthy in peaceful environments, and the graphical design changes towards disturbed and noisy shapes the more negative and depressive the dramaturgical atmosphere gets, up to withered plants in the most evil landscapes. Summarizing this practice basically means that the inner world shapes the outer world, where "inner" refers to any disembodied state such as emotion, mood, dramaturgy, or fate. "Outer" means any physical or expressive manifestation such as visual decoration, shapes of motion, acoustic variations, or any dynamic target value in general.
Fig. 1. Tree shapes related to environment dramaturgy, similar to artwork used in "World of Warcraft". In good-natured environments trees look healthy and bulky. Evil settings are reflected in a less healthy tree appearance.
It is possible to apply this principle to various aspects of graphical and content-related authoring, such as camera, lighting, and staging as known from cinematography, as well as to character design understood as personality design. Although this is common practice in the arts, work presented at conferences like the International Conference on Intelligent Virtual Agents (IVA) suggests that members of other disciplines, such as informatics, often disregard the graphical expressiveness of their work. As a consequence, the characters' overall appearance is inconsistent, from which the user's engagement may suffer. However, our teaching experience shows that it is not just a matter of insufficient artistic talent but in large part simply a lack of awareness of visual impact which keeps people from purposefully using design decisions to improve the user's perception of their work. The appeal of a virtual character strongly increases when designers do not only plan random aesthetics but also construct a personality that is strongly affected by living conditions such as fate, education, experience, or habitat. Everything that influences the life form may also have effects on the artistic design of the virtual character, which in turn may help to transport its spirit and increase the impression of liveliness. For example, the age and life experience of a creature must not only result in gray hair but can also characterize the way the creature moves, reacts, and talks, both aesthetically and in terms of content. When users are confronted with the virtual artwork, the passage of time may also change inner states, which are reflected by outer changes. For example, when the artwork is part of a movie, the storyline proceeds in parallel to the playback time. The acting characters experience the events of the storyline in such a way that their personal fates are affected and changed. Passing years or shocking experiences may, for example, lead a person to age visibly. Dramaturgical elements leave their marks this way and deepen the expressiveness of the whole narration by weaving past events into present and future situations. In our upcoming work we intend to do further research on connecting such visual dramaturgy with interactive storytelling. Basically, the way a movie's proceedings influence its visuals can be applied to interactive applications too. The main difference is that the influence is
no longer linear and defined by time, but branched and bound to user input or any dynamic source in general.
2 Exemplary Application
Within the DynaLearn project we aimed to implement visual suggestion by creating a set of virtual characters that goes beyond being a merely decorative feature. Each character fills a specific role on screen and acts as a contact person for the user to reach program functions such as help, feedback, and model diagnosis. The different functions as well as the different virtual personalities, which build a social dramaturgy, were strengthened by deliberate graphical design decisions to suggest the underlying meaning visually. Our adjustments included various aspects of visual modeling such as shape, colors, and the motion space for gestures and animations. To validate our approach to visual suggestion we conducted three evaluations with 283 international students. We first compared our set of cartoonish hamster characters directly to other virtual character designs (Character Compare), then evaluated use cases of our characters in a software context against similar use cases with characters of other projects (System Compare), and finally asked for a comparison of our agents against each other to test the perception of different roles inside the same graphical set (Role Compare). We asked participants to agree or disagree on a seven-point scale with various impression attributes for each character, such as "[the character seemed to me] open for new experiences", "conventional, uncreative", or "dependable, self-disciplined", and compared the results to a ranking of these values previously given by the designer. Asking for impression rankings in this way gave us a much better idea of how our characters were perceived than simple yes/no questions on subjective popularity, which reveal little about the reasons or the general perception. The participants' impressions indeed matched the designer's intentions very well, confirming the transportation of non-verbal signals. Yet this project applied visual suggestion mainly to physical attributes, so the effects could be deepened in future projects by additionally influencing environmental properties of the stage and scenery. Furthermore, real-time 3D technology should be used to enable adjustments of expressivity in real time, which was not possible with the prerendered images used for the hamster characters.
3 Future Work
We attempt to incorporate the non-linear nature of interactive media, which is actively influenced by users and therefore requires connecting graphical artwork with dynamic processes by means of AI concepts. There is already research on adaptation techniques for bringing methods of cinematography to interactive media [4], dealing with separate aspects such as camera management [5,6,7], lighting [8,9,10], and soundtrack adjustments [11]. Starting from these results, visual dramaturgy methods as inspired by Shim and Kang [12] can be used in interactive applications that deal with non-linear, interactive dramaturgy and emotion in influencing or perceptive ways.
Acknowledgments. Parts of the work presented in this paper are co-funded by the EC within the 7th FP, Project no. 231526; website: www.DynaLearn.eu.
References
1. Eisner, W.: Graphic Storytelling: The Definitive Guide to Composing a Visual Narrative. North Light Books (2001)
2. Bancroft, T.: Creating Characters with Personality: For Film, TV, Animation, Video Games, and Graphic Novels. Watson-Guptill (2006)
3. BradyGames: The Art of World of Warcraft. BradyGames (2005)
4. Rougvie, M., Olivier, P.: Dynamic editing methods for interactively adapting cinematographic styles. In: Adjunct Proceedings of the 5th European Conference on Interactive TV (Doctoral Consortium) (2007)
5. Christie, M., Olivier, P.: Camera control in computer graphics: models, techniques and applications. In: ACM SIGGRAPH ASIA 2009 Courses, pp. 3:1–3:197. ACM, New York (2009)
6. Kennedy, K., Mercer, R.E.: Planning animation cinematography and shot structure to communicate theme and mood. In: Proceedings of the 2nd International Symposium on Smart Graphics, SMARTGRAPH 2002, pp. 1–8. ACM, New York (2002)
7. Kardan, K., Casanova, H.: Virtual cinematography of group scenes using hierarchical lines of actions. In: Proceedings of the 2008 ACM SIGGRAPH Symposium on Video Games, Sandbox 2008, pp. 171–178. ACM, New York (2008)
8. Olivier, P., Ha, H.N., Christie, M.: Smart approaches to lighting design. IT - Information Technology 50, 149–156 (2009)
9. Barzel, R.: Lighting controls for computer cinematography. J. Graph. Tools 2, 1–20 (1997)
10. Zupko, J., El-Nasr, M.S.: System for automated interactive lighting (sail). In: Proceedings of the 4th International Conference on Foundations of Digital Games, FDG 2009, pp. 223–230. ACM, New York (2009)
11. Eladhari, M., Nieuwdorp, R., Fridenfalk, M.: The soundtrack of your mind: mind music - adaptive audio for game characters. In: Proceedings of the 2006 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, ACE 2006. ACM, New York (2006)
12. Shim, H., Kang, B.G.: Cameo - camera, audio and motion with emotion orchestration for immersive cinematography. In: Proceedings of the 2008 International Conference on Advances in Computer Entertainment Technology, ACE 2008, pp. 115–118. ACM, New York (2008)
Communicative Images
Ivan Kopecek and Radek Oslejsek
Faculty of Informatics, Masaryk University, Botanicka 68a, 602 00 Brno, Czech Republic
{kopecek,oslejsek}@fi.muni.cz
Abstract. This paper presents a novel approach to image processing: images are integrated with a dialogue interface that enables them to communicate with the user. This paradigm is supported by exploiting graphical ontologies and using intelligent modules that enable learning from dialogues and knowledge management. The Internet is used for retrieving information about the images as well as for solving more complex tasks in this online environment. Simple examples of the dialogues with the communicative images illustrate the basic idea. Keywords: Ontologies, dialogue systems, images.
1 Introduction

Current technologies enable us to associate many relevant pieces of information with images. For example, the date and time of a snapshot, GPS information, and in some cases recorded sounds are often directly associated with photographs by the camera. Typically, this information is saved in the format of the image and can be exploited for image classification and semantics retrieval, see e.g. [1,2,3,4]. Nevertheless, there is usually much more information which is relevant and interesting for the user but which is not captured or not directly accessible. Let us imagine a photo from a holiday ten years ago: the woman in the middle is my wife, but who is the guy standing behind her? It is apparently somewhere in the Alps, but what place? What is that peak in the background? Such pieces of information are virtually inaccessible. However, classical images contain many pieces of relevant information hidden amid the complexity of their pixel structure, and many additional relevant pieces of information can be retrieved using current information technologies. GPS coordinates applied to electronic maps allow us to determine where the photo was taken. Face recognition [5,6,7] may help reveal who the guy in the picture was. Knowing the orientation of the picture, it might be possible to guess what the peak in the background was. Unfortunately, image recognition is still far from being able to help us fully describe an analyzed picture. If this were so, we would be able to associate plenty of information with the picture, using the knowledge retrieved from the web and large electronic databases. This would result in a picture becoming a complex ontology consisting not only of graphical, but also of other relevant information. Because the volume
This work has been supported by the Ministry of Education, Youth and Sports, Czech Republic, under the research program LC-06008.
of relevant information might be very large, the only way to handle such intelligent images is to enable the user to communicate with them. The goal we have set ourselves is to make it possible for images to communicate with the user. Clearly, we currently face the problem that we are still unable to automatically derive enough information about a given image. However, making images communicative helps us to overcome this obstacle, because we can also exploit the communication for assigning relevant information to the images. In what follows, the term "communicative image" stands for a two-dimensional graphical object (a picture, a photograph, a graph, . . . ) that is integrated with a dialogue interface and equipped with an associated knowledge database. Such an image can communicate with the user, and it can also learn from the communication and enlarge its knowledge database. The following examples illustrate the idea of communicative images.

Example 1 – Retrieving the Information about the Image

Let us assume that the communicative image in this example is based on the photograph mentioned above: the ten-year-old holiday snapshot taken somewhere in the Alps, with your wife, some guy behind her, and a forgotten peak in the background. Assume further that we convert this classical photo into a communicative one by adding the dialogue interface, learning capability, and other necessary abilities. The following example illustrates a possible dialogue between the user and the communicative image.

User: When was the snapshot taken?
Image: July 12, 2001.
User: What place? Where?
Image: The GPS coordinates say it is in Austria, in the Oetztaler Alps, near the chalet Breslauer Hütte. I can provide you with more information about the place, the chalet, the region, etc., if you are interested.
User: No, thank you. Who is the guy behind my wife?
Image: Comparison with other photos in your album suggests it is probably someone called John Smith. I have no more information about him. You can tell me something about him, if you want.
User: Yes, I think he is a colleague of my wife. What is the peak in the background?
Image: The peak is Wildspitze, the second highest mountain in Austria, with an altitude of 3,772 m. I can provide you with more information about the Wildspitze peak.
User: Is it easy to get to the top?
Image: The easiest route requires some experience in mountaineering. Should I give you detailed information?
User: No, thank you.

Example 2 – Assigning Relevant Information to the Image

In this example, we assume that the conversational image was converted from a "classical image" and that it is performing its first communication with the user. This means that the image is provided only with the pieces of information that have been obtained
Fig. 1. Photo of Lake Geneva from its shores
automatically. At the beginning of the dialogue, the image asks the user to confirm the unconfirmed information obtained by graphics recognition methods.

User: What is this place?
Image: At the moment, I can only provide you with unconfirmed information obtained by GPS and graphics recognition. You can help me to confirm this information and provide me with other relevant information. This can substantially enhance my ability to provide you with many pieces of non-visual information. Do you agree?
User: Yes, I agree.
Image: Thank you. Now I will ask you to confirm or reject some assumptions. The image seems to be a landscape photo. Please confirm.
User: Correct.
Image: There is an object located in the left part of the photo which looks like a fire or an exotic animal, but I am not sure. Can you help me?
User: The object is a fire.
Image: Thank you. There is a lake or sea approximately in the right-hand part of the photo. Please confirm.
User: Correct.
Image: Two swans are floating on the lake. In the background, mountain peaks are visible, partially covered by clouds. The fire is located near the lake shore. Some bushes and trees are seen in the left-hand part of the photo. Please confirm.
User: Correct.
Image: Thank you. I now have some important and relevant facts that can help me in further communication with you. Now I can answer your question. The photo shows
the shore of Lake Geneva near Lausanne in Switzerland. The mountains visible in the background are the Alps. More questions?

These two examples correspond to the two basic modes of the dialogue interface. In the first example, the user obtains information about the image; in the second, the user either confirms the automatically obtained pieces of information or directly contributes new ones, which simultaneously enriches the image ontology. In cooperation with the knowledge management module, the dialogue interface has to be able to combine both modes in order to achieve high efficiency in managing the image ontology and in providing suitable dialogue strategies. The crucial issue is how successful this approach will be in involving users in confirming uncertain information and contributing it to the image ontology. It seems realistic to assume that many users will be interested in making their own collections of images communicative, as this greatly increases their intelligibility and attractiveness. Social networks, where many images are shared within the network community, are obvious places for sharing communicative images and spreading the idea and technology. Another field which could benefit from communicative images is e-learning. And for the visually impaired this approach is promising, as communicative images are directly accessible to them. In the following sections we outline a more detailed and more technical description of the approach.
2 Image Semantics

In order to enable images to be communicative, we need to be able to process the pieces of information describing the image semantics. Therefore, our first considerations are devoted to this issue. Image semantics can be described in either a structured or an unstructured way. Unstructured annotation simply assigns semantics to the graphical content in the form of a textual description, a list of keywords, etc. This is adequate for searching for relevant images in collections as well as for getting some approximate information about the graphical content, but it does not provide enough data for more subtle dialogue-based image investigation. Structured annotation supported by ontologies describes the semantics of the image using semantic categories, their properties, and relationships. This prevents any chaos from creeping into the terminology and keeps the semantics consistent. It also classifies the depicted objects by linking them with concrete semantic categories and assigning concrete values to properties. For example, if the ontology defines a car semantic category with a brand property, then the annotation can state that there is a Chrysler in the picture by confirming the presence of a car with the "Chrysler" value assigned to the brand property. Of considerable value is the fact that ontologies standardize the description of terms in different languages, which supports a multilingual approach.
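To make the distinction concrete, the following minimal sketch (Python; the Car category and brand property follow the example above, while the data-structure layout is our own assumption) contrasts an unstructured annotation with a structured, ontology-style one:

from dataclasses import dataclass, field

# Unstructured annotation: free text and keywords only.
unstructured = {"keywords": ["car", "street", "Chrysler"],
                "description": "a Chrysler parked on a street"}

# Structured annotation: an ontology category with typed properties.
@dataclass
class SemanticObject:
    category: str                      # e.g. "Car", defined by the ontology
    properties: dict = field(default_factory=dict)

structured = [SemanticObject(category="Car", properties={"brand": "Chrysler"})]

# A dialogue module can answer "what brand is the car?" only from the structured form.
car = next(o for o in structured if o.category == "Car")
print(car.properties["brand"])  # -> Chrysler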
2.1 Coupling Ontology-Based Semantics with Images

Web Ontology Language, OWL [8], and Scalable Vector Graphics, SVG [9], can be used to provide a technical framework for coupling ontology-based semantics with vector and raster images. Both SVG and OWL are based on XML technology. The SVG format organizes the graphical content into a hierarchical scene graph using several container and graphical elements, e.g. <path>, etc. Although SVG is primarily a vector format, it supports raster images as well. Raster images can be directly embedded, and they can also be inserted as links to external files. Annotation data can be separated from the graphical content within the <metadata> section at the top of the SVG file, as shown in the following fragment:

<svg>
  <metadata>
    ... annotation (semantic elements such as Head and Hair, plus a link to the external OWL file) ...
  </metadata>
  ... graphical content ...
</svg>
The upper part represents the annotation. In this example, the annotation section does not include a complete ontology definition but only a link to the external OWL file. The Head and Hair elements in the fragment represent semantic terms that are defined in the referenced ontology, and their presence in the annotation section implies the presence of these objects in the picture. The occurrence of head and hair together with the absence of other parts of a body in the annotation indicates that we are probably dealing with a portrait. Because the ontology defines a color property in the Hair semantic category, the annotator has assigned a concrete color in order to say that the person in the portrait is brown-haired. The ontology can also store additional knowledge which is not evident from the annotation section of the image. For example, the ontology can contain an aggregate-part relationship between semantic categories specifying that a head usually consists of hair, two eyes, one nose, etc. Then we can automatically infer that the portrait in the previous fragment probably contains two eyes, although this piece of information is not directly included in the annotation section. The SVG-OWL integration can go even further. We can link semantic terms with concrete graphical objects by assigning them the same unique identifier, as illustrated in the following fragment. This approach enables us to map semantic information with
concrete graphical and topological information stored in the scene graph without needing to reorganize the scene graph.

<metadata>
  ... annotation elements sharing unique identifiers with the graphical elements below ...
</metadata>
<ellipse ...head geometry definition... />
<path ...hair geometry definition... />
... graphical content continues here ...
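As a rough sketch of how such id-based coupling could be consumed programmatically, the following Python fragment collects the identifiers used in the metadata annotation and looks up the graphical elements carrying the same ids. The assumption that annotation elements simply carry an id attribute is ours; the paper does not fix the exact markup.

import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def link_annotation_to_geometry(svg_path):
    """Map each id mentioned in the <metadata> annotation to the graphical
    element that carries the same id (annotation element names are assumed)."""
    root = ET.parse(svg_path).getroot()
    metadata = root.find(f"{SVG_NS}metadata")
    annotation_elems = set(metadata.iter()) if metadata is not None else set()

    # ids referenced by the semantic annotation
    semantic_ids = {el.get("id") for el in annotation_elems if el.get("id")}

    # graphical elements outside <metadata> carrying the same ids
    mapping = {}
    for el in root.iter():
        if el not in annotation_elems and el.get("id") in semantic_ids:
            mapping[el.get("id")] = el   # e.g. 'head' -> <ellipse>, 'hair' -> <path>
    return mapping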
To apply this kind of mapping to raster images, it is possible to mark out the geometry of the referenced objects using invisible SVG graphical elements, e.g. transparent polygons, points, rectangles, etc., overlaying the original area of pixels. It should be noted that a dialogue-based investigation of images would not depend on this mapping. Without this mapping, the retrievable information, such as the approximate location of the object in the picture or its relative size, can be handled by the ontology. This enables the user to specify attributes of the object, "is on the left" for instance, without the necessity of marking out a concrete geometry.

2.2 Granularity and Accuracy of Semantic Information

Ontologies can be structured on various levels of detail and granularity. For instance, let us consider a photo of an airplane. Let us suppose that the supporting ontology defines the most general semantic category Object with property description and a sub-category Airplane with properties type and airlines. The picture can be annotated in several ways. We can state that in the picture there is
– an Object with description "Boeing 747 of Korean airlines that carried us to Seoul",
– an Airplane with type set to "Boeing 747" and description "Airplane of Korean airlines that carried us to Seoul",
– an Airplane with type set to "Boeing 747", airlines set to "Korean" and description "The airplane that carried us to Seoul",
– ...
The granularity of the supporting ontology and the level of detail of the annotation significantly affect the dialogue strategy. The more information that is hidden in the free text, the more difficult it is to generate dialogues, and vice versa.
3 Communication with Images

During the clarification of the concept of communicative images we followed several rules and principles that should be satisfied during the user-image interaction (Fig. 2):
– Communication tools should be user-friendly and easily accessible. Nowadays, most pictures can be found on the Internet. We therefore focused on web technologies and direct interaction with images on web pages by means of web browsers.
– The communication must be applicable to all images.
– The knowledge should be sharable, i.e. the semantics laboriously assigned to one image should be reusable for similar images to make their annotation simpler and more straightforward.
Fig. 2.
3.1 Starting the Communication

To allow interactive communication with images from standard web pages, it is necessary to launch code which is capable of processing image semantics and interacting with the user. If the investigated image is an annotated SVG picture, then the code can be embedded in the SVG format, since SVG supports the interpretation of ECMAScript/JavaScript. Unfortunately, this situation only arises when the user is browsing a web archive of annotated pictures. To investigate common raster images, i.e. JPG, PNG, GIF, etc., the communication code must be independent of the concrete pictures. Plug-ins for web browsers offer a reasonable option, as they can handle the initial interaction; e.g. double-clicking on the picture then starts the communication task. The dialogue strategy of the installed plug-in can support various types of interaction. This is important because various devices and users may prefer different communication channels. For example, on a mobile device with a small display users tend to prefer stick/finger input and spoken output to writing questions and displaying textual descriptions. On the other hand, users working on PCs may find it more comfortable to interact in natural language by writing full-sentence questions and reading the answers. Visually impaired people prefer written input and synthesized output to mouse-based
interaction. The plug-in-based approach enables the user to choose the strategy that suits their current environment. An image investigation strategy in natural language is discussed in [10]. This approach is based on What-Where Language, WWL, a simple fragment of English that supports questions in the form WHAT is WHERE or WHERE is WHAT, e.g. "What is in the top left-hand corner?", "Where is the dog?", etc. Sound-based feedback during picture investigation is discussed in [11,12,13].

3.2 Transforming an Image to Be Communicative

After the user initiates the process by, for example, clicking on the image, the communication plug-in automatically asks a remote server to convert the image into a communicative format which then drives the dialogue. During the transformation of the image into its communicative form, the server performs two steps. First it wraps the image in SVG, as discussed above. Then it tries to acquire as much information about the image as possible using image archives and ontologies. The initial semantics thus obtained are inserted into the SVG picture and sent back to the client. The communication plug-in exploits these unconfirmed semantics during the first introductory dialogue step and enables the user to repair or supplement them. The automatic gathering of initial semantics relies on intelligent auto-detection and image recognition techniques on the server side. Nowadays, many photos include extra data fields exploitable for this task, EXIF information with the creation date and GPS location being one example. Domain-specific techniques for image classification and recognition, such as face recognition and object detection, can be applied to images from relevant domains. Another approach is based on similarity searches. There exist effective algorithms and search engines supporting image classification and similarity searching in large databases, see e.g. [14,15,16,17,18]. Although the pictures in such databases tend not to contain ontology-based semantics directly exploitable for communicative images, they often contain at least simple annotation data in the form of keywords or brief textual descriptions that can be analyzed and mapped onto ontology-based semantics. And because these databases are typically very large, the probability of finding relevant data is high. The knowledge about a picture can grow and become more accurate as the user interacts with the picture. To make this data reusable in the future, in situations such as closing the web browser and reopening the web page later on, the communicative image must remain accessible in some way. The plug-in therefore enables the user to store the annotated SVG picture on a local disk. In this case, the original image is directly embedded in the SVG to make the communicative picture independent of the location of the image. The annotated SVG file is also automatically stored on the server. This is important because if the user re-opens the web page and wants to communicate with the image, it is reasonable to start with the semantics gathered during the previous sessions. However, to respect copyright laws, the original image cannot be stored directly on a central server. Instead, the copy of the communicative SVG file simply refers to the original image via its URL. Thus, the server is still able to recognize whether the picture is known from past interactions or not, and if so, to provide the user with its gathered and inferred semantics.
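A minimal sketch of the wrapping step, assuming the raster image is referenced by URL rather than embedded and using our own element layout (the paper does not prescribe one), could look as follows; the file name and URL are purely hypothetical placeholders.

import xml.etree.ElementTree as ET

def wrap_image_in_svg(image_url, width, height, out_path):
    """Wrap a raster image into an SVG shell with an (initially empty) metadata
    section that later holds the gathered semantics. The layout is an assumption."""
    ET.register_namespace("", "http://www.w3.org/2000/svg")
    ET.register_namespace("xlink", "http://www.w3.org/1999/xlink")
    svg = ET.Element("{http://www.w3.org/2000/svg}svg",
                     {"width": str(width), "height": str(height)})
    ET.SubElement(svg, "{http://www.w3.org/2000/svg}metadata")  # semantics go here
    ET.SubElement(svg, "{http://www.w3.org/2000/svg}image",
                  {"{http://www.w3.org/1999/xlink}href": image_url,
                   "width": str(width), "height": str(height)})
    ET.ElementTree(svg).write(out_path, xml_declaration=True, encoding="utf-8")

wrap_image_in_svg("http://example.org/holiday.jpg", 800, 600, "holiday.svg")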
3.3 Sharing and Extending the Knowledge Base

The overall concept of communicative pictures requires dynamic changes and the progressive expansion of the knowledge base that the ontologies represent. For example, a user communicating with a picture whose content is not yet covered by the existing knowledge base has to be able to extend the ontology with new terms and facts. On the other hand, the knowledge base is shared by many users, and thus changes made by one user impact many others. The problem is that ontologies do not restrict the level of abstraction used to describe real-world objects. For instance, in a domain of animals, different ontologies can describe animals from different points of view, such as anatomy, biotopes, character, appearance, and markings. All of these views are correct, and they differ only in the context of their application. However, these views may not be equally useful in the context of communicative images. Communicative images must therefore be supported by carefully designed ontologies that prevent wild ad hoc changes. The graphical ontology [19] is one such ontology. This OWL ontology prescribes important global visual characteristics and thus guides users to create abstractions suitable for a dialogue-based investigation of graphical content. However, the graphical ontology does not handle the non-visual information which is required to understand the meaning of the depicted objects.
Fig. 3. Example of UML use case diagram
Let us consider e-learning study materials for software engineering, for instance. The UML diagram in Fig. 3 consists of several elements, e.g. actors, use cases, etc. If this picture is supported by the graphical ontology, the answer to the question "what is a use case?" might look like "a use case is a named oval which can be linked with other use cases and actors by four types of relationships". This answer is unlikely to satisfy the user because it does not contain any information about the meaning of use cases, the rules for their creation, methods of documentation, etc. This is because the graphical ontology is designed to restrict the abstraction to valuable visual aspects. Additional information related to object meaning must be supported by domain-specific ontologies, and the communication module has to mix several ontologies to provide efficient communication.
3.4 Scalability of Knowledge

The knowledge base represented by graphical and domain-specific ontologies may amount to a huge volume of information. It is reasonable to embed part of this knowledge in the annotated picture. Each piece of knowledge embedded into the communicative picture increases its size as well as its independence from a shared knowledge base. The minimal semantic information required by any communicative image consists of a list of semantic categories (objects in the picture) and the values assigned to prescribed properties, as shown in the first fragment of SVG in Section 2.1. The ontologies responsible for the interpretation of this semantic information can be completely moved to a server where they operate as a global shared knowledge base. In this case, the communication module has only minimal direct information available. If necessary, the module can connect to the server and ask for help. On the other hand, communicative pictures can have the complete knowledge base embedded inside them, i.e. instead of referring to an external ontology, the ontology could be directly included in the SVG. Although the communication module then has all the information required for building dialogues, the picture contains a huge amount of information, which would make for a very large file. Moreover, most of the information might never be used because it belongs to domains that are remote from that of the picture. Between these extreme cases, there is a wide range of other possibilities. In one scenario, for example, only the pieces of global knowledge that are relevant to the visual characteristics of the objects are embedded in the SVG. The rest of the semantics, mainly domain-specific semantics, can remain on the server, available on demand. In this case, the communication module can discuss basic visual aspects without connecting to the server until the user asks for the meaning of objects.

3.5 Conclusions and Future Work

Communicative images represent a new and challenging approach utilizing and integrating current technologies, especially graphical ontologies and AI-based modules. The goal is to provide users with a new dimension of exploiting images by enabling them to communicate with the images. In addition, this approach promises important applications in e-learning and in assistive technologies, making images accessible to visually impaired people. In this paper, we have outlined a basic framework for solving the related problems. Our next work is aimed at enhancing dialogue strategies, inference methods, and ontology management methods, and at testing the technologies in real online and offline environments.
References
1. Sandnes, F.: Where was that photo taken? Deriving geographical information from image collections based on temporal exposure attributes. Multimedia Systems, 309–318 (2010)
2. Boutell, M., Luo, J.: Photo classification by integrating image content and camera metadata. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 4, pp. 901–904 (2004)
3. Yuan, J., Luo, J., Wu, Y.: Mining Compositional Features From GPS and Visual Cues for Event Recognition in Photo Collections. IEEE Trans. on Multimedia 7, 705–716 (2010)
4. Ráček, J., Ludík, T.: Development of ontology for support of crisis management processes. In: Informační technologie pro praxi 2008, pp. 106–111. Technical University of Ostrava (2008)
5. Bartlett, M., Movellan, J., Sejnowski, T.: Face recognition by independent component analysis. IEEE Transactions on Neural Networks 6, 1450–1464 (2002)
6. Haddadnia, J., Ahmadi, M.: N-feature neural network human face recognition. Image and Vision Computing 12, 1071–1082 (2004)
7. Rowley, H., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 1, 23–38 (1998)
8. Lacy, L.W.: Owl: Representing Information Using the Web Ontology Language. Trafford Publishing (January 2005)
9. Eisenberg, J.: SVG Essentials. O'Reilly Media, Inc., Sebastopol (2002)
10. Kopeček, I., Ošlejšek, R.: GATE to accessibility of computer graphics. In: Miesenberger, K., Klaus, J., Zagler, W.L., Karshmer, A.I. (eds.) ICCHP 2008. LNCS, vol. 5105, pp. 295–302. Springer, Heidelberg (2008)
11. Daunys, G., Lauruska, V.: Maps sonification system using digitiser for visually impaired children. In: International Conference on Computers Helping People with Special Needs, pp. 12–15. Springer, Berlin (2006)
12. Kopeček, I., Ošlejšek, R.: Hybrid approach to sonification of color images. In: The 2008 Int. Conf. on Convergence and Hybrid Information Technologies, pp. 722–727. IEEE Computer Society, Los Alamitos (2008)
13. Mathis, R.M.: Constraint scalable vector graphics, accessibility and the semantic web. In: Southeast Con. Proceedings, pp. 588–593. IEEE Computer Society, Los Alamitos (2005)
14. Batko, M., Dohnal, V., Novak, D., Sedmidubsky, J.: MUFIN: A Multi-Feature Indexing Network. In: Skopal, T., Zezula, P. (eds.) SISAP 2009: Second Int. Workshop on Similarity Search and Applications, pp. 158–159. IEEE Computer Society, Los Alamitos (2009)
15. Jaffe, A., Naaman, M., Tassa, T., Davis, M.: Generating summaries and visualization for large collections of geo-referenced photographs. In: Proceedings of the 8th ACM Internat. Workshop on Multimedia Information Retrieval, pp. 89–98. ACM, New York (2006)
16. Abbasi, R., Chernov, S., Nejdl, W., Paiu, R., Staab, S.: Exploiting Flickr Tags and Groups for Finding Landmark Photos. In: Boughanem, M., Berrut, C., Mothe, J., Soule-Dupuy, C. (eds.) ICTIR 2009. LNCS, vol. 5766, pp. 654–661. Springer, Heidelberg (2009)
17. Muller, H., Michoux, N., Bandon, D., Geissbuhler, A.: A review of content-based image retrieval systems in medical applications - clinical benefits and future directions. International Journal of Medical Informatics 1, 1–23 (2004)
18. Bohm, C., Berchtold, S., Keim, D.: Searching in high-dimensional spaces - Index structures for improving the performance of multimedia Databases. ACM Computing Surveys 3, 322–373 (2001)
19. Ošlejšek, R.: Annotation of pictures by means of graphical ontologies. In: Proc. Int. Conf. on Internet Computing ICOMP 2009, pp. 296–300. CSREA Press (2009)
A Survey on Factors Influencing the Use of News Graphics in Iranian Online Media
Maryam Salimi and Amir Masoud Amir Mazaheri
1 Islamic Azad University, Journalism and Graphics, Tehran, Iran, [email protected]
2 Islamic Azad University, Sociology, Tehran, Iran, am [email protected]
Abstract. A news graphic is a kind of infographic which reports the news visually. The difference between news graphics and infographics lies in the content and the speed with which they are presented. Although this kind of graphic is frequently used in media around the world, its use is limited in Iran. The present article studies the factors influencing the use of news graphics in Iranian media by means of descriptive methods (interviews and analysis). It identifies five deterrent factors: the high cost of producing news graphics, media managers' low familiarity with news graphics, the limited experience and competence of Iranian graphic designers in this field, technical and communicational limitations, and the difficulty of producing and supporting Persian graphical software due to the lack of professional groups creating such software. Keywords: Infographic (Information Graphic), News Graphic (Infographic News), breaking news graphic, Online Media & Visual Journalism.
1 Introduction
Infographics [1] and news graphics [2] are some of the tools that the media use to present large amounts of news and information to the audience in a simple, understandable, comprehensible, believable, and visually attractive way, as fast as possible. These types of graphics are capable of motion and interaction in digital and online media and give readers more choice to interact with a graphic and access the news and information hidden in it. The use of this type of graphics, and especially of news graphics, is very limited
A Survey on Factors Influencing
175
in the Iranian media and particularly in the online ones. To study the reasons for this limitation, five major questions are designed and summarized as follows: Do factors such as the cost of preparation and production of news graphics, the editors’ familiarity with this kind of graphic, the experience and mastery of graphic designers over the news graphics, the availability of Persian graphic software to produce news graphics, and the availability of technical and communicational infrastructures influence the use of this kind of graphic in the Iranian online media? Attempts have been made to use descriptive methods (survey and interview) along with library research to answer all these questions. In the survey method, the respondents consisted of editors and graphic designers of seven news agencies, and twenty-seven Persian newspapers. The news agencies were chosen by the numbering and the systematic sampling method, (based on the top ten newspapers ranked on the Alexa website in three visits on three different dates. Therefore, the respondents included thirty-four graphic designers and editors working in ten online newspapers, and seven news agencies. Questionnaire was used to collect the data, and the SPSS software to extract and analyze the collected data. The descriptive method was also used. In the current method, one hundred of the respondents were active managers, specialists, and news and infographic designers in the world’s giant news agencies, and also some news and infographic designers in Iran. The statistical population also consisted of forty respondents. They were all selected by the available sampling method.
2
Requirements for Preparing and Producing News Graphics and Infographics
Professional manpower: Some of the specialists and practitioners in this area, such as Luis Chumpitaz (editor of the news graphics and infographics department at the Arab Media Group), describe the composition of such a team as follows: specialists, technicians, feeders, one-man bands, driving forces, and builders (Chumpitaz, 2009, 9–10). Iran has only a limited number of designers competent in news graphics and infographics. According to the 34 chief editors and designers working in 7 news agencies and 10 online newspapers who were given questionnaires, Iranian graphic designers' familiarity with news graphics is below average: 20 respondents (58.8 percent of the total) chose average and 12 chose low. Based on this assessment, the main reasons for Iranian designers' limited competence are the lack of competitive motivation to gain experience and expertise, and the lack of university workshops related to news graphics.

Software: In most media around the world, news graphics and infographics are produced with software such as Photoshop, Adobe Illustrator, Flash, Flex, Adobe AIR, and 3ds Max, together with technologies such as PHP and MySQL. Iranian graphic designers, however, work mainly with Photoshop, CorelDRAW, InDesign, and 3ds Max. One reason Iranians do not use the full range of programs common elsewhere is that these programs are not compatible with the Persian language. According to the respondents, the main
obstacle to the production and technical support of Persian graphics software (news graphics software in particular) is the lack of an integrated, professional software team able to produce powerful graphics software in Iran; 14 respondents (41.2 percent of the total) agreed on this.

Required capital with regard to costs: Producing news graphics in different media requires sufficient capital, and the amount varies across media. According to the interviewees, the cost of preparing and producing a news graphic depends on the kind of medium, the size of its distribution (local, regional, national), the style of the graphic, its complexity, and the time spent on it. Based on the survey results, another problem facing news graphics and infographics services is the rather high cost of manpower, equipment, and facilities, in addition to the cost of preparing and producing the graphics themselves. According to the respondents, the financial cost of preparing, producing, and using news graphics and infographics in Iran is above average: most of them chose average (between $200 and $1,500), with a frequency of 14 and a relative frequency of 41.2, and 10 of them (29.4 percent of the total) chose high (between $1,500 and $10,000).

Technical and communicational infrastructures: According to a report by the monitoring company Akamai for the last quarter of 2009, the average bandwidth across the world is 1.7 Mbps, while Internet speed in Iran is 512 Kbps for companies and 128 Kbps for home users. The minimum standard speed for a dial-up connection is considered to be 56 Kbps, but Iranian users currently connect to the Internet at speeds ranging from 30 to 46 Kbps. One of the main reasons for the low bandwidth is its cost: "The cost of bandwidth in Iran is ten times greater than the world average." According to the interviewees, most media decide whether to produce online news graphics and infographics with regard to their bandwidth, and the emphasis has been put on a speed of 1 Mbps or more; to watch and receive interactive online news graphics without interruption, a speed of more than 512 Kbps is needed (a rough download-time calculation at the end of this section illustrates the gap). According to the survey results, limited bandwidth can be considered one of the main technical and communicational problems in preparing and producing news graphics and infographics. According to the respondents, the main infrastructural problem for Iranian news agencies and online newspapers in producing and using news graphics is low bandwidth, which is incompatible with the large files of news graphic works; 28 interviewees (82.4 percent of the total) agreed on this.

Managerial determination: Managerial determination is one of the most important factors in establishing a service or group to produce news graphics and infographics. The requirements include the media managers' familiarity with news graphics and infographics and their capabilities, investing funds, employing the necessary workers, and dedicating enough space and the necessary facilities for this purpose. According to the survey results, the managers' familiarity with this kind of graphic is limited and has itself become a deterrent factor in developing news graphics. Managers' familiarity with news graphics and infographics is low (with a frequency of 23 and a relative frequency of 67.6), and the main reason for this is the managers' holistic views, their concentration on media policies and written content, and their disregard for new visual effects (with a frequency of 19 and a relative frequency of 55.9).
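To make the bandwidth figures above concrete, the back-of-the-envelope sketch referred to earlier estimates download times for a hypothetical 1 MB interactive news graphic at the connection speeds cited. The file size is an assumption chosen only for illustration and does not come from the survey.

```python
# Back-of-the-envelope download times for a hypothetical 1 MB (8 Mbit)
# interactive news graphic at the connection speeds cited above.
# The 1 MB file size is an illustrative assumption, not a survey figure.

FILE_MBIT = 8.0  # 1 megabyte expressed in megabits

speeds_kbps = {
    "world average (1.7 Mbps)": 1700,
    "needed for interactive graphics (512 Kbps)": 512,
    "Iranian home connection (128 Kbps)": 128,
    "typical Iranian dial-up (about 40 Kbps)": 40,
}

for label, kbps in speeds_kbps.items():
    seconds = FILE_MBIT * 1000 / kbps  # megabits -> kilobits, divided by Kbps
    print(f"{label}: ~{seconds:.0f} s")

# Prints roughly 5 s, 16 s, 62 s, and 200 s respectively, which is why
# respondents see low bandwidth as a deterrent for large news graphic files.
```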
3
Conclusion
According to what has been stated so far, although the news graphic, as a kind of infographic capable of describing the news visually, is widely used in media around the world, it is hardly used in the Iranian media, and especially in the online media. Five general questions were asked: do factors such as the cost of preparing and producing news graphics, the editors' familiarity with this kind of graphic, graphic designers' experience and expertise with news graphics, the availability of Persian graphics software to create news graphics, and the availability of technical and telecommunication infrastructures influence the use of this kind of graphic in Iranian online media? The answers obtained with descriptive methods (survey and interview) show that all of these factors are influential. The deterrent factors are thus: the relatively high cost of preparing and producing news graphics (including designers' pay, facilities, and opportunities); media managers' low familiarity with this kind of graphic (due to their holistic view and focus on media policies and written content); the limited competence and experience of Iranian graphic designers in the field of news graphics (due to the lack of competitive motivation to gain experience); technical and communicational limitations, especially bandwidth restrictions, that hinder the audience's access to news graphics in online media; and the limited production and support of Persian graphics software (due to the lack of a professional group to produce such software). The managerial factors were emphasized more than the others, because the most important factor in adopting news graphics and infographics is the existence of a determined manager for their production. According to the 34 interviewees, the main factor for improving news graphics in Iran is managerial determination to produce them (frequency 21, relative frequency 61.8). Removing the obstacles to the use of news graphics meets the needs of the audience by providing simple, quick, and objective access to the news with more power of selection, and it also gives the media greater competitive power. Based on the survey results, the influence of news graphics in online media on readers' satisfaction is either great (frequency 17, relative frequency 50) or very great (frequency 13, relative frequency 38.2).
4
Suggestions
The following are some solutions to the problems identified above:
1. Managers' familiarity with infographics can be increased by holding seminars and workshops for media managers, organized by the Center for Media Studies (Ministry of Culture) and other organizations and associations active in the print media, and by graphic designers' own efforts to familiarize managers with these kinds of graphics.
2. The competence and experience of news graphic designers can be improved by encouraging designers to learn more about this field, by opening the way for them to enter competitive production of this kind of graphic and gain the necessary experience, and by convincing managers to establish news graphic and infographic departments that attract graphic designers to this kind of work.
3. The cost of producing news graphics can be decreased by producing and selling news graphic samples to news agencies and newspapers that are financially supported by governmental centers and political parties, by convincing the media and private sectors to invest in and participate financially, and by using governmental support in the form of facilities.
4. The software problems can be addressed by establishing software teams to produce dedicated Persian news graphic and infographic software with governmental aid or investment from foreign or private sectors, by respecting copyright, and by supplying the domestic and foreign markets (such as Afghanistan and Tajikistan) with these products.
5. Technical and communicational limitations can be removed through governmental support for the Ministry of Communications and Information Technology, the issuing of licenses to broadband Internet providers in the private sector, and competition among them to offer cheaper services.
References
1. Cairo, A.: Sailing to the Future: Infographics in the Internet Era (1.0), p. 38. University of North Carolina, Chapel Hill (2005)
2. Cairo, A.: The future is now. Design Journal 97, 16 (2005)
3. Chumpitaz, L.: Information Graphic (for Infographics World Summit 17), pp. 9–10. SNDE, Spain (2009)
4. Rajamanickam, V.: Infographics Seminar Handout, p. 9. Industrial Design Center, Indian Institute of Technology Bombay (2005)
Palliating Visual Artifacts through Audio Rendering

Hui Ding and Christian Jacquemin

LIMSI–CNRS, University of Paris-Sud, France
{hui.ding,christian.jacquemin}@limsi.fr
Abstract. In this paper, we present a pipeline for combining graphical rendering through an impostor-based level-of-detail (LOD) technique with audio rendering of an environmental sound at different LODs. Two experiments were designed to investigate how the parameters used to control the impostors and an additional audio modality impact the visual detection of the artifacts produced by the impostor-based LOD rendering technique. The results show that, in general, simple stereo sound hardly impacts the perception of image artifacts such as graphical discontinuities.
1
Introduction
Cross-modal dependency deals with the role of sensory perceptions (e.g., vision and audition) and their combinations in generating and improving the multimodal perception of virtual environments. Major work on user perception for interactive applications has addressed spatialized sound rendering [2][5], auditory facilitation of the visual modality [4][3], and audio-visual rendering driven by cross-modal perception [1]. These works have confirmed that a multimodal virtual environment able to synchronize the audio and visual renderings efficiently can provide users with an enhanced visual perception of the virtual environment. In this paper, we are also interested in evaluating to what extent the audio modality can compensate for imperfections in the visual modality. Our purpose is to consider these issues from another point of view, i.e., to investigate whether the audio modality can impact the perception of visual artifacts caused by the perspective distortion of the 2D images onto which the 3D geometry is projected.
2
Implementation of Audio-Graphical Rendering
In our graphical implementation, we design a real-time impostor-based method with five LODs for representing a tree in the clipping volume (frustum). LOD1 is detailed geometric rendering; LOD2 to LOD5 are impostor-based renderings. The LOD is selected using distance as the selection factor. Once the tree reaches an area corresponding to a different value of the LOD selection factor, we render the tree in detail and take snapshots of different sub-parts of the rendered tree according to the selected LOD. The snapshots (impostors) are then used to construct the whole tree during the following steps/frames as long as the tree remains within the same range of the LOD selection factor. For LOD2, the branches at the terminal level of the tree are rendered through impostors instead of detailed geometry; for LOD3, the branches of the last two levels of the tree are rendered through impostors (see Fig. 1), and so on.

Fig. 1. Left: snapshot at LOD2; right: snapshot at LOD3

As expected, the impostors produce visual artifacts, mostly graphical discontinuities in our case. The perception of this kind of visual artifact depends on human vision capacities. In our case, only a recorded sound and its impact on the perceived image quality are considered as the audio rendering to be combined with the graphical rendering. Sound is therefore rendered in a non-realistic manner, through stereo with a level defined by the distance between the 3D graphical object and the viewer. The static 3D graphical tree is accompanied by a continuous stereo sound simulating a tree in the wind, and the stereo sound volume changes according to the user's motion in the scene.
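The implementation above is described only in prose; the following minimal Python sketch illustrates the kind of distance-driven LOD selection and distance-attenuated stereo level it refers to. The distance thresholds, the 1/d attenuation law, and all function names are illustrative assumptions, not values or code from the authors' system.

```python
# Minimal sketch of distance-driven LOD selection and distance-based
# stereo gain, in the spirit of the pipeline described above.
# The thresholds and the attenuation law below are illustrative
# assumptions, not values taken from the paper.

LOD_DISTANCES = [10.0, 25.0, 50.0, 100.0]  # hypothetical frustum-space distances

def select_lod(distance: float) -> int:
    """Return 1 (full geometry) to 5 (coarsest impostor) from viewer distance."""
    for lod, threshold in enumerate(LOD_DISTANCES, start=1):
        if distance < threshold:
            return lod
    return 5

def stereo_gain(distance: float, pan: float, ref_distance: float = 10.0) -> tuple:
    """Left/right channel gains: level falls off with distance, pan in [-1, 1]."""
    level = min(1.0, ref_distance / max(distance, 1e-6))  # simple 1/d attenuation
    left = level * (1.0 - pan) / 2.0
    right = level * (1.0 + pan) / 2.0
    return left, right

# Example: a tree 30 units away, slightly to the right of the viewer.
print(select_lod(30.0))        # -> 3 (impostors for the last two branch levels)
print(stereo_gain(30.0, 0.3))  # -> (approx. 0.12, 0.22)
```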
3
Experiment Design
The discontinuity artifacts obviously depend on the parameters that control the impostor rendering, such as distance and angle of view. We designed two experiments: the first illustrates how these parameters impact the perception of visual artifacts, and the second tests whether an additional audio modality impacts the perception of visual artifacts. Since the purpose is the perception of artifacts in the rendered image, the experiments are designed as tests of perceived image quality; thus, the subjects are required to observe the rendered images instead of navigating blindly in the scene. Moreover, since we purposely present only static scenes, the experiment with the additional audio modality is also performed by observing the same rendered images, but combined with stereo sound. Twenty subjects participated in the experiment, and each was presented with two series of snapshots captured from renderings of the same 3D audio-visual tree scene at different LODs, varying from LOD2 to LOD5. The snapshots were captured at ten different angles of view for every LOD. Each LOD snapshot has a reference twin tree snapshot with full geometry, i.e., the same view of the tree rendered in full detail. The pairs of snapshots with and without LOD are selected randomly. In the first series of snapshots, we test if and when the subjects can distinguish the impostor rendering from the full-detail rendering by perceiving visual artifacts. The second series of snapshots is the same as the first but is accompanied by
a stereo sound simulation whose volume varies with the distance to the viewer. After observing each snapshot, the subjects had to score their level of perception of visual artifacts from 1 to 5, where '1' refers to the highest level (obvious visual artifacts) and '5' to the lowest (no perceived visual artifacts).
4
Statistical Analysis of Experiment
We applied statistical methods to the score data to analyze two similarities: first, between impostor-based rendering and full-detail rendering, and second, between impostor-based rendering with sound and without sound.
4.1
Analysis for Perception of Visual Artifact in One Modality
We performed an analysis of variance (ANOVA) for every view angle on the similarity between the impostor rendering (LOD, with impostors) and the reference rendering (noLOD, without impostors), treated as within-subject factors. When the p-value of the ANOVA approaches 1, there is a great similarity between impostor-based rendering (LOD) and full-detail rendering (noLOD). The ANOVA yields a table of p-values for every angle of view of the impostors. To summarize, the results show that:
1. For LOD2, p > 0.9 when the angle of view is smaller than approx. 7°;
2. For LOD3, p > 0.9 when the angle of view is smaller than approx. 2°;
3. For LOD4, p > 0.9 when the angle of view is smaller than approx. 5°;
4. For LOD5, p = 1.0 for all angles of view.
We conclude that subjects cannot notice the difference between LOD and noLOD for LOD5, and that for LOD2, LOD3, and LOD4, subjects observe the visual artifacts when the angle of view exceeds thresholds of 7°, 2°, and 5°, respectively. We deduce that below certain thresholds of angle of view and distance, the discontinuity artifacts cannot be noticed by human vision. Moreover, these two parameters can be manipulated to control LOD selection in real-time rendering.
4.2
Analysis for Perception of Visual Artifact with Audio-Graphical Effect
Here, we performed a single analysis of variance on the similarity between all impostors without sound and with sound, treated as within-subject factors. The result of this ANOVA is a p-value of 0.76172 over all data, which means that in general subjects cannot perceive the difference between the snapshots of impostors with sound and without sound. In other words, the stereo sound simulation does not noticeably impact the perception of visual artifacts. We provide a supplementary figure showing the average scores (see Fig. 2).

Fig. 2. The average scores for LOD2 to LOD5 with and without sound

In this figure, the average scores are slightly lower with sound than without sound for LOD2 (Dist1) and LOD3 (Dist2), and slightly higher with sound than without sound for LOD5 (Dist4). Combined with the first analysis of LOD2 and LOD3 in the previous section, this suggests that an additional stereo sound slightly aggravates the perception of visual artifacts when they are noticeable, whereas for LOD5, where users do not notice the impostor artifacts at all, the sound simulation slightly improves the perceived visual quality.
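As a rough illustration of how such a comparison could be reproduced, the snippet below runs a one-way ANOVA on artifact scores with SciPy. The score arrays are invented placeholders, and the sketch ignores the within-subject (repeated-measures) structure of the analyses reported above, so it only approximates the procedure.

```python
# Minimal sketch of a one-way ANOVA comparing artifact scores for an
# impostor (LOD) rendering against its full-detail (noLOD) reference,
# in the spirit of the per-angle analysis above. The score arrays are
# invented placeholders, not the study's data.
from scipy.stats import f_oneway

lod_scores   = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]   # 1 = obvious artifacts, 5 = none perceived
nolod_scores = [5, 5, 4, 5, 5, 4, 5, 5, 4, 5]

stat, p_value = f_oneway(lod_scores, nolod_scores)
print(f"F = {stat:.3f}, p = {p_value:.3f}")

# Following the interpretation used above: a p-value close to 1 would mean
# the two conditions were rated very similarly, i.e., the impostor
# artifacts went largely unnoticed at this angle of view.
```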
5
Conclusion and Future Work
This paper has investigated the impact of the audio modality on the visual perception of artifacts. The experiments show that, in general, simple stereo sound hardly impacts the perception of image discontinuity artifacts. However, there is a tendency for sound to enhance visual perception when no image artifact is perceived and, on the other hand, to slightly aggravate the sense of defects when image artifacts have been perceived. The restriction to a single type of discontinuity artifact might be one reason why sound does not help reduce the visual perception of artifacts. In the future, we will consider different types of artifacts and integrate graphical rendering with realistic spatialized sound.

Acknowledgements. The work presented here is partially funded by the Agence Nationale de la Recherche within the project Topophonie, ANR-09-CORD-022. We thank the project partners for the many fruitful discussions.
References
1. Bonneel, N., Suied, C., Viaud-Delmon, I., Drettakis, G.: Bimodal perception of audio-visual material properties for virtual environments. ACM Trans. Appl. Percept. 7, 1:1–1:6 (2010)
2. Funkhouser, T.A., Min, P., Carlbom, I.: Real-time acoustic modeling for distributed virtual environments. In: Proceedings of SIGGRAPH 1999, pp. 365–374 (August 1999)
3. Moeck, T., Bonneel, N., Tsingos, N., Drettakis, G., Viaud-Delmon, I., Aloza, D.: Progressive perceptual audio rendering of complex scenes. In: ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (April 2007)
4. Tsingos, N., Gallo, E., Drettakis, G.: Perceptual audio rendering of complex virtual environments. In: ACM SIGGRAPH 2004 Papers, pp. 249–258. ACM, New York (2004)
5. Tsingos, N., Funkhouser, T., Ngan, A., Carlbom, I.: Modeling acoustics in virtual environments using the uniform theory of diffraction. In: Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pp. 545–552 (2001)
A Phong-Based Concept for 3D-Audio Generation

Julia Fröhlich and Ipke Wachsmuth

Bielefeld University, Artificial Intelligence Group
{jfroehli,ipke}@techfak.uni-bielefeld.de
Abstract. Intelligent virtual objects gain more and more significance in the development of virtual worlds. Although this concept has high potential for generating all kinds of multimodal output, so far it is mostly used to enrich graphical properties. This paper proposes a framework in which objects enriched with information about their sound properties are processed to generate virtual sound sources. To create a sufficient surround-sound experience, not only single sounds but also environmental properties have to be considered. We introduce a concept that transfers features from the Phong lighting model to sound rendering.

Keywords: Artificial Intelligence, Virtual Reality, Intelligent Virtual Environments, Multimodal Information Presentation.
1
Introduction
In order to improve user experience and immersion within virtual environments, the auditory experience has long been claimed to be of notable importance [1]. Still today, current virtual reality projects have a strong focus on realistic graphics rendering and user experience, but acoustics is rarely considered. One approach to storing further information in virtual worlds is to semantically enrich virtual objects. This concept has proven to be a good and efficient way to create smart objects [2], but until now it has mostly been used to store additional knowledge about the graphical representation. As an example, intelligent objects were used to enable smart connections and parametric modifications [3]. In order to generate realistic spatial sound, many factors have to be considered, including the position in the virtual space and the distance to the user, as well as an appropriate sound file [4]. Defining all these factors for every virtual object is a complex and time-consuming task. Moreover, to achieve a realistic auditory experience it is not sufficient to generate multiple independent sound sources; it is also necessary to create a virtual world in which objects that interact with each other acoustically influence the environment.
2
Using Intelligent Virtual Objects for Audio Generation
Our framework enables the assignment of semantic information regarding audio properties to virtual objects using so-called metadata. Figure 1 shows the semantic enrichment of such objects by assigning descriptive values. The metadata are read in by an information-processing step and are then compared to a database containing many semantically annotated sound files. The idea behind this is to create objects that 'know' how they have to sound; this knowledge is not stored in a separate knowledge base but is embedded directly inside the object itself. If a matching database entry for a certain audio file is found, a sound node is created inside the scenegraph with the corresponding object as its parent node. In this way, the prerequisites for the creation of a spatial auditory experience are met: the position and direction of the sound node in relation to the user can be calculated directly through the traversal of the scenegraph. So far, these steps only generate multiple independent sound nodes. To improve the auditory experience, a method for combining these sound nodes is needed in order to create an acoustic 'atmosphere'.
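The framework is described here only at a conceptual level; the sketch below shows one possible way such metadata could be matched against an annotated sound database and a sound node attached under the matching object in the scenegraph. All class names, fields, and the matching rule are assumptions made for illustration and are not taken from the authors' system.

```python
# Illustrative sketch of the metadata-to-sound-node matching described
# above. All class names, fields, and the matching rule are assumptions
# for illustration; they are not taken from the authors' framework.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    name: str
    metadata: dict = field(default_factory=dict)   # e.g. {"sound_tags": ["tree", "wind"]}
    children: list = field(default_factory=list)

@dataclass
class SoundEntry:
    file: str
    tags: set

SOUND_DB = [
    SoundEntry("tree_wind_loop.wav", {"tree", "wind", "rustle"}),
    SoundEntry("fountain_loop.wav", {"water", "fountain"}),
]

def attach_sound_nodes(node: SceneNode, db=SOUND_DB):
    """Depth-first pass: if an object's metadata tags match a database entry,
    add a sound node as a child so its position follows the scenegraph."""
    tags = set(node.metadata.get("sound_tags", []))
    for entry in db:
        if tags and tags <= entry.tags:          # all requested tags found
            node.children.append(SceneNode(f"sound:{entry.file}",
                                           {"type": "sound", "file": entry.file}))
            break
    for child in list(node.children):
        if child.metadata.get("type") != "sound":
            attach_sound_nodes(child, db)

tree = SceneNode("oak", {"sound_tags": ["tree", "wind"]})
attach_sound_nodes(tree)
print([c.name for c in tree.children])   # -> ['sound:tree_wind_loop.wav']
```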