Search-Based Applications At the Confluence of Search and Database Technologies
Synthesis Lectures on Information Concepts, Retrieval, and Services
Editor: Gary Marchionini, University of North Carolina, Chapel Hill
Synthesis Lectures on Information Concepts, Retrieval, and Services is edited by Gary Marchionini of the University of North Carolina. The series will publish 50- to 100-page publications on topics pertaining to information science and applications of technology to information discovery, production, distribution, and management. The scope will largely follow the purview of premier information and computer science conferences, such as ASIST, ACM SIGIR, ACM/IEEE JCDL, and ACM CIKM. Potential topics include, but are not limited to: data models, indexing theory and algorithms, classification, information architecture, information economics, privacy and identity, scholarly communication, bibliometrics and webometrics, personal information management, human information behavior, digital libraries, archives and preservation, cultural informatics, information retrieval evaluation, data fusion, relevance feedback, recommendation systems, question answering, natural language processing for retrieval, text summarization, multimedia retrieval, multilingual retrieval, and exploratory search.
Search-Based Applications - At the Confluence of Search and Database Technologies Gregory Grefenstette and Laura Wilber 2010
Information Concepts: From Books to Cyberspace Identities Gary Marchionini 2010
Estimating the Query Difficulty for Information Retrieval David Carmel and Elad Yom-Tov 2010
iRODS Primer: Integrated Rule-Oriented Data System Arcot Rajasekar, Reagan Moore, Chien-Yi Hou, Christopher A. Lee, Richard Marciano, Antoine de Torcy, Michael Wan, Wayne Schroeder, Sheau-Yen Chen, Lucas Gilbert, Paul Tooby, and Bing Zhu 2010
Collaborative Web Search: Who, What, Where, When, and Why Meredith Ringel Morris and Jaime Teevan 2009
Multimedia Information Retrieval Stefan Rueger 2009
Online Multiplayer Games William Sims Bainbridge 2009
Information Architecture: The Design and Integration of Information Spaces Wei Ding and Xia Lin 2009
Reading and Writing the Electronic Book Catherine C. Marshall 2009
Hypermedia Genes: An Evolutionary Perspective on Concepts, Models, and Architectures Nuno M. Guimarães and Luís M. Carriço 2009
Understanding User-Web Interactions via Web Analytics Bernard J. ( Jim) Jansen 2009
XML Retrieval Mounia Lalmas 2009
Faceted Search Daniel Tunkelang 2009
Introduction to Webometrics: Quantitative Web Research for the Social Sciences Michael Thelwall 2009
Exploratory Search: Beyond the Query-Response Paradigm Ryen W. White and Resa A. Roth 2009
New Concepts in Digital Reference R. David Lankes 2009
Automated Metadata in Multimedia Information Systems: Creation, Refinement, Use in Surrogates, and Evaluation Michael G. Christel 2009
Copyright © 2011 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.
Search-Based Applications - At the Confluence of Search and Database Technologies Gregory Grefenstette and Laura Wilber www.morganclaypool.com
ISBN: 9781608455072 paperback
ISBN: 9781608455089 ebook
DOI 10.2200/S00320ED1V01Y201012ICR017
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON INFORMATION CONCEPTS, RETRIEVAL, AND SERVICES
Lecture #17
Series Editor: Gary Marchionini, University of North Carolina, Chapel Hill
Series ISSN: Print 1947-945X, Electronic 1947-9468
Search-Based Applications At the Confluence of Search and Database Technologies
Gregory Grefenstette and Laura Wilber Exalead, S.A.
SYNTHESIS LECTURES ON INFORMATION CONCEPTS, RETRIEVAL, AND SERVICES #17
Morgan & Claypool Publishers
ABSTRACT
We are poised at a major turning point in the history of information management via computers. Recent evolutions in computing, communications, and commerce are fundamentally reshaping the ways in which we humans interact with information, and generating enormous volumes of electronic data along the way. As a result of these forces, what will data management technologies, and their supporting software and system architectures, look like in ten years? It is difficult to say, but we can see the future taking shape now in a new generation of information access platforms that combine strategies and structures of two familiar – and previously quite distinct – technologies, search engines and databases, and in a new model for software applications, the Search-Based Application (SBA), which offers a pragmatic way to solve both well-known and emerging information management challenges right now. Search engines are the world’s most familiar and widely deployed information access tool, used by hundreds of millions of people every day to locate information on the Web, but few are aware they can now also be used to provide precise, multidimensional information access and analysis that is hard to distinguish from current database applications, yet endowed with the usability and massive scalability of Web search. In this book, we hope to introduce Search Based Applications to a wider audience, using real case studies to show how this flexible technology can be used to intelligently aggregate large volumes of unstructured data (like Web pages) and structured data (like database content), and to make that data available in a highly contextual, quasi real-time manner to a wide base of users for a varied range of purposes. We also hope to shed light on the general convergences underway in search and database disciplines, convergences that make SBAs possible, and which serve as harbingers of information management paradigms and technologies to come.
KEYWORDS search-based applications, search engines, semantic technologies, natural language processing, human-computer information retrieval, data retrieval, online analytical processing, OLAP, data integration, alternative data access platforms, unified information access, NoSQL, mash-up technologies
Contents

Acknowledgments
Glossary

1  Search Based Applications
   1.1  Introduction
        1.1.1  What is a Search Based Application?
   1.2  High Impact, Low Risk Solution for Businesses
   1.3  Fertile Ground for Interdisciplinary Research
   1.4  A Valuable Tool for Database Administrators
   1.5  New Opportunities for Search Specialists
   1.6  New Flexibility for Software Developers
        1.6.1  Lecture Roadmap

2  Evolving Business Information Access Needs
   2.1  Changing Times
   2.2  The Need for High Performance and Scalability
   2.3  The Need for Unified Access to Global Information
   2.4  The Need for Simple Yet Secure Access

3  Origins and Histories
   3.1  Search Engines
   3.2  Databases
   3.3  What has Changed Recently
        3.3.1  Search Engines Enter the Enterprise
        3.3.2  Databases Go Online
        3.3.3  Structural and Conceptual Changes

4  Data Models & Storage
   4.1  Search Engines
        4.1.1  Conceptual Data Model
        4.1.2  Data Storage
        4.1.3  Storage Framework
   4.2  Databases
        4.2.1  Conceptual Data Model
        4.2.2  Data Storage
        4.2.3  Storage Framework
   4.3  What has Changed Recently
        4.3.1  Search Engines
        4.3.2  Databases

5  Data Collection/Population
   5.1  Search Engines
        5.1.1  Collection
        5.1.2  Updating
   5.2  Databases
        5.2.1  Creation/Collection
        5.2.2  Updating
   5.3  What has Changed
        5.3.1  Search Engines
        5.3.2  Databases

6  Data Processing
   6.1  Search Engines
        6.1.1  Natural Language Processing
        6.1.2  Relevancy Criteria
   6.2  Databases
   6.3  What has Changed
        6.3.1  Search Engines
        6.3.2  Databases

7  Data Retrieval
   7.1  Search Engines
        7.1.1  Querying
        7.1.2  Output
   7.2  Databases
        7.2.1  Querying
        7.2.2  Output
   7.3  What’s Changed?
        7.3.1  Search Engines
        7.3.2  Databases

8  Data Security, Usability, Performance, Cost
   8.1  Search Engines
   8.2  Databases
   8.3  What has Changed
        8.3.1  Search Engines

9  Summary Evolutions and Convergences
   9.1  SBA-Enabling Search Engine Evolutions
        9.1.1  Data Model
        9.1.2  Data Storage
        9.1.3  Data Collection
        9.1.4  Data Processing
        9.1.5  Data Retrieval & Output
        9.1.6  Data Security, Usability, Performance, Cost
   9.2  Convergence

10 SBA Platforms
   10.1  What is an SBA Platform?
   10.2  Information Access Platforms
   10.3  SBA Platforms: Market Leaders
   10.4  SBA Platforms: Other Vendors
   10.5  SBA Vendors: COTS Applications

11 SBA Uses & Preconditions
   11.1  When Are SBAs Used?
   11.2  How Are SBAs Used?

12 Anatomy of a Search Based Application
   12.1  SBAs for Structured Data
        12.1.1  Data Collection
        12.1.2  Data Processing
        12.1.3  Data Updates
        12.1.4  Data Retrieval & Analysis
   12.2  SBAs for Unstructured Content
        12.2.1  Data Collection
        12.2.2  Data Processing
        12.2.3  Data Updates
        12.2.4  Data Retrieval & Analysis
   12.3  SBAs for Hybrid Content

13 Case Study: GEFCO
   13.1  Background
   13.2  A Track & Trace Solution
   13.3  Existing Drawbacks
   13.4  Opting for a Search Based Application
   13.5  First Prototypes
   13.6  Deployment
   13.7  Future

14 Case Study: Urbanizer
   14.1  Background
   14.2  The Urbanizer Solution
   14.3  How Urbanizer Works
   14.4  What’s Next

15 Case Study: National Postal Agency
   15.1  Customer Service SBA
        15.1.1  Background
        15.1.2  Deployment
   15.2  Operational Business Intelligence (OBI) SBA
        15.2.1  Background
        15.2.2  Deployment
   15.3  Sales Information SBA for Telemarketing
        15.3.1  Background
        15.3.2  Deployment

16 Future Directions
   16.1  The Influence of the Deep Web
        16.1.1  Surfacing Structured Data
        16.1.2  Opening Access to Multimedia Content
   16.2  The Influence of the Semantic Web
   16.3  The Influence of the Mobile Web
        16.3.1  Mission-Based IR
        16.3.2  Innovation in Visualization
   16.4  ...And Continuing Database/Search Convergence

Bibliography
Authors’ Biographies
Acknowledgments
We would like to thank Gary Marchionini and Diane Cerra for inviting us to participate in this timely and important lecture series, with a special thank you to Diane for her assistance and patience in guiding us through the publication process. We would also like to thank Morgan & Claypool’s reviewers, including Susan Feldman, Stephen Arnold and John Tait, for their thoughtful suggestions and comments on our manuscript. Ms. Feldman and Mr. Arnold are constant sources of insight for all of us working in search and information access-related disciplines, and we welcome Mr. Tait’s remarks based on his long IR research experience at the University of Sunderland and his more recent efforts at advancing research in IR for patents and other large scale collections at the Information Retrieval Facility. In addition, we are grateful to our colleagues and managers at Exalead for allowing us time to work on this lecture, and for providing valuable feedback on our draft manuscript, especially Olivier Astier, Stéphane Donzé and David Thoumas. We would also like to thank our partners and customers. They are the source of the examples provided in this book, and they have played a pioneering role in expanding the boundaries of applied search technologies, in general, and search-based applications, in particular. Finally, we would like to thank our families. Their love sustains us in all we do, and we dedicate this book to them.
Gregory Grefenstette and Laura Wilber December 2010
Glossary

ACID
Constraints on a database for achieving Atomicity, Consistency, Isolation and Durability
Agility
The ease with which a computer application can be altered, improved, or extended
API
Application Programming Interface, specifies how to call a computer program, what arguments to use, and what you can expect as output
Application layer
Part of the Open System Interconnection model, in which an application interacts with a human user, or another application
Atomicity
The idea that a database transaction either succeeds or fails in its entirety
Availability
The percentage of time that data can be read or used.
Batch
A computer task that is programmed to run at a certain time (usually at night) with no human intervention
B2C
Business to Customer; B2C websites offer goods or services directly to users
B+ tree
A block-oriented data structure for efficient insertion and removal of data nodes
BI
Business Intelligence, views on data that aid users with business planning and decision making
BigTable
An internal data storage system used by Google, handles multidimensional key-value pairs
BSON
Binary JSON
Business application
Any information processing application used in running a business
Cache
A rapid computer memory where frequently or recently used data is temporarily stored
CAP theorem
One cannot achieve Consistency, Availability, and Partition tolerance at the same time
Category
A flat or hierarchic semantic dimension added to a document, or part of a document
Categorization
Assigning, usually through statistical means, one or more categories to text
CDM
Customer Data Management
Cloud services
Computer applications that are executed on computers outside the enterprise rather than in-house. Examples are SalesForce, Google Apps, Yahoo mail, etc.
Clustering
Grouping documents according to content similarity
CMS
Content Management System
Consistency
A quality of an information system in which only valid data is recorded; that is, there are not two conflicting versions of the same data
Connector
A program that extracts information from a certain file format, or from a database
Consolidation
Making all the data concerning one entity available in one output
COTS
Commercial off-the-shelf software
Crawl
Fetching web pages for indexing by following URLs found in each page
CRM
Customer Relationship Management, applications used by businesses to interact with customers
CSIS
Customer Service Information System
Data integration
Merging data from different data sources or different information systems
Data mart
A subset of data found in an enterprise information system, relevant for a specific group or purpose
Data warehouse
A database which is used to consolidate data from disparate sources
DBA
Database administrator, the person who is responsible for maintaining (and often designing) an organization’s database(s)
Deep Web
Web pages that are dynamically generated as a result of form input and/or database querying
Directory
A listing of the files or websites in a particular storage system
DIS
Decision Intelligence System, a computer-based system for helping decision making
Document model
A model of seeing a database entity as a single persistent document, composed of typed fields and categories corresponding to the entity’s attributes
Dublin Core Metadata
A standard for metadata associated with documents, such as Title, Creator, Publisher, etc.
Durability
A database quality that means that successfully completed transactions must persist (or be recoverable) in the case of a system failure
EDI
Electronic Data Interchange, an early database communication system
ETL
Extract-Transform-Load, any method for extracting all or part of a database and storing it in another database
Enterprise Search
Searching access-controlled, structured and unstructured data found within the enterprise
ERP
Enterprise Resource Planning
Evolutive Data Model
Model that can be easily extended with new fields or data types without rebuilding the entire data structure
Facet
A dimension of meaning that can be used for restricting search, for example shirts and coats are two facets that could be found on a shopping site
Field
A labeled part of a document in a search engine. Fields can be typed to contain text, numbers, dates, GPS coordinates, or categories
Firewall
A computer-implemented protection that isolates internal company data from outside access
File server
A service that provides sequential or direct access to computer files
Full-text engine
A system for searching any of the words found in documents, rather than just a set of manually assigned keywords
Garbage collection
A process for recovering memory, usually by recognizing deleted or out-of-date data
Gartner
An information technology research and advisory firm that reports on technology issues
GPS
Global Positioning System, a system of satellites for geolocating a point on the globe
Hash table
Hashing converts a data item into a single number, and the hash table maps this number to a list of items
Heuristics
Methods based more on demonstrated performance than theory; weighting words by their inverse frequency in a collection is an example
HTTP
HyperText Transfer Protocol, an application layer protocol for accessing web pages
IDC
International Data Corporation, a global provider of market intelligence and analysis concerning information technology
ILM
Information Lifecycle Management
IMAP
Internet Message Access Protocol, a protocol for accessing email messages on a mail server
Index, inverted
A data structure that contains lists of words with pointers to where the words are found in documents
Index slice
One section of an inverted index which can be distributed over many different computer stores
Intranet
A secure network that gives authorized users Web-style access to an organization’s information assets (e.g., internal documents and web pages)
IR
Information Retrieval, the study of how to index and retrieve information, usually from unstructured text
IS
Information System, a generic term for any computer system for storing and retrieving information
Isolation
The database constraint specifying that data involved in a transaction are isolated from (inaccessible to) other transactions until the transaction is completed to avoid conflicts and overwrites
IT
Information Technology, a generic term covering all aspects of using computers to store and manipulate information
JDBC
Java Database Connectivity, a Java version of ODBC
Join
In a relational database, gathering together data contained in different tables
JSON
JavaScript Object Notation, a standard for exchanging data between systems
Key-value store
A data storage and retrieval system in which a key (identifying an entity) is linked to the one or more values associated with that entity. This allows rapid lookup of values associated with an entity, but does not allow joins on other fields
Mash-up
A software application that dynamically aggregates information from many different sources, or output from many processes, in a single screen
MDM
Master Data Management, a system of policies, processes and technologies designed to maintain the accuracy and consistency of essential data across many data silos
Metadata
Typed data associated with a document, for example, Author, Date, Category
Mobile Web
Web pages accessible through a mobile device such as a smartphone
MySQL
A popular open source relational database
Normalized relational schema
A model for a relational database that is designed to prevent redundancies that can cause anomalies when inserting, updating, and deleting data
NoSQL
Not Only SQL, an umbrella term for large scale data storage and retrieval systems that use structures and querying methodologies that are different from those of relational database systems
OBI
Operational Business Intelligence, data reporting and analysis that supports decision making concerning routine, day-to-day operations
OCR
Optical Character Recognition, a technology used for converting paper documents or text encapsulated in images into electronic text, usually with some noise caused by the conversion
ODBC
Open Database Connectivity, a middleware for enabling and managing exchanges between databases
Offloading
Extracting information from a database application and storing it in a search engine application
OLAP
Online Analytical Processing, tools for analyzing data in databases
OLTP
Online Transaction Processing
Ontology
A taxonomy with rules that can deduce links not necessarily present in the taxonomy
Partition tolerance
Means that a distributed database can still function if some of its nodes are no longer available
Performance
The measure of a computer application’s rapidity, throughput, availability, or resource utilization
PHP
PHP: Hypertext Preprocessor, a language for programming web pages
PLM
Product Lifecycle Management, systems which allow for the management of a product from design to retirement
Plug-and-play
Modules that can be used without any reprogramming, “out of the box”
POC
Proof of concept, an application that proves that something can be done, though it may not be optimized for performance
Portal
A web interface to a data source
Primary key
In a relational database, a value corresponding to a unique entity, that allows tables to be joined for a given entity
RDBMS
Relational database management system
Redundancy
Storing the same data in two different places in a database or information system. This can cause problems of consistency if one of the values is changed and not the other
Relational model
A model for databases in which data is represented as tables. Some values, called primary keys, link tables together
Relevancy
For a given query, a heuristically determined score of the supposed pertinence of a document to the query
REST
Representational State Transfer, a protocol used in web services in which no state is preserved, but in which every operation of reading or writing is self-sufficient
RFID
Radio Frequency Identification, systems using embedded chips to transmit information
RSS
Really Simple Syndication, an XML format for transmitting frequently updated data
R tree
An efficient data structure for storing GPS-indexed points and finding all the points in a given radius around a point
RDF
Resource Description Framework, a format for representing data as sets of triples, used in semantic web representations
SBA
Search Based Application, an information access or analysis application built on a search engine, rather than on a database.
SCM
Supply Chain Management
Scalability
The desirable quality of being able to treat larger and larger data sets without a decrease in performance, or rise in cost
Search engine
A computer program for indexing and searching in documents
Semantic Web
Collection of web pages that are annotated with machine readable descriptions of their content
Semistructured data
Data found in places where the data type can be surmised, such as in explicitly labeled metadata, or in structured tables on web pages
SEO
Search engine optimization, strategies that help a web page owner to improve a site’s ranking in common web search engines
SERP
Search engine results page, the output of a query to a search engine
Silo
An imagery-filled term for an isolated information system
SMART system
An early search engine developed by Gerard Salton at Cornell
SOAP
Simple Object Access Protocol, a format for transmitting data between services
Social media
Data uploaded by identified users, such as in YouTube, Facebook, Flickr
SQL
Structured Query Language, commonly used language for manipulating relational databases
Structured data
Data organized according to an explicit schema and broken down into discrete units of meaning, with units represented using consistent data types and formats (databases, log files, spreadsheets)
SVM
Support vector machine, used in classification
Table
Part of a relational database, a body of related information. Each row of the table corresponds to one entity, and each column, to some attribute of this entity
Taxonomy
A hierarchically typed system of entities, such as mammals being part of animals being part of living beings
TCO
Total cost of ownership, how much an application costs when all implicit and explicit costs are factored in over time
Timestamp
A chronological value indicating when some data was created
Top-k
The k highest ranked responses in a database system that can rank answers to a query
Transaction
In databases, a sequence of actions that should be performed as an uninterruptable unit, for example, purchasing a seat on a flight
Unstructured data
Data that is not formally or consistently organized, such as textual data (email, reports, documents) and multimedia content
URL
Uniform Resource Locator, the address of a web page
Usability
The desirable quality of being able to be used by a large population of users with little or no training
Vertical application
An application built for a specific domain, such as pharmaceuticals, finance, or manufacturing. A horizontal application could be used in a number of different domains.
XML
eXtensible Markup Language, a standard for including metadata in a document
W3C
World Wide Web Consortium
WYSIWYG
What You See Is What You Get
YPG
Yellow Pages Group, Canada
CHAPTER 1

Search Based Applications

1.1 INTRODUCTION
Figure 1.1: Can you see the search engine behind these screens?
Management of information via computers is undergoing a revolutionary change as the frontier between databases and search engines is disappearing. Against this backdrop of nascent convergence, a new class of software has emerged that combines the advantages of each technology, right now, in Search Based Applications. Until just a short while ago, the lines were still relatively clear. Database software concentrated on creating, storing, maintaining and accessing structured data, where discrete units of information (e.g. product number, quantity available, quantity sold, date) and their relation to each other were well defined. Search engines were primarily concerned with locating a document or a bit of information within collections of unstructured textual data: short abstracts, long reports, newspaper articles, email, Web pages, etc. (classic Information Retrieval, or IR; see Chap. 3). Business applications were built on top of databases, which defined the universe of information available to the end user, and search engines were used for IR on the Web and in the enterprise.
Figure 1.2: Databases have traditionally been concerned with the world of structured data; search engines with that of unstructured data (some of these data types, like HTML pages and email messages, contain a certain level of exploitable structure, and are consequently sometimes referred to as "semi-structured").
Such neat distinctions are now falling away as the core architectures, functionality and roles of search engines and databases have begun to evolve and converge. A new generation of non-relational databases, which shares conceptual models and structures with search engines, has emerged from the world of the Web (see Chapter 4), and a new breed of search engine has arisen which provides native functionality akin to both relational and non-relational databases (described in Chapters 3-9 and listed in Chapter 10). It is this new generation engine that supports Search Based Applications, which offer precise, multi-axial information access and analysis that is virtually indistinguishable at a surface level from database applications, yet are endowed with the usability and massive scalability of Web search.
1.1.1 WHAT IS A SEARCH BASED APPLICATION?
We define a Search Based Application (SBA) as any software application built on a search engine backbone rather than a database infrastructure, and whose purpose is not classic IR, but rather mission-oriented information access, analysis or discovery.1
1 This new type of application has alternately been referred to as a "search application," "search-centric application," "extended business application," "unified information access application" and "search-based application." The latter is the label used by IDC’s Susan Feldman, one of the first industry analysts to identify SBAs as a disruptive trend and an influential force in the SBA label being adopted as the industry standard. Feldman has recently moved toward a more precise definition, limiting SBAs to "fully packaged applications" supplying "all the tools that are commonly needed for a specific task or workflow," that is to say, commercial-off-the-shelf (COTS) software [Feldman and Reynolds, 2010]. However, we prefer a broader definition to underscore one of the great benefits of the SBA model: the ability for anyone to rapidly and inexpensively develop highly specific solutions for unique contexts, and, following the same pattern as database applications, we expect both custom and COTS SBAs to flourish over the next decade.
Definition: Search Based Application
A software application that uses a search engine as the primary information access backbone, and whose main purpose is performing a domain-oriented task rather than locating a document.

Examples:
• Customer service and support
• Logistical track and trace
• Contextual advertising
• Decision intelligence
• e-Discovery

SBAs may be used to provide more intuitive, meaningful and scalable access to the content in a single database, hiding away the complexity of the database structure as data is extracted and re-purposed by search engine techniques. They may also be used to autonomously and intelligently gather together massive volumes of unstructured and structured data from an unlimited number of sources (internal or external) and to make this aggregate data available in real time to a wide base of users for a broad range of purposes.

While search engines in the SBA context complement rather than replace databases, which remain ideal tools for many types of transaction processing, this ’re-purposing’ of search engines nonetheless represents a major rupture with a 30-year tradition of database-centered software application development. In spite of the significance of this shift, the SBA trend has been unfolding largely under the radar of researchers, systems architects and software developers. However, SBAs have begun to capture the focused attention of business.2

"The elements that make search powerful are not necessarily the search box, but the ability to bring together multiple types of information quickly and understandably, in real time, and at massive scale. Databases have been the underpinning for most of the current generation of enterprise applications; search technologies may well be the software backbone of the future." —Susan Feldman, IDC LINK, June 9, 2010
2 SBAs are fueling a significant portion of the growth in the search and information access market, which IDC estimates grew at double digit rates in 2007 and 2008, and at a healthy 3.9% (to $2.1 billion) in 2009 [Feldman and Reynolds, 2010]. Gartner, Inc. estimates a compound annual growth rate of 11.7% from 2007 to 2013 for the enterprise search market [Andrews, 2010].
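To make this concrete before moving on: the sketch below (in Python; the table names, field names and the SearchDocument structure are our own invention, not any vendor's API) shows the kind of flattening an SBA platform might perform when it extracts rows from several normalized database tables and re-purposes them as a single typed, facetable search document.

```python
# A sketch (not any vendor's API) of the "document model" behind many SBAs:
# rows from several normalized tables are denormalized into one flat, typed
# search document per business entity. All names here are invented.

from dataclasses import dataclass, field

@dataclass
class SearchDocument:
    uid: str                                        # unique document identifier
    text: str                                       # free text for full-text indexing
    fields: dict = field(default_factory=dict)      # typed fields (dates, numbers...)
    categories: list = field(default_factory=list)  # facet values for navigation

def order_to_document(order, customer, products):
    """Join order, customer and product rows into one self-contained document."""
    return SearchDocument(
        uid=f"order-{order['id']}",
        text=" ".join(p["description"] for p in products),
        fields={"order_date": order["date"],
                "total": order["total"],
                "customer_name": customer["name"]},
        categories=[f"status/{order['status']}"] +
                   [f"product/{p['category']}" for p in products],
    )

# Example rows, as a database connector might deliver them:
doc = order_to_document(
    order={"id": 42, "date": "2010-11-30", "total": 129.90, "status": "shipped"},
    customer={"name": "ACME Corp"},
    products=[{"description": "blue cotton shirt", "category": "apparel"}],
)
print(doc.uid, doc.categories)   # order-42 ['status/shipped', 'product/apparel']
```

Once flattened this way, the document can be full-text indexed and its categories exposed as facets, which is what lets the resulting application feel like search while answering database-style questions.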
1.2 HIGH IMPACT, LOW RISK SOLUTION FOR BUSINESSES
SBAs offer businesses a rapid, low risk way to eliminate some of the peskiest and most common information systems (IS) problems: siloed data, poor application usability, shifting user requirements, systemic rigidity and limited scalability.
Figure 1.3: Search engine-based Sourcier makes vast volumes of structured water quality data accessible via map-based search and visualization, and ad hoc, point-and-click analysis.
Even though SBAs allow businesses to clear these hurdles and bring together large volumes of real time information in an immediately actionable form—thereby improving productivity, decision making and innovation—too many in the business community are still unaware that search engines can serve as an information integration, discovery and analysis platform. This is the reason we have written this book.
1.3 FERTILE GROUND FOR INTERDISCIPLINARY RESEARCH
We have also undertaken this project to introduce SBAs to a wider segment of the data management research community. Though the convergence of search and database technologies is gradually being recognized by this community,3 many researchers are still unaware of the pragmatic benefits of SBAs and the mutually beneficial evolutions underway in both search and database disciplines.

3 See, for example, recent workshops like Using Search Engine Technology for Information Management (USETIM’09), held in August 2009 at the 35th International Conference on Very Large Data Bases (VLDB09), which examines whether search engine technology can be used to perform tasks usually undertaken by databases. http://vldb2009.org/?q=node/30
However, as a group of prominent database and search scientists recently noted, exploding data volumes and usage scenarios along with major shifts in computing hardware and platforms have resulted in an "urgent, widespread need for new data management technologies," innovations that will only come about through interdisciplinary research.4
Figure 1.4: This Akerys portal generates personalized, real-time real estate market intelligence based on unstructured online classifieds and in-house databases.
1.4 A VALUABLE TOOL FOR DATABASE ADMINISTRATORS
Like their research counterparts, many Database Administrators (DBAs) are also unfamiliar with SBAs. We hope this book will raise awareness of SBAs among DBAs as well, because SBAs offer these professionals a fast and non-intrusive way to offload overtaxed systems5 and to reveal the full richness of the data those systems contain, opening database content up for free-wheeling discovery and analysis, and enabling it to be contextualized with external Web, database and enterprise content.
4 From The Claremont Report on Database Research, the summary report of the May 2008 meeting of a group of leading database and data management researchers who meet every five years to discuss the state of the research field and its impacts on practice: http://db.cs.berkeley.edu/claremont/claremontreport08.pdf
5 Offloading a database means extracting all the data that a user might want to access and indexing a copy of this information in a search engine. The term offloading refers to the fact that search requests no longer access the original database, whose processing load is hence reduced.
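A minimal sketch of the offloading pattern described in footnote 5, assuming a toy SQLite table and a simple in-memory inverted index standing in for a real search engine; the schema, columns and sample data are invented for illustration.

```python
# A sketch of offloading: copy rows out of a relational database (a toy
# SQLite table here) and index them in a search-engine-style inverted index,
# so that search traffic no longer touches the source database.

import sqlite3
from collections import defaultdict

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER, name TEXT, description TEXT)")
db.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    (1, "shirt", "blue cotton shirt, machine washable"),
    (2, "coat", "wool winter coat, navy blue"),
])

# Extract: one bulk read; afterwards, queries are served from the index copy.
inverted_index = defaultdict(set)    # word -> set of product ids
for pid, name, desc in db.execute("SELECT id, name, description FROM products"):
    for word in f"{name} {desc}".lower().replace(",", " ").split():
        inverted_index[word].add(pid)

def search(word):
    """Answer a one-word query entirely from the offloaded index."""
    return sorted(inverted_index.get(word.lower(), set()))

print(search("blue"))   # -> [1, 2], without touching the database again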
1.5 NEW OPPORTUNITIES FOR SEARCH SPECIALISTS
For search specialists who are not yet familiar with SBAs, we hope to introduce them to this significant new way of using search technology to improve our day-to-day personal and professional lives, and to make them aware of the new opportunities for scientific advancement and entrepreneurship that await as we seek ways to improve the performance of search engines in the context of SBA usage.
1.6 NEW FLEXIBILITY FOR SOFTWARE DEVELOPERS
We also hope to make software developers aware of the new options SBAs offer: one doesn’t always need to access an existing database (or create a new one) to develop business applications or to meticulously identify all user needs in advance of programming, and one need not settle for applications that must be modified every time these needs or source data change.
1.6.1 LECTURE ROADMAP
While this diversity of audiences and the short format of the book necessitate a surface treatment of many issues, we will consider our mission accomplished if each of our readers walks away with a solid (if basic) understanding of the significance, function, capabilities and limitations of SBAs, and a desire to go forth and learn more. To begin, we’ll first take a look at the ways in which information access needs have changed, then provide a comparative view of ways in which search engines and databases work and how each has evolved. We’ll then explain how SBAs work and how and when they are being used, including presenting several case studies. Finally, we will situate this shift within the larger context of evolutions taking place on the Web, including conceptions of the Deep Web, the Semantic Web, and the Mobile Web, and what these evolutions may mean for the next generation of SBAs.
CHAPTER 2

Evolving Business Information Access Needs

2.1 CHANGING TIMES
Figure 2.1: The 1946 ENIAC, arguably the first general-purpose electronic computer, weighed in at 30 tons and consumed 63 sq. meters of floor space. To celebrate ENIAC’s 50th birthday, a University of Pennsylvania team integrated the whole of ENIAC on a 7x5 sq. mm chip. (U.S. Army Photo, courtesy Harold Breaux.)
Before we examine search and database technologies in more detail (paying particular attention to recent evolutions giving rise to Search Based Applications), it’s important to first understand the changes in the business information landscape which are driving these evolutions.
Globalization, the Internet, new data capture technologies (e.g., barcode scanners, RFID), GPS, Cloud services, mobile computing, 3D and virtualization...a whole host of evolutions over the past two decades have resulted in a veritable explosion in the volume of data businesses must manage, and near-runaway complexity in enterprise information ecosystems. Data silos are mushrooming at an impossible pace, and the number and types of users and devices interacting with organizations’ information systems are proliferating. While opinions may vary as to specific recommendations for addressing these challenges, it’s clear that, at a minimum, organizations need:
• Better ways to manage large data volumes (improved performance, scalability and agility)
• More data integration (physical or virtual)
• Easier (yet secure) access for more types of users and devices
2.2 THE NEED FOR HIGH PERFORMANCE AND SCALABILITY
According to a recent IDC estimate, the amount of digital information created and replicated in 2009 was 800,000 petabytes, enough to fill a stack of DVDs reaching from the earth to the moon and back. The estimate for 2020? A 44-fold increase to 35 zettabytes, or enough to fill a stack of DVDs reaching halfway to Mars [Gantz and Reinsel, 2010]. Businesses must not only find some way to locate accurate information in these massive stores, but find a way to make it understandable and useful to a broad range of audiences. To make matters worse, they must do so faster than ever before. In today’s hyper competitive, interconnected global economy, yesterday’s data simply will no longer do.
2.3 THE NEED FOR UNIFIED ACCESS TO GLOBAL INFORMATION
Similarly, making business decisions based on only a small fraction of available data will also no longer suffice. Up to 90% of these massive corporate information assets now exist in unstructured (or semi-structured) format,1 like text documents, multimedia files, and Web content. Information systems have to date done a poor job of exploiting this ‘messy’ content, resulting in a very limited perspective of a business and its market. Some data, like Web and social media content, is simply not leveraged in a business context (not in any formal way, at least): already strained systems simply can’t digest such voluminous, ‘dirty’ data.

1 Seth Grimes (http://www.clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=551) investigates the origins of this commonly cited approximation of the amount of unstructured data. He concludes that, whatever the real proportion (53%/80%/85%/90%), these figures “make concrete – they focus and solidify – the realization that unstructured data matters." Susan Feldman traces the 85% figure back to an IBM study from the 1990s.
Figure 2.2: Skyrocketing data volumes result in information overload in the workplace. (Based on source image in the IDC Digital Universe Study, sponsored by EMC, May 2010, [Gantz and Reinsel, 2010].)
At the same time, current information systems are ill-suited to bringing data from multiple data sources together in an understandable, pertinent way, even when source systems contain coherent versions of a consistent type of data (for example, unifying access to multiple databases). Data integration is urgently needed, yet it remains for most organizations a prohibitively complex, costly undertaking, if not a completely Sisyphean task given the proliferation of data sources, the complexity of contemporary information ecosystems, and the high rate of mergers and acquisitions.
2.4 THE NEED FOR SIMPLE YET SECURE ACCESS
Even if a business could surmount these back-end scaling and integration challenges, making this data easy for both human beings and other systems to use would remain a challenge. The evolution of virtually all work into “information work” to some degree, globalization, and the increasing depth, complexity and interdependence of supply chains2 (fueled in large part by the growing influence of a consumer-led model of demand), mean Information Technology (IT) is under tremendous pressure to extend access to business information assets to an ever greater range of users and systems. Consequently, Information Technology no longer has the luxury of developing applications for a known group of trained professionals; they are being tasked with creating applications that can be used by people whom they do not know, cannot train, and who may not even speak the same language, much less have a predictable set of technical skills.3 Moreover, long experience with the consumer Web has made even highly trained, highly skilled workers impatient with difficult-to-use information systems. Accordingly, instant usability has assumed a mission-critical role.

While IT struggles to respond to this clamor for Web-style simplicity, unity, and scalability, they must still meet security demands that are growing more complex and stringent by the day. All in all, it’s a formidable set of challenges to say the least, one that points to a need for a unified information access platform that can handle both unstructured and structured data—both inside the firewall and out on the Internet, scales easily to any volume, is completely secure, and utterly simple to use. In short, it points to the need for an infrastructure that combines the capacities and benefits of both search engines and databases.

This convergence, satisfied at a functional level by SBAs, has been fueled by these shifts in information access needs and by a dissolution of the boundaries between the Web and the enterprise, with an attendant incursion of search engines and databases into each other’s once exclusive domains. Let’s now look more closely at search and database technologies, and the effects of this cross-incursion, in particular the SBA-enabling transformation of IR-focused search engines into general information access platforms.

2 See [Dedrick et al., 2008] for an interesting study on how available information technology relates to implementable supply chain models.
3 One early attempt to circumvent the complexity of accessing databases was to use a natural language interface [Copestake and Sparck Jones, 1990]. In such interfaces, a well-structured, grammatical, “ordinary” language query was transformed into a classical database request. Unfortunately, such systems remained brittle and could not sufficiently hide the complexity of the information model used by the database.
CHAPTER 3

Origins and Histories

At A Glance

Characteristic     Search Engine           Databases
Origin             Web                     Enterprise
Primary Usage      Information retrieval   Transaction processing
Target content     Unstructured text       Structured data
Number of users    Unlimited               Limited
Data volume        Billions of records     Thousands/millions of records

3.1 SEARCH ENGINES
The best known search engine today is Google, but research into search engines that could rank documents relevant to a query began in the late 1950s with Cyril Cleverdon’s Cranfield studies [Cleverdon and Mills, 1963], Margaret Masterman and her team at the Cambridge Language Research Unit [Masterman et al., 1958], and Gerard Salton’s SMART system [Salton, 1971]. Arising from the world of library science and card catalogs, these early search engines were built to index and retrieve books and other documents, moving from older systems using Boolean retrieval over controlled indexing terms to relevance ranking and fuzzy matches over the full text of documents. With the appearance of the Internet in the early 1990s, Web-based search engines began to emerge as an efficient way to help users locate information in a huge, organic, and linked collection of documents. The very first Internet IR systems compiled searchable lists of file names found on Internet servers [Deutsch and Emtage, 1992], and the next generation (e.g., Yahoo!) sought to build directories of Web sites [Filo and Yang, 1995] or, alternately, to build a searchable key word index of the textual content of individual Web pages (not unlike a book index). Instead of users navigating a directory menu to locate information, these latter “full-text" engines enabled users to simply enter a key word of interest in a text box, and the engine would return a ranked list of pages containing that term.1 This full-text model remains the dominant force in information retrieval (IR) from Web sources.2

1 The first such full-text web search engine was WebCrawler [Pinkerton, 1994], which came out in 1994.
2 Another tradition of Boolean search engines gave rise to IBM’s Storage and Information Retrieval System (STAIRS) and Lockheed’s Dialog, which survives as ProQuest’s Dialog, www.dialog.com.
Search engine developers were concerned with processing great quantities of varied data, with little human intervention, into indexes that could be rapidly searched in response to a user query. When the Web began to grow at an exponential rate, the concern of search engine developers was to gather as much information as possible in the processing time allocated. Rather than providing all possible responses to a query, techniques were developed to try to provide the best answers, using word weighting, Web page weighting, and other heuristics. Gradually, search engines implemented a variety of approximate matching techniques involving language processing, and exposed more and more of whatever metadata or category information could be retained from input documents. Today, millions of people worldwide have become familiar with using Web search engines to find at least some of the information they need.
Figure 3.1: Two early Web search engines, WebCrawler and Lycos. Screenshots from October 1996 (courtesy of the Internet Archive).
Not intended for transaction processing and focused initially on IR for textual data only, these Web engines were designed from inception to execute lightning fast read operations against a massive volume of data by a vast number of simultaneous users [Brin and Page, 1998].
3.2 DATABASES
Available commercially for more than 50 years, databases were born and bred inside the enterprise, long before the Web existed. Simply stated, they are software programs designed to record and store information in a logical, pre-defined3 structure [Nijssen and Halpin, 1989]. They were initially designed to support a limited number of users (typically well-trained), and to process a limited volume of data (data storage was very expensive during the period in which databases emerged). From their earliest days, the primary role of databases was to capture and store business transactions (orders, deliveries, payments, etc.) in a process known as Online Transaction Processing (OLTP), and to provide reporting against those transactions [Claybrook, 1992]. Secondarily, though equally important, they were used to organize and store non-transactional information essential to a business's operations, for example, employee or product data. The overriding concern of database developers was that the data contained in the databases remain accurate and reliable, even if many people were manipulating the database, reading it, and writing into it, at the same time. If you consider the problem of reserving an airline ticket, it becomes clear how important and difficult this is. Airline agents and consumers all over the world might be trying to reserve seats simultaneously, and it is the job of the database software to make sure that every available seat is immediately visible, in real time, and that no seat is sold twice. These worries led database designers to concentrate on how to make transactions dealing with data consistent and foolproof. Around the mid-1970s, databases began to be used to consolidate data from different business units to produce operational reports, representing the first generation of decision intelligence systems (DIS). From the early 1990s on, DIS evolved from simple reporting to more sophisticated analysis enabled by special tools (called OLAP tools, for Online Analytical Processing) capable of combining multiple dimensions (time, geography, etc.) and indicators (sales, clients, inventory levels, etc.). To support the consolidated views needed for DIS, databases began to be aggregated into 'mega-databases' known as data warehouses [Chaudhuri and Dayal, 1997], which, in turn, had to be re-divided into smaller, task-driven databases known as data marts [Bonifati et al., 2001] to skirt complexity and performance issues.
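To make the seat-reservation problem concrete, here is a minimal, hypothetical sketch in Python using the standard sqlite3 module (the table, seat, and passenger names are invented for illustration). The transaction either commits as a whole or rolls back, so the same seat can never be sold twice:

import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for a real reservations database
conn.execute("CREATE TABLE seats (seat TEXT PRIMARY KEY, passenger TEXT)")
conn.execute("INSERT INTO seats VALUES ('12A', NULL)")
conn.commit()

def reserve(conn, seat, passenger):
    """Atomically claim a seat; fail cleanly if it is already taken."""
    try:
        with conn:  # transaction: commits on success, rolls back on any error
            cur = conn.execute(
                "UPDATE seats SET passenger = ? WHERE seat = ? AND passenger IS NULL",
                (passenger, seat))
            if cur.rowcount == 0:        # seat already sold: force a rollback
                raise ValueError("seat taken")
        return True
    except ValueError:
        return False

print(reserve(conn, "12A", "Alice"))  # True:  seat claimed
print(reserve(conn, "12A", "Bob"))    # False: double booking prevented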
3.3 WHAT HAS CHANGED RECENTLY
First, let's look at what hasn't changed: Search engines still handle textual data exceptionally well and generally provide far faster, more scalable IR than databases. Databases remain exceptional tools for recording transactions (OLTP) and for the deep analysis of structured data (complex OLAP) [Negash and Gray, 2008]. What has changed, however, is that the boundaries between the Web and the enterprise have blurred, with an attendant incursion of each into the other's once exclusive domains.

3 The term "pre-defined" is not meant to imply that the data schemas employed by relational databases are static: they are rather constructed with a base model that typically evolves over the course of usage. This evolution, however, requires largely manual adaptation of the schema and stored data.
Figure 3.2: The Database contains the reference version of enterprise data. Data warehouses collect data from different databases to achieve a consolidated view, and/or to offload access requests to protect transaction processing capacity. Data marts contain smaller slices of this consolidated data.
3.3.1 SEARCH ENGINES ENTER THE ENTERPRISE
In the 1990s, search engines entered the world of searching enterprise data.4 Beginning with the release of Verity's Topic engine5 in 1988, a new type of engine, the enterprise search engine, was tasked with IR on internal business systems (file servers, email servers, intranets, collaboration systems, etc.) instead of on the Web. This shift meant these full-text engines had to tackle a new set of challenges unique to a business environment: processing a wider range of data formats, enforcing security, and developing different conceptions of relevancy and ranking (Web notions of 'popularity' being largely meaningless in an enterprise environment). These engines also sought to provide alternative ways of navigating search results, such as categorical clustering (called faceted search, see Tunkelang [2009]), rather than leaving users to rely solely on ranked lists. At the same time, similar experiments with faceted search and navigation were taking place on the Web.6 (Search-based DIS was not yet on the radar, but the foundations were being laid.)

Figure 3.3: Enterprise search required new strategies for determining relevancy, navigating results, and managing security.

4 Personal Library Software (PLS), later acquired by AOL, provided search capabilities over CD-ROMs and some enterprise data in the mid 1980s, and Lockheed's Dialog system provided enterprises access to external data and text databases from the early 1970s.
5 http://www.answers.com/topic/verity-inc
3.3.2 DATABASES GO ONLINE
During the same period, databases entered the world of the Web. First, application layers were constructed to tie Web interfaces into back end database systems to enable e-commerce [Hasselbring, 2000].7 This drive rapidly expanded to other business functions as well (customer support, knowledgebases, etc.). Shortly thereafter, databases began to be used to manage, search and present the entirety of a website’s content, a role their enterprise counterparts had already begun to play in internal content management systems (CMS). These expanded IR functions meant databases needed to become more adept at manipulating textual information, and the incursion into the Web, along with escalating corporate datastores, placed pressure on databases to improve their scalability and performance to meet the demands of large volumes of users and data.
6 Alta Vista Live Topics, etc.
7 Before the advent of Web e-commerce, databases were already connecting with one another via EDI (Electronic Data Interchange) systems, first connected via dedicated channels, later connecting via the Internet, which was conceived in 1969 by the U.S. Department of Defense's Advanced Research Projects Agency, or DARPA. The World Wide Web, and consequently Web-based e-commerce, emerged in the mid-1990s.
Figure 3.4: Search Based Applications introduce the affordances of search engines into the information access and business intelligence domains.
3.3.3 STRUCTURAL AND CONCEPTUAL CHANGES
Overall, the cumulative effect of these shifts, together with changes in business IR needs, led to important structural and conceptual evolutions in both databases and search engines, touching on foundational areas such as:

• Conceptual Data Models
• Logical Storage Structures
• Data Collection/Population Procedures
• Data Processing Methods
• Data Retrieval Strategies

We'll now compare the traditional approaches and recent evolutions of databases and search engines in each of these areas, showing what has changed recently to allow for the realization of Search Based Applications.
CHAPTER 4

Data Models & Storage

At A Glance

Characteristic              Search Engine     Databases
Basic semantic model        Document model    Relational data model
Logical storage structure   Index             Relational table
Representational state      De-normalized     Normalized
Storage architecture        Distributed       Centralized

4.1 SEARCH ENGINES

4.1.1 CONCEPTUAL DATA MODEL
Search engines use a "document model" to represent information. In the earliest days of Web search, a 'document' was a Web page, and that document consisted of keywords found in the page as well as descriptive information like page title, content headings, author, and modification date (collectively known as metadata,1 or information about information). The first enterprise engines likewise conceived of a document in a fairly literal sense: a Word document, a presentation, an email message, etc.
4.1.2 DATA STORAGE
The search engine index provides the primary structure for storing information about documents, including the information required to retrieve them: file system location, Web address (URL), etc. Some engines would store part of the documents crawled in a cache, a compressed copy of select text, with associated metadata. Other engines would store cached copies of complete documents. For efficiency, some search engines (particularly Web engines) would build and maintain indexes using these cached copies, though users would be linked through to the 'live' page when they clicked on a result.

1 See the Dublin Core Metadata Initiative at dublincore.org for an elaborate standard of metadata markup for documents in the sense used here.
Let's look at the example of a simple full-text Web engine.2 To construct an index for this type of engine, the engine first creates a forward index which extracts all the terms used in a single document (Figure 4.1). Next, an inverted index is created against the contents of the forward index; it reverses this pairing, listing each word followed by all document(s) in which the word appears, facilitating key word-based retrieval. To return results ranked by relevancy, such indexes also incorporate information like the number of times a term is used in a document (frequency), an indication of the position of these occurrences (if a user's search term occurs often and at the top of a document, it is probably a good match) and, for Web engines, factors such as the number and credibility of external links pointing to the source pages (Figure 4.2).
Figure 4.1: A forward index compiles all pertinent words in a given document.
Figure 4.2: An inverted index extracts all occurrences of a word from the forward index, and associates it with other data such as position and frequency. Here, the term “dogs” appears in Document 1 two times, at positions 3 and 6.
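The two figures can be reproduced in a few lines of code. A minimal sketch in Python (the documents are invented for illustration); the inverted index records each term's positions so that frequency and proximity can later feed relevancy ranking:

from collections import defaultdict

docs = {
    "doc1": "dogs chase cats and dogs bark",  # invented documents
    "doc2": "cats sleep all day",
}

# Forward index: document -> list of its terms, in order of appearance.
forward = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> {document -> [positions]}.
inverted = defaultdict(lambda: defaultdict(list))
for doc_id, terms in forward.items():
    for position, term in enumerate(terms):
        inverted[term][doc_id].append(position)

print(dict(inverted["dogs"]))  # {'doc1': [0, 4]}: frequency 2, positions 0 and 4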
4.1.3 STORAGE FRAMEWORK
To cope with the performance, scalability and availability demands of the Web, search engines from inception were designed with the distributed architectures suited to grid computing models.2 In large volume environments, search platforms distribute processing tasks, indexes and document caches across multiple servers [Councill et al., 2006]. For example, an engine may distribute slave copies of index slices across multiple servers (with all writes written to the master, all reads from slaves), with a 'meta-index' used to direct queries to the right slice. These meta-indices may take the form of bitmaps; hash tables (equality searches); B+ trees, a multi-level index for block-oriented storage capable of range searches like <, >, <=, >=, or BETWEEN [Comer, 1979]; R-trees, for multi-dimensional data such as geospatial coordinates [Guttman, 1984]; or some combination of the above.3

2 For an in-depth description of building a search engine, see Büttcher et al. [2010].
3 http://en.wikipedia.org/wiki/Tree_(data_structure) provides a good introduction to each of these tree data structures.
Figure 4.3: Distributed architectures with load balancing, partitioning and replication are used for improved performance and availability in large volume environments
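A minimal sketch in Python of the routing idea, assuming a simple hash-partitioned meta-index (production engines use the richer structures named above, plus replication and load balancing):

import hashlib

SHARDS = ["index-server-1", "index-server-2", "index-server-3"]  # hypothetical slice servers

def shard_for(term):
    """Route a query term to the index slice holding its postings."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("dogs"))  # every lookup for "dogs" is directed to the same slice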
4.2 DATABASES

4.2.1 CONCEPTUAL DATA MODEL
While search engines represent information using a document model, databases employ a 'data model,' or more specifically, for relational databases (the dominant database type since the 1980s), a normalized relational schema [Codd, 1990]. The goal of this schema is to enable complete, accurate representations of business entities, like products or customers, that can be used in a multitude of ways. Unlike conventional search engines, in which the index serves as a directory for retrieving externally stored documents, a database's normalized relational schema serves as both conceptual data model and physical storage framework.
4.2.2 DATA STORAGE
The primary logical storage structure in a database is a table (also called, logically enough, a relation). Each individual table contains a body of related information, for example, product details grouped in a 'Products' table. Each column (attribute) represents a single category of information for the entities represented in the table, for example, 'Price' in a 'Products' table. Each cell (field) in a column contains a uniform representation of that attribute (for instance, text values like "one hundred," or numeric values like "100," to represent prices). Each row (called a tuple or record) in the table constitutes a complete representation of the entity the table treats (for example, a specification for a product). As each row represents a unique entity, it is assigned a unique identifier (the primary key). Consider, for instance, the simple table of products for a company in Figure 4.4.
Figure 4.4: Database tables store information in a well-defined, highly consistent form. The row is the logical unit for representing a complete view of an individual instance of the entity the table treats.
Figure 4.5 shows a Manufacturers table. It contains information related to the Products table (Who manufactures a given product?). To logically bind these two tables, the manufacturer's primary key is inserted into the Products table. When a primary key appears in an external table, it is called a foreign key. It is through this series of primary and foreign keys that the relationships between tables are established. The individual structure of tables and their relationships to one another constitute the database's overall relational schema, a schema which is precisely defined in advance of data creation/collection by the database's architect.

Figure 4.5: Relationships between tables are established via a system of Primary and Foreign Keys.

To retrieve a list of all products displaying the manufacturers' names and addresses, these two tables have to be combined (joined). Why store data in separate tables if it has to be pieced back together in this fashion? The answer is to avoid repetition and anomalies through a process known as normalization. In normalization, all data is broken down into the smallest, non-repeated units possible. The goal of normalization is to avoid having multiple values entered in a single cell, to avoid repeating data in multiple rows, and to avoid repeating data across multiple tables. It is a means to ensure data consistency (i.e., two versions of the same piece of information are the same across the system) and integrity (i.e., the data remains the same as it was entered) as well as to avoid redundancy, the latter in part being a legacy from the early days of database development, when, as pointed out previously, storage was extremely expensive. Figure 4.7 shows a simple non-normalized table, and the same table normalized. Exceptions encountered over the course of usage (for instance, adding a manufacturer producing the same part in multiple locations under more than one name) require modification of the data model and database structure, as well as modification of the application layer programming used to retrieve data views. If one is attempting to manage a large, rapidly evolving body of information, accommodating exceptions can become quite complex and time-consuming.
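A minimal sketch using Python's sqlite3 module shows normalization and a join in action (the table and column names are invented for illustration); the JOIN reassembles the normalized tables through the foreign key:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE manufacturers (mfr_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE products (prod_id INTEGER PRIMARY KEY, name TEXT, price REAL,
                       mfr_id INTEGER REFERENCES manufacturers(mfr_id));
INSERT INTO manufacturers VALUES (1, 'Acme', '1 Main St');
INSERT INTO products VALUES (100, 'Scooter', 99.0, 1);
""")

# Reassemble the normalized tables through the foreign key.
query = """
SELECT p.name, p.price, m.name, m.address
FROM products p JOIN manufacturers m ON p.mfr_id = m.mfr_id
"""
for row in conn.execute(query):
    print(row)  # ('Scooter', 99.0, 'Acme', '1 Main St')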
4.2.3 STORAGE FRAMEWORK
Relational databases can use distributed architectures (master/slave replication, cluster computing, table partitioning, etc.) for large volume systems. However, given their primary emphasis on resource-intensive write operations (the Create, Update and Delete operations of CRUD), their need to join data across multiple tables for meaningful data views, and their need to ensure data integrity and reliability, maintaining performance, integrity and consistency across multiple servers is usually complex and expensive.

Figure 4.6: Entity-Relationship (ER) Diagrams are a common way of representing the overall database model.

Figure 4.7: Data is broken down into the smallest discrete units needed to preserve data integrity and consistency.
4.3 WHAT HAS CHANGED RECENTLY

4.3.1 SEARCH ENGINES
First, with search's incursion into the enterprise, and the accompanying push to handle structured content—a force in Web search engine development as well—the concept of a 'document' began to evolve. For Search Based Application engines, the number and complexity of attributes stored in a given column greatly increased, and the conception of a document expanded from that of a literal document like a Web page or text file, to also include a meaningful collection of information akin to a database-style business entity. For example, a search 'document' in an SBA context may aggregate numerous discrete pieces of information from multiple sources to create a well-rounded representation of an entity like a product or employee. Unlike a database entity, however, this meaningful representation is always present in its entirety in a single synthetic document stored within the search engine index. Entity attributes are stored in a de-normalized state and can evolve as source data evolves (we'll look at how these representations are built and maintained in Chapters 5 and 6).
Figure 4.8: The concept of a ‘document’ in the search context has evolved to include representations of database-style entities
4.3.2 DATABASES
The core data model and storage unit for a relational database remains the row-oriented, normalized relational table, and scaling this model continues to be expensive and complex. To improve performance, database engineers began to experiment with ways to introduce persistent, document-style modes of representing business entities into databases, since joining multiple tables to obtain a unified data view is resource intensive.
At their simplest level, these efforts entailed a more extensive use of views or virtual tables, which are in essence simply cached versions of data 'pre-assembled' into meaningful units [Carey et al., 1998]. At a more advanced level [Hull, 1997], these efforts have resulted in experiments with object-oriented databases, which were originally created to handle complex data types like images, video, audio, and graphs that were not well supported by conventional relational databases. However, performance and scaling issues, among others, have confined object-oriented databases to niche markets, though mainstream databases have adopted some features and strategies of these systems (such as using object-oriented strategies for handling select complex data types like multimedia).

More recent efforts to surmount performance and scaling barriers and develop a more agile data model and more scalable data storage have resulted in new non-relational database structures. These include:

• Key-value stores
• Document databases
• Wide-column stores
• Graph databases

These types of databases are collectively referred to as NoSQL databases [Leavitt, 2010] (for "Not Only SQL," a reference to the standard querying language for relational databases), distributed data stores, schemaless databases, or VLDBs (for Very Large Databases, as many of these alternatives are reviewed in the conferences and journals of the non-profit VLDB Endowment). Though each label is less than ideal, we'll use the most common, NoSQL databases, for this book.

Despite their structural differences, NoSQL databases share several primary characteristics with each other, and with search engines. They all:

• Represent data as key-value pairings stored in column-oriented indexes,
• Use distributed architectures (supported by an extensive use of meta-indexes to support partitioning and sharding) to overcome the performance and scaling limitations of relational databases,
• Emerged from the Web, or use Web-derived technologies (prime agents including Internet giants like Amazon, Facebook, Google, LinkedIn and eBay), and
• Relax consistency and integrity requirements to improve performance (following the CAP theorem that you cannot achieve Consistency, Availability and Partition-Tolerance at the same time).4

Below are snapshot views of the data models and storage architectures employed by each of these types of non-relational databases.5

4 See Brewer [2000], and Gilbert and Lynch [2002].
5 A list of NoSQL databases is found here: http://nosql-database.org.
Key-Value Stores
Figure 4.9: Key-Value Stores enable ultra-rapid retrieval of simple data
Typically, these databases map simple string keys to string values in a hash table structure (Figure 4.9). Some support values in the form of strings, lists, sets or hashes where keys are strings and values are either strings or integers. None support repetition. For most, querying is performed against keys only, and limited to exact matches (e.g., I can search for 'Prod_123' but not 'Zapito'). Others support operations like intersection, union, and difference between sets, and sorting of lists, sets and sorted sets. Examples include Voldemort (LinkedIn), Redis, SimpleDB (Amazon), Tokyo Cabinet, Dynamo, Riak and MemcacheDB6 [DeCandia et al., 2007].

Document Databases

Document databases are key-value stores as well, but the values they contain are semi-structured and can be queried. Figure 4.10 shows a simple example with multiple attribute name/value pairs in the Value column.
Figure 4.10: Document Databases contain semi-structured values that can be queried. The number and type of attributes per row can vary, offering greater flexibility than the relational data model.

6 project-voldemort.com, code.google.com/p/redis, aws.amazon.com/simpledb, fallabs.com/tokyocabinet, www.dynamocomputing.com, wiki.basho.com/display/RIAK, memcachedb.org
Document databases include XML databases, which store data as XML documents. Many support joins using JavaScript Object Notation (JSON) or Binary JSON (BSON).7 Examples include CouchDB and MongoDB (JSON/BSON) and MarkLogic, Berkeley DB XML, MonetDB (XML databases).8

Wide Column Databases

These structures, sometimes called BigTable clones as most are patterned after the original Bigtable [Chang et al., 2006], Google's internal storage system for handling structured data, can be thought of as multi-dimensional key-value pairs (Google's own definition: "a Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map"). Figure 4.11 offers a simplified representation of such a map.
Figure 4.11: Wide Column Databases are multi-dimensional key-value stores that can accommodate a very large number of attributes. They offer no native structure for determining relationships or joining tables.
In essence, a Bigtable structure is made up of individual 'big tables' which are like giant, unnormalized database tables. Each can house a huge number of attributes, like SBA engines, but there is no native semantic structure for determining relationships and no way to join tables, though as with SBA engines, duplicates are allowed. In the case of BigTables, this duplication includes duplicate rows, as with the scooter price above. Timestamps are used to distinguish the most recent data while supporting historical queries and auto garbage collection (a strategy likewise employed by some XML databases).

7 www.json.org and bsonspec.org
8 couchdb.apache.org, www.mongodb.org, www.marklogic.com, www.oracle.com/technetwork/database/berkeleydb, monetdb.cwi.nl
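Google's one-sentence definition can be paraphrased almost directly in code. A minimal sketch in Python of a sparse, multi-dimensional sorted map, ignoring the distributed and persistent parts (the keys and values are invented for illustration):

from itertools import count

bigtable = {}    # row key -> column -> timestamp -> value (absent cells cost nothing)
clock = count()  # monotonically increasing stand-in for wall-clock timestamps

def put(row, column, value):
    bigtable.setdefault(row, {}).setdefault(column, {})[next(clock)] = value

def get_latest(row, column):
    versions = bigtable[row][column]
    return versions[max(versions)]  # timestamps distinguish the freshest value

put("prod_123", "price", "400")
put("prod_123", "price", "350")         # a duplicate entry; the old version is kept
print(get_latest("prod_123", "price"))  # 350, while 400 remains queryable historically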
Examples in addition to Google's Bigtable include Cassandra (Facebook), HBase, Hypertable and Kai.9

Graph Databases

Graph databases [Angles and Gutierrez, 2008] replace relational tables with structured relational graphs of one-to-many key-value pairs. They are the only one of the four NoSQL types discussed here that concern themselves with relations. A graph database considers each stored item to have any number of relationships. These relationships can be viewed as links, which together form a network, or graph. These graphs can be represented as an object-oriented network of nodes, relations and properties.
Figure 4.12: Graph databases are more concerned with the relationships between data entities than with the entities themselves.
While accessing such relationships is very useful in many applications (consider social networking, for example), querying graph databases is typically slow [Cheng et al., 2009] since graph structures have to be matched, even if the volume of data that can be treated is very high. Examples of graph databases include Neo4j [Vicknair et al., 2010], InfoGrid [Giannadakis et al., 2010], Sones, VertexDB, and AllegroGraph.10

9 cassandra.apache.org, hbase.apache.org, hypertable.org, sourceforge.net/projects/kai/
10 neo4j.org, infogrid.org, sones.com/home, dekorte.com/projects/opensource/vertexdb, franz.com/agraph/allegrograph
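A minimal sketch in Python of the graph model: each item is a node with any number of typed relationships, and answering a question means traversing (matching) the graph rather than joining tables. The nodes and edges are invented for illustration.

from collections import deque

graph = {  # node -> list of (relation, neighbor) pairs
    "Alice": [("knows", "Bob")],
    "Bob":   [("knows", "Carol"), ("works_at", "Acme")],
    "Carol": [("works_at", "Acme")],
}

def friends_of_friends(person):
    """Breadth-first traversal over 'knows' edges, two hops out."""
    found, frontier = set(), deque([(person, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == 2:
            found.add(node)
            continue
        for relation, neighbor in graph.get(node, []):
            if relation == "knows":
                frontier.append((neighbor, depth + 1))
    return found

print(friends_of_friends("Alice"))  # {'Carol'}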
The column-oriented, key-value centered distributed storage model used by all of these NoSQL databases is the key to their massive scalability and IR processing speed, and it represents the major point of convergence with search engines. Information retrieval needs have evolved, and both search engines and NoSQL databases are filling IR needs unmet by relational database-centered IR. However, significant structural differences between these two technologies remain in structured data handling, the use of semantic technologies, and querying methodologies. We'll explore these differences in the next two chapters.
CHAPTER 5

Data Collection/Population

At A Glance

Characteristic   Search Engine     Databases
Primary method   Crawlers          Direct writes, ETL (connectors)
Pre-processing   Not required      Required
Data freshness   Quasi-real-time   24hrs+ for data warehouses

5.1 SEARCH ENGINES

5.1.1 COLLECTION
Early Web search engines used a single primary tool to collect data, a software program called a crawler [Heydon and Najork, 1999]. The crawler would connect to a website, capture the text it contained along with basic metadata like page titles, content headers or sub-headers, etc. (sending the information collected back to central server(s) for indexing), and then follow the hyperlinks from one page to the next in an unending circuit across the Web. Aside from some basic, automated formatting and clean up (for example, removing HTML tags or double whitespaces), no pre-processing was required for the data collected – it was a straight take-it-as-found mode of operating.
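A minimal sketch of this crawl loop in Python, using only the standard library (the seed URL is a placeholder; a real crawler adds politeness rules, robots.txt handling, and parallelism):

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href targets of <a> tags encountered in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed, limit=10):
    frontier, seen = [seed], set()
    while frontier and len(seen) < limit:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue                    # unreachable page: skip it
        yield url, html                 # hand the captured text to the indexer
        parser = LinkParser()
        parser.feed(html)
        frontier.extend(urljoin(url, link) for link in parser.links)

for url, html in crawl("http://example.com"):  # placeholder seed URL
    print(url, len(html))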
5.1.2 UPDATING
Search engines employ varying update strategies [Wolf et al., 2002] according to available resources and editorial or business objectives. Some simply crawl the entire Web on a fixed schedule (biweekly, monthly, etc.), re-indexing content as they go; others employ strategies based on factors like the expected frequency of changes (based on prior visits or site type, such as news media) and site quality heuristics. Whatever strategy is used to achieve optimal freshness, search engines are designed for incremental, differential data collection and index updating, and there is no technical barrier to high performance search engines performing quasi-real-time updates for billions of pages.1

1 Prior to its recent migration from a MapReduce to a BigTable index architecture, Google employed a vertical strategy for updating some portions of its index more frequently than others. This kept the index relatively fresh for certain types of sites like major news outlets, but the intensive batch-style processing within MapReduce impacted index freshness even for these frequently crawled sites. Under the new architecture, Google has moved much closer to a global, near real-time incremental update strategy. See http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html.
5.2 DATABASES

5.2.1 CREATION/COLLECTION
Databases typically occupy a middleware position in IT architecture, receiving data through discrete data inputs by human users through front-end applications, through individual writes by other applications, or through batch imports (automated or manual). Automated batch transfers are usually accomplished with the aid of ETL tools (for Extract, Transform and Load).
Figure 5.1: Databases occupy a middleware position in conventional IS architectures, and are at the core of all data creation, storage, update, delete and access processes.
Because databases were primarily designed to record business transactions, they feature a host of tools designed to ensure the accuracy and referential integrity of data [Bernstein and Newcomer, 2009, Chapter 6]. Examples include being able to roll data back to its original state if a transaction is interrupted mid-stream (for example, by a power failure), rejecting data that is not of a consistent type or format (for instance, enforcing a uniform representation of monetary amounts), preventing other users or processes from accessing a table when a user is creating or modifying a record to prevent conflicts (transaction locking), and being able to recover a successfully completed transaction from a log file in case of a system failure. These types of constraints are referred to as ACID constraints, for Atomicity, Consistency, Isolation, and Durability. To ensure consistency during batch transfers, ETL tools can be used to 'clean' data, that is to say, to render it consistent with the target database's structure and formatting conventions (with a fair amount of manual mapping and configuration). If the data to be imported is simply inconsistent with the target database's data model, that model must be revised before the transfer can proceed.
5.2.2 UPDATING
Databases typically use differential update processes, inserting, updating or deleting data as received by user or system inputs. However, for large central repositories like data warehouses, processing such real-time incremental changes can be slower than a complete transfer and rebuild. Whether such systems employ a differential or rebuild strategy, resource-intensive updates are typically executed in batch operations scheduled for off-peak hours, typically once a day, to avoid performance bottlenecks. The downside of this practice is that users who rely on business intelligence systems built on data warehouses and marts have to make decisions based on data that is 24 hours old (or older, depending on the scale and complexity of the database systems).
5.3 WHAT HAS CHANGED

5.3.1 SEARCH ENGINES
As search engines were pushed to accommodate a wider variety of data, they developed software interfaces called connectors that enabled them to access and acquire new types of content. These include file system connectors (to sequentially read and index document repositories like enterprise file servers), messaging connectors (for connecting to enterprise email systems), and, for Search Based Application engines, database connectors (using the Open Database Connectivity - ODBC, or Java Database Connectivity - JDBC, protocols).2 Many engines now also feature a Push Application Programming Interface (API) that supports custom connectors developed in standard programming languages and typically communicating with the engine via HTTP protocols.

As a result, there is now a generation of search engines that can connect to virtually any information source, and process virtually any type of content: unstructured (text documents, presentations, multimedia files), semi-structured (email messages, HTML, XML), and structured (databases, logs, spreadsheets).3 And in an advance important to the development of SBAs, latter generation engines can not only ingest the data contained in structured systems, they can capture and exploit the data schema employed by the source system - still with no pre-processing other than a basic metadata mapping in a push API or database connector. The schematic information extracted is not only represented within the index, it can also optionally be used to guide the entire indexing and query processing chain, much as earlier engines used external taxonomies and ontologies to guide the indexing of text (more on this in Chapter 6, Data Processing).

In addition to expanding the range and semantic depth of information a search engine can ingest, this framework of crawlers, connectors and APIs has given businesses considerable control over data freshness. Indexes continue to be updated in a differential, incremental process—with the rules and schedules being separately configurable for crawlers and connectors—and updates can be performed with little to no impact on source systems.4 This means engines can index data directly from source systems rather than centralized repositories, and data freshness can be quasi-instantaneous even for large data volumes, or scheduled as desired to optimize cost and performance. Even in databases, the idea of data freshness can vary, with data being updated once a day (for example, for calculating sales) to real-time (as in the stock market). SBA platforms can download data from databases on a regular schedule (every 15 minutes, once an hour, etc.). Accordingly, deploying an SBA platform alongside a data warehouse can drop data latency from 24 hours to near real-time. And once the data is loaded in the index, unlimited users can query the data at will with zero impact on underlying production databases—a significant advantage given that access requests typically far outnumber write operations.

2 See support.microsoft.com/kb/110093, download-llnw.oracle.com/javase/tutorial/jdbc/basics/
3 See http://incubator.apache.org/connectors/ or http://www.exalead.com/software/common/pdfs/products/cloudview/Exalead-Connectors-and-Formats.pdf.
4 Generally, such update requests have no impact on production databases. However, if a request does put a load on a database, it is a known load, and the database administrator can spend whatever time is required to optimize the process, or can simply schedule it at an appropriate time or rate.
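As an illustration of the push model, here is a minimal, hypothetical sketch in Python of a custom connector that reads recently changed rows from a database and posts them to an engine's push API over HTTP. The endpoint URL, table schema, and JSON document format are all invented for the example; real push APIs differ by vendor:

import json
import sqlite3
from urllib.request import Request, urlopen

ENGINE_URL = "http://search-engine.local/push/documents"  # hypothetical endpoint

def push_changed_rows(db_path, since):
    """Read rows changed since the last run and push them to the engine's index."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT prod_id, name, price FROM products WHERE modified > ?", (since,))
    for prod_id, name, price in rows:
        doc = json.dumps({"id": prod_id, "name": name, "price": price})
        req = Request(ENGINE_URL, data=doc.encode("utf-8"),
                      headers={"Content-Type": "application/json"})
        urlopen(req)  # incremental update: only changed rows are re-indexed

# Scheduled, e.g., every 15 minutes with the timestamp of the previous run:
# push_changed_rows("erp.db", since="2010-11-01T12:00:00")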
5.3.2 DATABASES
The basic procedures for loading data into relational databases, meanwhile, have not changed, though ODBC connectors are extensively used for data consolidation, and the number and sophistication of ETL tools has increased. High end ETL tools5 now have the capacity to load structured, unstructured, and semi-structured data into a relational database management system (RDBMS), though this is typically accomplished via a companion search engine. Whatever its original format, the data to be loaded must still conform to the RDBMS data schema, and latency remains an issue as long as primary access is delivered via a centralized data warehouse.

5 See www.etltool.com
5.3. WHAT HAS CHANGED
33
workloads.8
over nonpartitionable SBA engines don’t require such compromises because they are intended to complement, rather than replace, relational database systems (though they could be used as replacements in certain access-oriented contexts).
8 For an example of the research being done on scaling ACID-compliant OLTP systems on distributed, shared-nothing architectures, see http://db.cs.yale.edu/determinism-vldb10.pdf [Thomson and Abadi, 2010].
CHAPTER 6

Data Processing

At A Glance

Characteristic         Search Engine                 Databases
Processing             Natural language processing   Data processing
Principal technology   Semantics                     Data mapping

6.1 SEARCH ENGINES
Many search engines prepare extracted content for indexing through a two-step process: natural language processing, and assignment of relevancy criteria. Natural Language Processing serves three purposes: normalizing away linguistic variations before a document is indexed, recognizing structure in text such as noun phrases that should be indexed as a unit, and typing the structures found, identifying them, for example, as persons, places or things. These typed, normalized features are then indexed. The index contains pointers to where the features were found: in what document, in what sentence, and in what position. In addition to these positions, weights are also assigned to each feature, with rarer features receiving higher weights, to be used during the calculation of the relevancy of a document to a query.
6.1.1 NATURAL LANGUAGE PROCESSING
After documents consisting of text and basic metadata (e.g., URL, file size, file type, modify/creation/extraction date, etc.) have been extracted from source systems and saved in the designated storage structure (partial/complete cache; in-memory or remote), the unstructured textual parts are prepared for indexing via simple tokenization or more elaborate natural language processing (NLP). This preparation identifies indexable key words within documents and normalizes them in a process that can include these NLP steps:

Language Detection
Using statistical sampling of common sequences of letters, the engine determines in which language a document is written [Grefenstette, 1995]. For example, a text containing a word ending in -ack is more likely to be English than French, and a word ending in -que is more likely to be French than English. Cumulative statistics over all the letter sequences found in a piece of text are used to decide the language.

Tokenization
Next, the text is tokenized [Grover et al., 2000], or split, into a sequence of individual words using language-specific grammar, punctuation and word separation rules. Tokenization of input text solves simple problems such as stripping punctuation from words, but it can be complicated by the fact that certain languages include punctuation inside words (for example, the French word aujourd'hui (today)) and certain words may contain punctuation (such as C++, a programming language). Tokenization also recognizes sentence boundaries so that a document containing the fragment "...in New York. Researchers from..." won't be returned as an exact match for a phrasal query on "New York Researchers."

Stemming and Lemmatization
Recognized tokens are further normalized either through simple suffix removal, or by morphological analysis. A stemming module [Baeza-Yates and Ribeiro-Neto, 2010, Chapter 7] will apply language-specific suffixing rules to remove common suffixes, for example, removing a trailing -s from "items" to form the stem "item," which is then indexed.1 More elaborate, linguistically-based morphological analysis (also called lemmatization) uses dictionaries and rules [Salton, 1989] extensively to identify more complex variants, transforming "mice" into "mouse" for indexing.

Part of Speech Tagging
Optional, and more computationally expensive, tagging modules [Manning and Schütze, 1999, Chapter 10] can be used to improve lemmatization by first identifying the part of speech of a word, determining by context if a term like "ride" is being used in the source document as a verb, not a noun, so that the term can be mapped to appropriate variants like "rode" and "riding." Some search engines allow choosing whether to index words based on their part of speech.

Chunking
Part of speech tagged text can be parsed or chunked [Abney, 1991] to recognize units, such as noun phrases, which can be indexed as a unit, or stored with the document as additional features. For example, from the text "... information from barcode scanners are downloaded to the ...", the normalized noun phrase barcode_scanner can be recognized by parsing and added as an additional feature in the index.

1 Stemming, Lemmatization, and Part of Speech Tagging may likewise be applied at query time as well as during indexing.
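A minimal sketch in Python of the first stages of such a pipeline, with a toy tokenizer and a crude suffix-stripping stemmer (production engines use the full language-specific rules and dictionaries described above):

import re

def tokenize(text):
    """Split on letters, keeping internal apostrophes (e.g., aujourd'hui)."""
    return re.findall(r"[a-zà-ÿ]+(?:'[a-zà-ÿ]+)?", text.lower())

def stem(token):
    """Crude English suffix removal, in the spirit of a stemming module."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "Researchers in New York are downloading items from barcode scanners."
print([stem(t) for t in tokenize(text)])
# ['researcher', 'in', 'new', 'york', 'are', 'download', 'item', 'from',
#  'barcode', 'scanner']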
6.1.2 RELEVANCY CRITERIA
For both Web and enterprise engines, the indexing process also includes computation of general ranking and relevancy scores for the document as well as the assignment of relevancy-related metadata that can be used at query time to determine relevance in the context of a specific query. This may include, for example, the position and frequency of a given term within a document [Clarke and Cormack, 2000], a weighting for the 'credibility' of the source [Caverlee and Liu, 2007], or the GPS coordinates of the document for use in a map-based search, among others. On the Web, document processing may include calculating the number and quality of external links to the document (or Web page) and the frequency of page updates [Page et al., 1998]. Once Natural Language Processing and relevancy assignments are complete, documents are automatically mapped to high level index fields (like language or mime_type) and written to the index.2

2 One also says ingested into the index.
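A minimal sketch in Python of the weighting idea using generic tf-idf, one classic formula among many (each engine uses its own mix of signals; the documents are invented for illustration):

import math

docs = {  # already tokenized and normalized
    "doc1": ["dogs", "chase", "cats", "dogs"],
    "doc2": ["cats", "sleep"],
}

def tf_idf(term, doc_id):
    tf = docs[doc_id].count(term) / len(docs[doc_id])
    df = sum(1 for terms in docs.values() if term in terms)
    idf = math.log(len(docs) / df)  # rarer terms receive higher weights
    return tf * idf

def score(query, doc_id):
    known = [t for t in query if any(t in terms for terms in docs.values())]
    return sum(tf_idf(t, doc_id) for t in known)

print(score(["dogs", "cats"], "doc1"))  # ~0.347: "dogs" is rare and frequent here
print(score(["dogs", "cats"], "doc2"))  # 0.0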
6.2 DATABASES
Conventional relational databases do not apply linguistic analysis when preparing data to be written to database tables. Instead, natural language processing is replaced by conventional data processing. The data is mapped, formatted and loaded according to strictly defined structures and procedures. These mappings must traditionally be established manually for each source. For example, in the case of a structured source, the source system may record customer gender in a "Gender" column, with cells containing "Female" or "Male" values, and the target system may use a "Sex" column with cells containing "1" for male and "2" for female. The administrator would need to configure the connector, API or ETL tool to map these columns as well as to convert the textual data to its numeric counterparts. For data entry by end users of applications built on databases, this mapping would be handled by the application programming. For example, the user may select "Female" or "Male" from a pulldown menu, but the backend programming would pass values of "1" or "2" to the SQL script in charge of writing data to the database. At the time of the load (or SQL commit), data validation is applied. If validation fails, it may result in full rejection of the transaction.
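A minimal sketch in Python of the kind of mapping and validation logic such a connector or ETL tool applies, following the Gender/Sex example above (the codes are purely illustrative):

SEX_CODES = {"Male": 1, "Female": 2}  # target system encoding

def transform(source_row):
    """Map a source 'Gender' value to the target 'Sex' code, validating on the way."""
    gender = source_row["Gender"]
    if gender not in SEX_CODES:
        raise ValueError(f"validation failed, transaction rejected: {gender!r}")
    return {"Sex": SEX_CODES[gender]}

print(transform({"Gender": "Female"}))  # {'Sex': 2}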
6.3 WHAT HAS CHANGED

6.3.1 SEARCH ENGINES
There are three main evolutions in search engine data processing that enable the unified access to information found in Search Based Applications:

1. An expansion of the statistical calculations performed on source data
2. A capacity to extract the semantic information contained in structured data
3. An expansion of the range and depth of semantic processing on all data

As search engines entered the enterprise, there was a marked expansion of the statistical calculations applied to 1) develop meaningful ranking criteria in the absence of Web-style popularity measures [Fagin et al., 2003], and 2) support dynamic categorization and clustering3 to provide task-oriented navigation and refinement of search results. To meet these two needs, Search Based Application engines began to apply counts to every indexable entity and attribute (references to a particular person in an email system, number of products in a database with a particular attribute, regional sales from an ERP, etc.). In SBAs, the capacity to organize information into categories and clusters is being used for generic, database-style content presentation and navigation (rather than simply search results refinement), and the wealth of statistical calculations intended to support relevancy is being repurposed to generate ad hoc reporting and analysis across all data facets, as will be covered in more depth in the next chapter.

Next, advances in connector technology [Kofler, 2005, Reese, 2000] are enabling search engines to retrieve data schemas from structured systems and to ingest these schemas as indexable entity attributes, as well as employing them for high level index organization. The ability to capture, access and exploit the rich semantic metadata encapsulated in structured data is the prime reason search engines now provide an effective means of database offloading (see Figures 6.1 and 6.2).

The next SBA-enabling evolution is a significant extension of the depth and range of natural language processing [Manning and Schütze, 1999] employed by SBA engines. These expanded semantic capabilities, coupled with basic mapping supplied in configuring APIs and connectors, enable SBA engines to:

• Effectively structure unstructured content
• Enrich data with meanings and relationships not reflected in source systems
• Meaningfully aggregate heterogeneous, multi-source content (non-structured and/or structured) into a coherent whole

This expansion is also the reason the 'document' in the search document model has been able to evolve from a literal document to a meaningful unit of business information. Beyond the basic Natural Language Processing functions outlined earlier (Language Detection, Tokenization, Stemming, Lemmatization and Part of Speech Tagging), semantic analysis today may include:

Word Sense Disambiguation
Using contextual analysis [Gabrilovich and Markovitch, 2009], for example, to determine if a "crane" in a given text is a type of bird or a type of machine, using the words found around the word "crane" in the document and comparing these to word sense models created ahead of time. The word might then be indexed as "crane (bird)" or "crane (machine)".

3 See, for example, http://demos.vivisimo.com/projects/BioMed
Figure 6.1: SBA engines use database connectors plus natural language processing to transform relational database models into a column-oriented document model.

Standard and Custom Entity Extraction
Identifying, extracting and normalizing entities like people, places, organizations, dates, times, physical properties, measurements, etc., a process aided by dictionaries or the use of thesauruses paired with extraction rules [Kazama and Torisawa, 2007].

Dependency Analysis
Identifying relations between words in a sentence, such as the subjects or objects of verbs.4 Dependency parsing helps to isolate noun phrases that can be indexed as a unit as well as showing relations between entities.

Summarization
Producing, for example, a shorter text sample that would contain much of the important information in a longer text [Mani and Maybury, 1999]. Search engines generally post snippets in which the important query words were found close together. Summarization of longer documents can also be done independently of a query, by calculating the most important (frequent) words in the documents and pulling out a number of sentences that are dense in these important words.

Figure 6.2: The entity attributes and relationships extracted from the database remain discretely accessible and exploitable.

Pronoun Resolution
Determining, for example, who the words "they" or "this person" refer to in a text [Mitkov, 2001]. Pronoun resolution connects these anaphoric references with the most likely entities found in the preceding text.

Event and Fact Extraction
Recognizing basic narrative constructions: for example, determining if a text describes an attack (even if the word "attack" was not used in the text) and determining when it took place, who was involved, and whether anybody was hurt [Ji and Grishman, 2008]. Event extraction, which often relies on dependency extraction, can provide a database entry-like summary of a text.

Table Extraction
Recognizing tables in text [Cafarella et al., 2010]. This can help type entities since the column heading often provides the type of the entities found in the column. It also can be used to attach attributes to these entities, for example, the team that a basketball player plays for. There is much research into transforming data found in semi-structured tables into Linked Open Data [Bizer et al., 1998].

Multimedia Analysis
Extracting semantic information beyond standard metadata like titles and descriptions; for instance, using automatic speech-to-text transcription to open access to speech recordings and videos [Lamel and Gauvain, 2008], or applying object recognition processing to images, such as face detection or color or pattern matching [Lew et al., 2006].

Sentiment Analysis
Deciding if a text is positive or negative, and what principal emotive sentiments it conveys [Grefenstette, 2004]. This is done through lexicons of negatively and positively charged words. Current techniques use specific lexicons for each domain of interest.5

Dynamic Categorization
Determining from context if, for example, the primary subject of a document is medicine, or finance, or sports, and tagging it as such. Documents may be clustered according to these dynamic categories, and one may apply further classification technologies to these clusters to organize them into a meaningful structure [Russell and Norvig, 2009, Chapter 18], [Qi and Davison, 2009]. Examples include:

• Rule Based Classification (supervised classification using a model ontology and document collection). Rule Based Classification may include the use of 'fuzzy' ontology matching - using contextual analysis, synonyms, linguistic variants - to more effectively identify common entities across heterogeneous source systems.
• Bayesian Classification (unsupervised classification based on statistical probabilities)
• SVM (Support Vector Machine) Classification (another form of supervised statistical learning)

4 http://nlp.stanford.edu/software/lex-parser.shtml
5 See http://www.urbanizer.com, which implements domain lexicons for restaurants' cuisine, service and atmosphere.
Relationship Mapping
Using the principle of "co-occurrence" to map relationships between people, objects, places, events, etc.6 For example, if person A and person B are mentioned in one text, and person B and C in another, there is tangential evidence of a relationship between persons A and C.

These and other types of semantic analysis tasks all involve deciding when two things are the same, or applying some label (metadata) to a piece of information to better represent its meaning and context, or showing the relationship between two things. Commercial search engines7 may contain up to 20 different semantic processors to extend and enrich content. Accordingly, the depth and complexity of meanings and relationships a modern semantic engine can unearth and capture exceeds that of most database systems. Moreover, these relationships are organic rather than pre-defined, which means they can evolve as underlying data evolves.

6 See, for example, http://labs.exalead.com/experiments/miiget.html
7 Attivio, Autonomy, Endeca, Exalead, Funnelback, Sinequa, Vivisimo, among others.
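A minimal sketch in Python of the co-occurrence principle behind relationship mapping (the texts and entity names are invented for illustration):

from collections import defaultdict
from itertools import combinations

mentions = [  # entities extracted from two texts
    {"Person A", "Person B"},
    {"Person B", "Person C"},
]

links = defaultdict(set)
for entities in mentions:
    for x, y in combinations(sorted(entities), 2):  # direct co-occurrence
        links[x].add(y)
        links[y].add(x)

def tangential(person):
    """Neighbors-of-neighbors who are not already direct neighbors."""
    return {z for y in links[person] for z in links[y]} - links[person] - {person}

print(tangential("Person A"))  # {'Person C'}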
6.3.2 DATABASES
In general, databases continue to follow strict, top-down data processing procedures. These procedures are part and parcel of the core strength of databases: ensuring data integrity. What has changed, however, is that conventional databases are increasingly employing Natural Language Processing to aid and automate the mapping processes during data integration, especially when dealing with very large databases. To return to our prior example, a natural language processor may be used within an ETL tool to automatically map the "Gender" and "Sex" columns or, at least, to identify them as possible matches for subsequent human review.8 To date, however, the NLP tools used in such processes are limited, and advanced semantics are not employed at the processing level, though the semantics inherent in the database structure itself can, of course, be exploited to great advantage during the retrieval process.

With the exception of relationship mapping in graph databases (such as Neo4j), NoSQL systems (key-value, document, and wide-column stores) likewise employ NLP in a limited manner. They do not support full text indexing (or, consequently, full text searching), nor the automatic categorization and clustering that enables faceted search and navigation, nor the semantic tagging that supports fuzzy query interpretation and matching. In addition, they do not support the ranking and relevancy calculations that are being used to provide reporting and analysis in SBAs. In fact, both RDBMS and NoSQL systems are typically paired with a search engine (external or embedded) to deliver these capabilities, or with an RDBMS for categorization, reporting and analysis.

8 See, for example, http://www.megaputer.com/polyanalyst.php
CHAPTER 7

Data Retrieval

At A Glance

Characteristic                  Search Engine                 Databases
Read Pattern                    Column                        Row
Query method                    Natural language              SQL commands
Algebraic, numeric operations   No                            Yes
Filtering                       Post (Ranking/relevancy)      Pre (Exact match)
Query Interface                 Unique query box              Form-based interface
Data output                     Ranked lists, visualisation   Limited by data and structures

7.1 SEARCH ENGINES

7.1.1 QUERYING
The traditional role of search engines is to help human users locate documents (Web pages, text files, etc.). To do so, a user enters a natural language question or one or more key words in a search text box. Similar to the content indexing pipeline, the search engine first parses the user’s request into individual words (tokenization), then identifies possible variants for these words (stemming and lemmatization) before identifying and evaluating matches. There are three basic types of queries for conventional search engines: Ranked Queries, Phrase Queries and Boolean Queries. • Ranked Queries Ranked Queries produce results that are ordered, ranked, according to some computed relevancy score. On the Web, this score includes lexical, phonetic or semantic distance factors and ‘popularity’ factors (like the number and type of external links pointing to the document [Brin and Page, 1998]). Ranked queries are also called top-k queries [Ilyas et al., 2004] in the database world. • Phrasal Queries Phrasal queries are ranked queries that take word order into account to make the best match. For a query like “dog shelters New York,” the engine would give a higher score to a document entitled “List of dog shelters in New York City” than to a document entitled “York Supervisor
Complains New Emergency Shelter Not Fit for a Dog." The possibility of phrasal queries implies that the positions of the words inside each input document are also stored in the index.

• Boolean Queries
Search engines also support Boolean queries, which contain words plus Boolean operators like AND, OR and NOT, and may employ parentheses to show the order in which relationships should be considered:

(crawfish OR crayfish) AND harvesting
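To make the role of stored word positions concrete, here is a minimal sketch of a positional inverted index supporting phrase matching (the tokenization and index layout are simplifications for illustration, not any particular engine's implementation):

from collections import defaultdict

def build_positional_index(docs):
    # Map each term to {doc_id: [positions]} so phrase order can be checked
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):  # naive tokenization
            index[term][doc_id].append(pos)
    return index

def phrase_match(index, phrase):
    # Return doc_ids where the phrase terms occur consecutively
    terms = phrase.lower().split()
    candidates = set(index[terms[0]].keys())
    for t in terms[1:]:
        candidates &= set(index[t].keys())  # Boolean AND on posting lists
    hits = []
    for doc_id in candidates:
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits

docs = {1: "List of dog shelters in New York City",
        2: "York Supervisor Complains New Emergency Shelter Not Fit for a Dog"}
idx = build_positional_index(docs)
print(phrase_match(idx, "new york"))  # -> [1]; doc 2 has both words, out of order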
7.1.2 OUTPUT
For conventional search engines, results are output as ranked lists on a Search Engine Results Page, or SERP (Figure 7.1). These lists typically contain document titles, content snippets or summaries, hyperlinks to source documents, and basic metadata like file size and type. Users scan the list and click through to source documents until they find what they need (or rephrase or abandon their search). A whole industry, Search Engine Optimization (SEO), exists around improving a page's ranking in these lists from the point of view of the web page author or owner.1
7.2 DATABASES

7.2.1 QUERYING
Relational databases are queried for a broad range of purposes: to perform a task (e.g., process an order), locate specific data (e.g., look up an address), display content, or analyze information (e.g., generate and review reports by various data dimensions). These functions are performed using Structured Query Language (SQL) commands. SQL2 is a robust query language permitting a wide range of operations for both numerical and textual data. Retrieval options include comparisons (=, <>, >, <, >=, <=, BETWEEN, LIKE, IN, AND, OR), table combinations (JOIN, LEFT JOIN, RIGHT JOIN, INNER JOIN, FULL JOIN, UNION), mathematical calculations (+, −, ∗, /, %), and results sorting operations (ORDER BY, DESC (descending), ASC (ascending), and LIMIT). Here is a simple SQL example for retrieving a list of all employees working at a particular location:

SELECT first_name, last_name
FROM employees
WHERE office_location = 123

SQL commands may be entered directly in a relational database management system (RDBMS) by administrators, submitted to the RDBMS by other software applications, or passed by forms manipulated by end users. These forms, which are often quite complex, are used to sequentially construct a back end query that conforms to the database's structure.

1 Google provides a search engine optimization starter guide at http://bit.ly/aHgG3Y
2 Numerous tutorials for SQL can be found online, e.g., http://sqlcourse2.com/.
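For readers who want to experiment, the following self-contained sketch exercises several of the retrieval options listed above (a JOIN, a comparison filter, an aggregation and result ordering) using SQLite from Python; the tables, columns and values are invented for illustration:

import sqlite3

# In-memory demo database with two related tables (names invented)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offices (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE employees (id INTEGER PRIMARY KEY, last_name TEXT,
                        salary REAL, office_id INTEGER REFERENCES offices(id));
INSERT INTO offices VALUES (123, 'Paris'), (124, 'Lyon');
INSERT INTO employees VALUES (1, 'Dupont', 52000, 123),
                             (2, 'Martin', 48000, 123),
                             (3, 'Bernard', 61000, 124);
""")

# JOIN + comparison filter + aggregation + ordering in one query
for row in conn.execute("""
    SELECT o.city, COUNT(*) AS headcount, AVG(e.salary) AS avg_salary
    FROM employees e
    JOIN offices o ON e.office_id = o.id
    WHERE e.salary >= 48000
    GROUP BY o.city
    ORDER BY avg_salary DESC"""):
    print(row)  # ('Lyon', 1, 61000.0) then ('Paris', 2, 50000.0)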
7.3. WHAT’S CHANGED?
45
Figure 7.1: Search Engine Results Page (SERP)
7.2.2 OUTPUT
The raw output for all SQL queries is essentially a structured list of results. The formatting and presentation of these results is handled by the application layer, and can take virtually any form (tabular lists, pie charts, graphs, formatted Web pages, etc.). However, the range of possible operations and data views is constrained by the database’s pre-defined data model and pre-defined SQL queries (though some level of dynamic querying can be built into the application layer).
7.3 WHAT'S CHANGED?

7.3.1 SEARCH ENGINES
Querying
Search engines continue to process natural language queries and conduct full text search, but three changes have occurred to advance search beyond the SERP and into the world of database-style querying and output.
Figure 7.2: Classical form-based database query interface. The user needs to know what each of the fields means, and into which field they must enter the information they are looking for.
1. APIs for System-to-System Queries
First, the use of access APIs (Application Programming Interfaces) has enabled Search Based Application engines to assume an infrastructure rather than application layer position.3 This means queries can be accepted and results made available to other software applications, rather than being confined to simple text box key word entries by end users. Typical Search APIs use a standard Internet HTTP protocol, support multiple programming languages (Java, .NET, PHP, Ruby, Python, Perl, etc.) and often offer multiple interfaces (SOAP, REST, RSS, POST XML, etc.).

2. More Flexible, SQL-Like Querying
Second, while querying in search engines is still based on natural language, SBA engines support user and application queries containing textual, numerical, and symbolic constraints, with continuing extensive Boolean operator support. The result is more robust, SQL-like operations (including JOINs) against both structured and unstructured data, with the flexibility to apply business rules to query processing.

3. Semantics & Business Rules Applied to Query Interpretation

3 Information systems are commonly described as having four layers: the user interface layer that a user interacts with, the application layer which performs required actions on the information, the domain layer which organizes the business information, and the infrastructure layer that does low level message passing over the network.
7.3. WHAT’S CHANGED?
47
Figure 7.3: Since modern search engine query languages also support complex structured querying (SQL-like, with a variety of operators), search based applications can also provide form-based interfaces alongside the single search box, allowing users experienced with form-based input in legacy database systems to transition seamlessly. This screenshot shows the advanced search option for a customer service SBA.
Third, the use of semantic technologies enables a level of query interpretation previously unattainable. For example, an SBA engine can use natural language processing to break apart and analyze a search request, much as it does when processing content for indexing, and apply a host of fuzzy matching operations including:

• Approximate spelling ("hyphen" will match "hyphene")
• Phonetic spelling ("hyphin" will match "hyphen")
• Word truncations ("rob*" will match "robust", "robot" and "robin")
• Regular expressions ("/re.ort/" will match "report" and "retort")
• Semantic expansion ("plane" will match "aircraft")
• Rules-based matching (matches may be filtered according to specific business rules and constraints)

These changes mean users can retrieve relevant results even with incomplete, misspelled or imprecise queries, and that they can execute even complex queries using a familiar Web-style text box. For instance, for a logistics application, a user could type "red fiats delivered to sicily
today," and the engine could properly interpret the query and return the appropriate results; for a database, this same query would require a multi-field form (make, model, color, event type, location, date…), or a command of SQL.
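A rough sketch of how a few of these fuzzy matching layers might be combined is shown below. The vocabulary, synonym table and similarity threshold are invented for illustration; production engines use far richer linguistic resources:

import difflib
import re

SYNONYMS = {"plane": ["aircraft"], "interruptor": ["switch"]}  # illustrative only
VOCABULARY = ["hyphen", "robust", "robot", "robin", "report", "retort", "aircraft"]

def expand_term(term):
    # Expand one query term with truncated, regex, approximate and synonym matches
    matches = set()
    if term.endswith("*"):  # truncation: rob* -> robust, robot, robin
        matches |= {w for w in VOCABULARY if w.startswith(term[:-1])}
    elif term.startswith("/") and term.endswith("/"):  # regular expression
        pattern = re.compile(term.strip("/"))
        matches |= {w for w in VOCABULARY if pattern.fullmatch(w)}
    else:
        # approximate spelling via edit-distance-style similarity
        matches |= set(difflib.get_close_matches(term, VOCABULARY, cutoff=0.8))
        matches |= set(SYNONYMS.get(term, []))  # semantic expansion
    return matches or {term}

for q in ["hyphene", "rob*", "/re.ort/", "plane"]:
    print(q, "->", expand_term(q))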
Figure 7.4: With natural language queries available in Search Based Applications, users can enter anything, such as a car name, product number, or account number, without having to know how to formulate complex queries or having specific data like ID numbers on hand.
Output
In addition to these evolutions in query possibilities, SBA engines can now support the same flexible results output options as databases: tabular results, charts, graphs, cartography, etc. In addition, they support output options databases do not, such as automatic semantic mapping and faceted navigation by content categories. Most importantly, results manipulation and exploration are dynamic and ad hoc. Every facet of data represented in the index can be exposed and manipulated using dynamic clustering and reporting, without formulating queries in advance, and the presentation of data evolves as the underlying data changes. We'll cover the facet-based reporting capabilities of SBA engines further in Chapter 12, Anatomy of a Search Based Application.
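The facet computation itself is conceptually simple: a single pass over the matching documents yields counts for every attribute, which the interface can then render as clickable refinements. A minimal sketch, with invented field names and values:

from collections import Counter, defaultdict

def compute_facets(results, facet_fields):
    # One pass over a result set yields counts for every requested facet
    facets = defaultdict(Counter)
    for doc in results:
        for field in facet_fields:
            if field in doc:
                facets[field][doc[field]] += 1
    return facets

results = [
    {"make": "Fiat", "color": "red", "region": "Sicily"},
    {"make": "Fiat", "color": "white", "region": "Sicily"},
    {"make": "Renault", "color": "red", "region": "Lazio"},
]
for field, counts in compute_facets(results, ["make", "color", "region"]).items():
    print(field, dict(counts))  # e.g., make {'Fiat': 2, 'Renault': 1}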
7.3. WHAT’S CHANGED?
49
Figure 7.5: Here, an engine applies synonym matching, phonetic spelling and approximate spelling to correctly interpret "An interruptor in Manatan" as a request for "Control Switches in Manhattan, New York, USA." Applying business rules, priority ranking may be applied to particular switch manufacturers, and related query suggestions may be offered for further guidance and refinement.
7.3.2 DATABASES
SQL continues to be the standard query language for RDBMS, though XQuery (a language developed for querying XML documents, Boag et al. [2007]) has gained ground as a tool for querying RDBMS. Data views and reports also continue to be circumscribed by fixed, pre-defined data models and SQL commands, and even though RDBMS are being used to store more textual data [Agrawal et al., 2002], they cannot provide the kind of flexible, Web-style full-text search users expect unless coupled with a search engine. On the other hand, even though search engines are now capable of executing a wide range of operations producing database-style querying and reporting (and therefore meeting most data access and analysis needs), SQL remains the optimal technology for conducting highly complex analytical and historical analysis. NoSQL systems, by contrast, employ a wide range of query languages and methodologies, including HTTP/REST APIs, non-SQL query languages (GQL, SPARQL, Gremlin, etc.), search engine-style Boolean operations, Map-Reduce functions (in Erlang or JavaScript) and specialized query APIs (Bigtable, Neo4j). For some, querying is as restrictive as for legacy search engines; for
others, it approaches the suppleness of SQL (particularly in the case of XQuery for XML databases and SPARQL for RDF/graph databases). Some systems support only exact data matches against IDs (key-value stores). Some do not support JOINs (wide column datastores); others do via JSON (JavaScript Object Notation) or BSON (Binary JSON). Most, with the exception of some document databases and wide column stores, do not support search engine-like full text searching. Because of this latter limitation, and the fact that many of the query methods are neither as familiar nor as mature as SQL, system architects often recommend coupling a NoSQL system with an RDBMS or a search engine to facilitate information retrieval.
Figure 7.6: Querying NoSQL systems can be a challenge, leading many to couple them with search engines or RDBMS to facilitate data retrieval. (Illustration courtesy of John Muellerleile.)
CHAPTER 8

Data Security, Usability, Performance, Cost

At A Glance

Characteristic       Search Engine          Databases
Security             Weak                   High
Usability            High                   Low
Data volume          Billions of records    Millions of records
Response time        Sub-second             Seconds, minutes or even hours
Number of users      Unlimited              Limited
Cost of ownership    Low per user           High per user

8.1 SEARCH ENGINES
Conventional Web search engines score extremely high on usability, performance and per-user cost. They were natively designed for highly scalable, distributed computing environments, and are optimized for read (access) operations. They are also user-friendly IR systems, having well demonstrated their capacity to provide timely access to billions of documents by millions of users around the globe — with no end user training. Their cost per user is also minuscule compared to traditional business information systems. However, security is not a core strength of conventional search engines. As they were developed for IR over publicly available information, they were not engineered for a high level of data security, leaving security management instead to the application layer of individual sites.
8.2 DATABASES
Designed from inception to safely and accurately record business transactions, databases score extremely high on security, but, as described in prior chapters, scaling to improve performance is costly and difficult once one surpasses a certain threshold of users and data. The relational model is simply not well-suited to massive, evolving data sets. ACID constraints are resource-intensive, as are
complex and numerous SQL queries, and conflicts frequently arise between transactional and access demands. As noted in the last chapter, usability is also not a core strength of databases: retrieving or writing data often requires the use of complex forms or a command of SQL. And even though much work goes into enterprise applications (Customer Relationship Management (CRM), Supply Chain Management (SCM), Product Lifecycle Management (PLM), Enterprise Resource Planning (ERP), etc.) built on databases to make them easier to use, most still require substantial training, documentation and/or technical support for successful use. Deployment is also typically slow, and modifying or scaling a system is often difficult. The cost of ownership is also elevated compared to search. Aside from per-seat licensing for commercial RDBMS, administrative costs are typically higher for RDBMS-based applications, and search engines usually have a much leaner hardware footprint for equivalent volumes of users and data.
8.3 WHAT HAS CHANGED

8.3.1 SEARCH ENGINES
Security
As search engines entered the enterprise, security rose to the forefront of engineering concerns. Of the three basic areas of data security:

1. Confidentiality: protection against unauthorized access,
2. Integrity: protection from the unauthorized creation, modification or deletion of data, and
3. Availability: protection from unauthorized access blockage, typically network assaults such as a denial of service attack,

enterprise search concerned itself primarily with data confidentiality, leaving the data's parent application (such as an RDBMS) to safeguard data integrity, while broader network defenses and architectures continue to protect data availability. However, most Search Based Application engines do employ transaction logging and locking models to ensure that all index partitions and replicas remain fully synchronized, and they can leverage this infrastructure to provide a degree of data integrity protection sufficient for basic transaction processing. To ensure data confidentiality, SBA engines usually adopt and reuse existing security infrastructures. In essence, data is acquired with its user authorization data (individual and group identification, and privileges) from common security sources like:

• Local system security
• LDAP (Lightweight Directory Access Protocol)
• OpenLDAP
• Active Directory
• Document Management System Directories (e.g., Lotus Notes, EMC Documentum Server, etc.)

and standard connectors such as:

• Filesystem
• Database
• IMAP (Internet Message Access Protocol)
• Lotus Notes
Figure 8.1: In Search Based Applications, security uses the underlying access control lists of original databases.
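One common pattern, often called early binding, indexes each document's access control list alongside its content and filters hits against the user's identity at query time. The sketch below assumes a simple ACL-per-document model with invented user and group names:

def visible_results(results, user, user_groups):
    # Filter hits against per-document ACLs captured at indexing time
    allowed = []
    for doc in results:
        acl = doc.get("acl", {"users": [], "groups": []})
        if user in acl["users"] or set(user_groups) & set(acl["groups"]):
            allowed.append(doc)
    return allowed

index_hits = [
    {"title": "Q3 sales forecast", "acl": {"users": ["alice"], "groups": ["sales"]}},
    {"title": "Payroll review", "acl": {"users": [], "groups": ["hr"]}},
]
# 'bob', a member of group 'sales', sees only the first document
print([d["title"] for d in visible_results(index_hits, "bob", ["sales"])])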
RDBMS, on the other hand, have always concerned themselves with both data confidentiality and integrity, and accordingly they remain the preferable technology for ensuring data integrity during write operations. SBAs therefore typically pass complex or sensitive transactional requests to source databases for secure processing.
Usability, Performance, Cost
The advantages of search technologies in terms of usability, performance and cost for IR remain the same. However, what is new is the porting of these advantages from search tasks to broader information access, reporting and analysis needs within enterprise applications. When deployed as an IR backbone in business applications, SBAs can significantly reduce costs while simultaneously enabling significant scaling in users and data volume.
Figure 8.2: In Search Based Applications, the cost per query is radically reduced.
They also reduce training requirements, replacing classic database search and navigation with Web-style text box search, faceted navigation and ad hoc reporting. SBAs also enable rapid, iterative software development, with production releases typically deployed in days or weeks versus months or years for conventional RDBMS architectures. This deployment advantage is due to SBA engines':

• Minimal, flexible data model
• Highly automated processes (extraction, semantic processing, indexing, query processing, reporting, etc.)
• Maturity (standardization, packaging, support, etc.)

The flexibility of the data model together with the distributed architecture employed by SBA engines also makes post-deployment modifications much easier: there is no need to modify source systems or undertake traditional data migration whenever there are changes to underlying data schemas or marked increases in data volumes or users, and documents and/or document fields can often be dynamically added, updated or removed in the index. NoSQL systems offer similar performance, cost and adaptability advantages. As with SBA engines, many of these gains are enabled by weaker enforcement of ACID constraints and limited support for complex transaction processing.1 So, while NoSQL systems do provide limited transaction support and offer various methods of authentication and/or authorization, RDBMS remain a wiser choice for sensitive, transaction-heavy operations such as ecommerce. In such circumstances, one could consider a NoSQL/RDBMS pairing akin to the SBA/RDBMS pairing mentioned earlier (a strategy employed by Amazon2); however, deploying an SBA engine alongside a database is a considerably simpler task than deploying a NoSQL complement, and one immediately gains full-text search, faceted navigation and reporting functionality unavailable out-of-the-box with NoSQL systems. In addition, the lack of maturity of NoSQL systems often makes their initial deployment considerably slower than that of their plug-and-play SBA engine counterparts.
1 See David Intersimone's blog article "The end of SQL and relational databases?" (2/2/2010) starting at http://blogs.computerworld.com/15510/the_end_of_sql_and_relational_databases_part_1_of_3
2 At the April 2010 NoSQL EU event in London, Amazon CTO Werner Vogels explained that the Amazon.com homepage is a combination of 200-300 different services, with multiple data systems, noting that users do not think about data sources in isolation; they care about the amalgamated service.
CHAPTER 9

Summary Evolutions and Convergences

At A Glance Comparison

Characteristic                              RDBMS    NoSQL System    SBA Engine
Massive scalability                         N        Y               Y
Real time information access                N        Y               Y
Low per-query cost                          N        Y               Y
Rapid deployment                            N        N               Y
Evolutive data model                        N        Y               Y
Full exploitation of structured data        Y        N               Y
Ad hoc analysis over all data dimensions    N        N               Y
Structuring of unstructured data            N        N               Y
Dynamic clustering and classification       N        N               Y
Fuzzy full-text search                      N        N               Y
High throughput OLTP                        Y        N               N
Enterprise grade security                   Y        N               Y

9.1 SBA-ENABLING SEARCH ENGINE EVOLUTIONS
Search engines have evolved from simple tools for locating documents to a desirable information infrastructure for business applications because they are now capable of:

• Structuring unstructured data to make it exploitable for business use
• Providing intuitive, scalable access to structured data
• Delivering unified, uniform access to heterogeneous, multi-source content
• Supporting full text search, faceted navigation, and ad hoc reporting and analysis in a single platform
As shown in the previous chapters, this evolution has occurred in response to the changing requirements accompanying the entry of search engines into the enterprise, as well as evolving IR demands on the Web (including the drive to index more semi-structured and structured content). Below is a summary of the native characteristics and recent evolutions supporting these new capabilities:
9.1.1 DATA MODEL

• Native capacity: Flexible, evolutive data model (the document model); de-normalized representation of data attributes
• Recent changes: Evolution of the document model towards a persistent, entity-style representation of information
9.1.2 DATA STORAGE

• Native capacity: Index as primary storage structure; column-oriented data representation
• Recent changes: Expansion of volume and complexity of data attributes stored; ability to treat different data types differently (e.g., numbers, dates, text strings); ability to exploit the structural semantics of category/value attribute pairs; extended use of subindexes for different data views
9.1.3 DATA COLLECTION

• Native capacity: Crawlers for extracting Web data; use of incremental, differential updates
• Recent changes: More robust semantic tools for extracting entities from unstructured data (standard or custom thesauri); open API/connector framework for extracting new types of structured, semi-structured and unstructured data; ability to extract and retain semantics of structured sources
9.1.4 DATA PROCESSING

• Native capacity: Natural language processing; statistical calculation of relevancy
• Recent changes:
New semantic criteria for measuring relevancy in a business context; independent semantic clustering and categorization; flexible use of ontologies (extracted, imported, manual, inferred or learned); sentiment mining; application of business rules to data processing; semantic enrichment of data with attributes, meanings and relationships not represented in source systems
9.1.5 DATA RETRIEVAL & OUTPUT

• Native capacity: Natural language querying; full text scanning; column-oriented reads
• Recent changes: Search APIs that make index content exploitable by other applications; use of categories and clusters for faceted search, faceted navigation, and facet-based reporting; application of semantic interpretation and business rules to query processing; use of textual, numeric & symbolic constraints in querying; JOINs across unstructured and structured data; re-purposing of relevancy calculations for reporting and analysis; data output in standard formats (e.g., RSS, Atom, XML) to support infinite representation possibilities (lists, charts, graphs, text, HTML, maps, etc.)
9.1.6 DATA SECURITY, USABILITY, PERFORMANCE, COST

• Native capacity: Distributed architecture and column-oriented index structure for massive scalability, high performance and high availability; high usability due to natural language support, fuzzy matching technologies, use of dynamic facets, and adaptive data model; low per user cost
• Recent changes: Strong protection of data confidentiality with the capacity to import, update and enforce access rights and rules from standard security infrastructures
9.2 CONVERGENCE
All of these evolutions have enabled search engines to produce database-style access and reporting, blurring the line between IR in the RDBMS and search spheres, and shedding light on the rise of NoSQL systems that likewise represent an intersection of strategies, technologies and end uses between the search and database worlds. The depth of this convergence is reflected in current RDBMS strategies to improve handling of semi-structured and unstructured content and to overcome performance and scaling limitations,1
and in the confusion in self-representation within the NoSQL and Search Based Application engine communities. SBA engine developers increasingly refer to their products not as search engines but as "information access platforms," and many NoSQL developers don't use the word "database" at all—preferring "store" (as in key-value and wide column stores) or "distributed storage system." As an example of the fluidity of these terms, Google originally labeled BigTable a "distributed data storage system" but now refers to it as a "distributed database," and BigTable is now employed within Google Caffeine as the prime index structure for the Google Web index, with a "kind of framework" on top that employs methods comparable to "old-school database programming and the use of 'database triggers'."2 The confusion is well-founded. There are enough points of intersection between NoSQL systems and SBA engines that you could describe an SBA search engine as a sort of composite of all NoSQL databases: an SBA engine is, after all, a kind of distributed, multi-axial key-value store with a column and document orientation that is concerned with relationships. On the other hand, you wouldn't describe a NoSQL database as an SBA engine. Some NoSQL DBs are concerned with search, some are more focused on general information access. Some are positioned as RDBMS replacements, others as complements. Some are suited to reporting for decision intelligence, others for simple IR. Some are intended to be accessed only by other applications, others support direct human interaction. SBA engines, however, combine all these characteristics. And SBA engines make extensive use of full text analytics and semantic technologies that are either not present or only marginally present in NoSQL databases. They are also mature, deploy rapidly and are easy to use and manage.

1 See, for example, "The Case for Determinism in Database Systems" by Thomson and Abadi: http://db.cs.yale.edu/determinism-vldb10.pdf
2 From an interview with Eisar Lipkovitz, a senior director of engineering at Google, published in "Google search index splits with MapReduce," The Register, September 9, 2010: http://www.theregister.co.uk/2010/09/09/google_caffeine_explained/
CHAPTER 10

SBA Platforms

"The first and most mature information access technology is search engine technology."
—Gartner [Andrews and Knox, 2008, page 3]
10.1 WHAT IS AN SBA PLATFORM?

There are scores of basic search engines on the market that can meet an organization's need for simple and convenient desktop, enterprise and website search. However, only a subset of search engines are infrastructure-level platforms that can support Search Based Applications. To serve a wide range of information consolidation, access, discovery and analysis needs in addition to providing classic enterprise and website search (and hence earn the "SBA Platform" label), a search engine should be able to:

• Collect and process unstructured, structured and semi-structured data
• Consolidate data via an open API and connector framework (with an option to aggregate content through federated search, mash-ups and metasearch as well)
• Use semantic technologies to effectively analyze and enrich source data
• Automatically categorize and cluster content to support faceted search, navigation and reporting
• Provide a search API or built-in dashboard tools for information visualization and analysis
• Offer high performance and unlimited scalability (not all search engines are created equal!)

The search platform should also be sufficiently mature to automate essential configuration, deployment and management tasks.
10.2 INFORMATION ACCESS PLATFORMS

These are the types of advanced search systems Gartner has traditionally referred to as "information access platforms (IAPs)" [Andrews and Knox, 2008]. IDC takes a different approach, including what we label "SBA Platforms" in the broader category of "unified information access platforms" (or simply "unified access platforms"), which it defines as platforms that can "integrate huge quantities of information—structured, semi-structured, unstructured—into a unified environment for analysis and decision-making." It's a category that includes an array of search, database and XML platforms offering "newer, hybrid architectures that combine elements of both databases and search," including some of the NoSQL systems we discussed in this book. IDC notes, however, that at present, it is the "search-centric innovators" that are leading the market for UAPs (which are still "in early and experimental form") [Feldman and Reynolds, 2010, pp. 5-6]. As this book addresses "search-based applications" and not the broader topic of "unified access applications," we, like Gartner, have chosen to focus on vendors that "incorporate search as the foundational capability of their information access products" [Andrews and Knox, 2008, p. 4]. In addition, their list, like ours, is restricted to vendors that sell their information access products separately from other products; vendors who sell only packaged SBAs and not a standalone SBA platform are represented in Section 10.5. It is evident, however, that convergence is making it harder by the day to maintain precise labels and neat distinctions. As the authors of the "2008 Claremont Report on Database Research" note, fast-moving changes in information applications and technology "demand a reformation of the entire system stack for data management" [Agrawal et al., 2008], to be accompanied, no doubt, by a reformation of the entire vocabulary of data management.
10.3 SBA PLATFORMS: MARKET LEADERS

The vendors included in this section are recognized by Gartner and/or IDC as search market leaders with demonstrated SBA capabilities. Many of them also market ready-to-use SBAs in addition to standalone SBA platforms (see Section 10.5).

• Autonomy IDOL Server, SPE
www.autonomy.com/idolserver
• Attivio Active Intelligence Engine
www.attivio.com/active-intelligence
• Chiliad Discovery/Alert
www.chiliad.com/products_chiliad-discovery-alert.php
• Endeca IAP
www.endeca.com/products-information-access-platform.htm
• Exalead CloudView
www.exalead.com/software/products/cloudview
• Expert System's Cogito
www.expertsystem.net/page.asp?id=1521&idd=18
• Fabasoft Mindbreeze
www.mindbreeze.com
• Isys Search Software
www.isys-search.com
• *Recommind CORE
www.recommind.com/products/core_platform
• Sinequa
www.sinequa.com
• *Vivisimo Velocity
vivisimo.com/technology/velocity-platform.html
• *ZyLAB Information Management Platform
www.zylab.com/Products

*In the "MarketScope for Enterprise Search," which henceforth replaces the "Magic Quadrant for Information Access," Gartner considers these companies as more properly classified as vendors of commercial SBAs (Section 10.5) than as enterprise search or SBA platform providers.

While analysts often group the products below with the SBA platforms above, they are, at present, generally considered to be better adapted to specialized search needs within vendor-specific information ecosystems than to use as standalone, generalist SBA platforms.

• IBM OmniFind
www-01.ibm.com/software/data/enterprise-search
• Microsoft FAST Search for SharePoint
sharepoint.microsoft.com/en-us/product/capabilities/search
• Oracle Secure Enterprise Search
www.oracle.com/us/products/database/secure-enterprise-search
• SAP NetWeaver
www.sap.com/platform/netweaver

While Google's enterprise search offering, the Google Search Appliance (www.google.com/enterprisegsa), traditionally captures a large share of overall search market
revenue, it is primarily designed as a simple plug and play tool for meeting basic enterprise search needs. Though Google has been incrementally expanding the GSA’s connectivity and functionality, we have not seen any indications, at present, that it will evolve into a full-featured SBA platform, though Google as a company is certainly a driving force in the research community that is imagining what tomorrow’s hybrid search/database infrastructures may look like.
10.4 SBA PLATFORMS: OTHER VENDORS

Below are vendors that sometimes show up on analysts' radars and sometimes do not. Some are small, have a limited customer base, or operate in niche markets. Some self-identify as SBA platforms, others as enterprise search engines, others as special purpose indexing or access engines, still others as producers of packaged horizontal or vertical SBAs. Most are commercial, some are open source. We have included them here to try to ensure that good candidates don't slip through the net, but as we cannot vouch for their SBA suitability firsthand, we recommend you put their products through the usual screening procedures: SBA references, performance benchmarks for similar deployments, a Proof of Concept (POC) using your own data, etc.

• BA-Insight
www.ba-insight.net
• Brainware Globalbrain
www.brainware.com/search.php
• Connotate
www.connotate.com
• Constellio Open Source Enterprise Search (Lucene-based)
www.constellio.com
• Coveo Enterprise Search Platform
www.coveo.com/en/products/platform
• Dieselpoint Search
www.dieselpoint.com
• dtSearch
www.dtsearch.com
• Funnelback Enterprise
www.funnelback.com/our-products/enterprise-search
• InfoFinder
www.infofinder.no
• Intelligenx
www.intelligenx.com
• hakia Enterprise Search
company.hakia.com/new/semanticbooster.html
• Kapow Technologies Web Data Server
kapowtech.com/index.php/products
• MuseGlobal
www.museglobal.com
• Neofonie
www.neofonie.com
• ontoprise Semantic Infrastructure
www.ontoprise.de
• Perfect Search
www.perfectsearchcorp.com
• Pertimm
www.pertimm.com
• PolySpot Enterprise Search
www.polyspot.com
• Siderean Seamark
www.siderean.com/products_suite.aspx
• Sphinx, open source
www.sphinxsearch.com
• Surfray Mondosearch
www.surfray.com/products/mondosearch.html
• TEMIS Luxid
www.temis.com/index.php?id=201&selt=1
• Thunderstone Search Appliance
www.thunderstone.com/texis
• X1 Technologies X1 Platform
www.x1.com/landing/x1_professional.html
• Xapian (search library), open source
www.xapian.org
10.5 SBA VENDORS: COTS APPLICATIONS

To the best of our knowledge, these vendors' core search and information access platforms are not separately available for general SBA development (or not specifically designed for such open use), but organizations with specific needs can nonetheless benefit from the SBA framework and natural language processing technologies at the core of these vendors' products, which include software for eDiscovery, Compliance, Content Management, Collaboration, Business Intelligence, Question Answering, eCommerce, Merchandising and more. Look for rapid growth in the number of commercial-off-the-shelf (COTS) vertical and horizontal SBA solutions appearing in the marketplace.

• Accept Software Corporation
www.accept360.com
• Attensity Group
www.attensity.com
• Baynote
www.baynote.com
• Clarabridge
www.clarabridge.com
• Cataphora
www.cataphora.com
• Consona
www.consona.com
• Crowdcast, Inc.
www.crowdcast.com
• FTI Technology
www.ftitechnology.com
• InQuira
www.inquira.com/solutions_overview.asp
• Iron Mountain/Stratify
www.ironmountain.com, www.stratify.com
• Kazeon (acquired by EMC)
www.kazeon.com
• MyRoar
www.myroar.com
• Omniture (Adobe)
www.omniture.com
• Open Text
www.opentext.com
• Palantir Government
www.palantirtech.com
• Silobreaker Enterprise (Lucene-based)
www.silobreaker.com/EnterpriseSoftware.aspx
• Zilliant
www.zilliant.com

Let's now look more closely at how and when SBAs are used, followed by a discussion of how they work.
CHAPTER 11

SBA Uses & Preconditions

Search Based Applications are used to develop consumer and business applications that combine the unique advantages of multiple information access infrastructures, including:

• The precise faceted navigation & multi-axial analysis of relational databases,
• The massive scalability, performance and flexible data model of a NoSQL distributed data store, and
• The simple, intuitive usage of a public Web search engine—all wrapped in a framework that uses semantic technologies to align unstructured and structured data, and make it more relevant and understandable to the end user.
11.1 WHEN ARE SBAS USED?

SBAs are typically deployed when one or more of the following needs exist:

1. A Need to Aggregate Heterogeneous Content
SBAs use semantic technologies to provide unified, structured access to a diverse range of multi-source content, with a flexible data model that can accommodate evolving source data.

2. A Need to Process Large Volumes of Data or Serve a High Number of Users
Naturally optimized for executing access requests against large data volumes, and constructed with fully distributed architectures, SBA platforms provide exceptional performance and availability in large volume environments.

3. A Need for Real Time Information
Deployed non-intrusively alongside relational database systems, SBA platforms support continual incremental and differential updates directly from source systems, enabling real-time (or quasi-real-time) data availability.

4. A Need for Ad Hoc Reporting against a Broad Range of Criteria
SBA platforms enable ad hoc reporting against any criteria maintained in the index—whether the criteria are extracted from a source system or created by the engine during the course of natural language processing—without SQL programming.
5. A Desire to Democratize Information Access
SBA platforms in essence 'decouple' data from complex systems and make it accessible using natural language technologies.
11.2 HOW ARE SBAS USED?

SBAs are used for a variety of purposes, including:

• Enterprise Business Applications, such as:
– Customer Service & Support
– Sales Support (CRM, Telemarketing)
– Enterprise Resource Planning (ERP)
– Compliance & eDiscovery
– Supply Chain Management (SCM)
– Logistics, Track & Trace
– Business Intelligence (BI)
– Competitive Intelligence (CI) & Reputation Monitoring

• Web Applications
Typically for B2C and C2B applications that mash-up data and functionality from diverse sources (databases, Web content, user-generated content, mapping functions, sentiment analysis, etc.).

• Database Offloading
SBAs are used to provide an alternate means of accessing database content to address performance, scaling and usability issues. Offloading a database to an SBA is also sometimes used to ensure continuity of information access during large scale migration projects.

• Information Lifecycle Management
SBAs are used to accelerate and improve ILM processes and ecosystems, including Product Lifecycle Management (PLM), Customer Data Management (CDM) and general Master Data Management (MDM).

Let's now look more closely at how SBAs work, followed by case studies of several SBA deployments.
CHAPTER 12

Anatomy of a Search Based Application

At an anatomical level, search based applications can be grouped according to the type of content they process:

1. Structured data
2. Unstructured data
3. A mix of unstructured and structured data

In the case of the first, a search platform is used to provide more scalable and user-friendly access to RDBMS content. In the case of unstructured data, the SBA platform aggregates multi-source content and makes it exploitable by giving it a uniform structure. The latter hybrid category is the most common, with the SBA platform used to federate, contextualize and normalize heterogeneous data from diverse systems. This aggregation may be a 'stable' one that takes place at the index level, as is the case for most enterprise SBAs, or it may be a dynamic aggregation that takes place at query time (a real time 'mash-up' most common on consumer-facing Internet applications). Common IR functions for all these SBA types include:

• Semantic Search
• Faceted Navigation
• Dynamic Reporting & Analysis
• Sentiment Analysis (for SBAs processing unstructured data, as qualitative data is rarely captured in structured format)

Let's look first at how an SBA works when processing structured data.
12.1 SBAS FOR STRUCTURED DATA

12.1.1 DATA COLLECTION

Loading content from a database into a search engine index is a simple matter now that SBA platforms offer mature database connectors that can extract data within the context of the database's
semantic schema. All one really needs to do is set a few high level entity mappings (products, customers, etc.) and let the engine take care of the rest. Once a primary key is identified for a top level entity, the engine and connector can follow the system of foreign keys to create holistic views of these entities. These views are stored as indexed documents. One doesn't need to execute SQL queries in advance to achieve relevant data views—that is the power of SBAs. However, there are mechanisms in place to allow database administrators to construct triggers, stored procedures and pre-computed views if they so desire, and to control the messaging system and middleware used.
How to Offload Data into a Search Engine
1. Decide on top level entities that matter to users (e.g., invoice, order, client)
2. Set these entities in the connector, and the engine will use the data source's schema to construct documents based on these entities; each entity becomes a document
3. The engine will extract entity attributes and map these into the appropriate document fields
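In outline, the connector's work amounts to denormalization: starting from a top level entity, follow the foreign keys and fold related rows into one flat document. A toy sketch using SQLite (the schema is invented, and real connectors derive the joins from the database catalog rather than hard-coding them):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL,
                     customer_id INTEGER REFERENCES customers(id));
INSERT INTO customers VALUES (1, 'ACME'), (2, 'ABC Corp');
INSERT INTO orders VALUES (410, 12500, 1), (411, 10000, 1), (412, 12500, 2);
""")

def order_documents(conn):
    # Follow the order -> customer foreign key to build one flat document per order
    rows = conn.execute("""
        SELECT o.id, o.amount, c.name
        FROM orders o JOIN customers c ON o.customer_id = c.id""")
    return [{"order_id": oid, "amount": amount, "customer_name": name}
            for oid, amount, name in rows]

for doc in order_documents(conn):
    print(doc)  # each top level entity becomes an indexable document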
12.1.2 DATA PROCESSING

Using natural language processing, the engine extracts the semantic information in the database and uses it to build holistic views of the desired entities (the documents). Once indexed, all attributes are separately accessible even though they are stored in a 'flat' document model (Figure 12.1). With the wide-column structure of SBA engine indexes, the number of attributes that can be stored for each document ranges from hundreds to thousands. Classes (or clusters) of attributes become the facets used for search, navigation and reporting. (One could in essence think of the engine as executing multiple SQL 'GROUP BY' clauses in a single pass, and producing all the aggregates and counts one would need for an infinite number of views.) For an SBA that aggregates structured data from more than one source, one would need to map top level entities (like products) across multiple systems to achieve unified IR and reporting. Let's say, for example, you would like to achieve a global view of a product across PLM (design, planning, specifications), ERP (cost and supplier data) and CRM (customer base) systems. To accomplish this, one would set 'product' as a top level entity to be tracked across all systems and assign it an entity ID (either original or drawn from one of the source systems). One would then map this ID to the variants automatically extracted from the different source systems. To more effectively identify common entities across heterogeneous systems, or even to automate the entire mapping process, a search platform may employ fuzzy ontology matching. In fuzzy ontology matching, the search engine uses semantic technologies (statistical semantics, contextual analysis, synonyms, and linguistic variants) to align schemas, as sketched below.
Figure 12.1: Engine and connectors leverage source schema to create holistic views of entities.
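A crude illustration of fuzzy ontology matching follows: attribute names from two systems are aligned using a synonym table plus string similarity. Both the synonym pairs and the threshold are assumptions made for the sketch:

import difflib

SYNONYMS = {("gender", "sex"), ("client", "customer")}  # illustrative pairs

def normalize(name):
    return name.lower().replace("_", " ").split()

def aligned(a, b, cutoff=0.75):
    # Propose a match between two attribute names via synonyms or similarity
    for ta in normalize(a):
        for tb in normalize(b):
            if (ta, tb) in SYNONYMS or (tb, ta) in SYNONYMS:
                return True
            if difflib.SequenceMatcher(None, ta, tb).ratio() >= cutoff:
                return True
    return False

crm_fields = ["Customer_Name", "Gender", "Zip_Code"]
erp_fields = ["client", "sex", "zip"]
for f1 in crm_fields:
    for f2 in erp_fields:
        if aligned(f1, f2):
            print(f"candidate mapping: {f1} <-> {f2} (flag for human review)")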
12.1.3 DATA UPDATES

Entire documents or individual attributes stored in documents can be updated in a differential, incremental process. Most SBA platforms support both push and pull strategies for updating the index, and can accommodate any refresh rate. When underlying data is updated, the facets are dynamically updated as well. As noted previously, even though the engine does place an access load on source databases, it is a known load that can be optimized, and it is far more advantageous to process a single large, controllable access request than scores of direct end user queries. Once data is indexed, the SBA platform can handle a virtually unlimited number of such requests, without touching the database. And because SBA platforms can extract, aggregate and update data from a virtually unlimited number of source systems, SBAs enable one to sidestep the batch updates associated with data warehouses to provide unified, real-time data access that can satisfy the majority of routine access and reporting needs while preserving the data warehouse infrastructure for complex analytics, if so desired.
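The effect of a differential update can be pictured with the simplified in-memory stand-in below: only the changed attributes of one document are rewritten, and facet counts are refreshed from the index rather than from the source database (document contents are invented):

from collections import Counter

index = {
    410: {"customer": "ACME", "status": "in transit"},
    411: {"customer": "ACME", "status": "delivered"},
}

def apply_update(doc_id, changed_fields):
    # Differential update: rewrite only the changed attributes of one document
    index.setdefault(doc_id, {}).update(changed_fields)

def status_facet():
    return Counter(doc["status"] for doc in index.values())

print(status_facet())                       # {'in transit': 1, 'delivered': 1}
apply_update(410, {"status": "delivered"})  # delta pushed or pulled from the source
print(status_facet())                       # facets reflect the change immediately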
Figure 12.2: Here an SBA platform uses semantic technologies to aggregate data from 23 ERP systems to facilitate spend analysis for vendor negotiations and expense management.
12.1.4 DATA RETRIEVAL & ANALYSIS

Once an index is built, the full range of database content can be explored through full-text, natural language search and navigated via dynamic facets. One can control which facets to display, and how to order them (at the index or application level), or for BI, simply let the engine expose all relevant facets as the user explores the data. One does not need a separate query for each data view of interest; all views are automatically computed on the fly by the index. To produce OLAP-style reporting, SBAs re-purpose the statistical computations used to calculate relevancy and ranking. To understand how this works, let's first look at the mathematical operations already being performed for ranking. We'll take the example of an application indexing a product catalog. First, a count is produced for absolutely every attribute tracked in the index (product name, price, sales volume, color, manufacturer, etc.) to provide baseline data for ranking results. For example, sales volume may be totaled and a formula applied to give priority ranking to high volume products in result sets. Additional computations may be performed to refine relevancy, for example, balancing a high "popularity" score based on sales volume with considerations of price and delivery delay. The application might want to use a relevancy scoring model where the document's score is:

• Higher when the item is popular (sales volume is high)
• Lower when the price is high
• Lower when the delivery delay is high

To implement this balance, the engine might decrease the sales volume score by 1 point for every $10 in the price, and by 10 points for every day of delivery delay, using a numerical expression like the following (note: many engines offer non-technical WYSIWYG tools for managing relevancy weighting):

ln(sales_volume) ∗ 100 − (price ∗ 0.1) − (delivery ∗ 10)

Figure 12.3: Facets are automatically updated whenever new data is added, and the flexible data model can incorporate new category/value pairings on the fly.

A numerical expression like this can be dynamically changed on a per-query basis, and therefore customized depending on the customer's profile, the type of search query, the time of day, or any other criteria one chooses. In addition to dynamic scoring, one can also generate dynamically computed numerical fields (also known as "virtual fields") at query time. These expressions can combine any numerical field
value associated with the document in the index (for example, the document's date or the document's popularity value) with numerical constants specified in the search query (for example, the current date, or the latitude/longitude position of the end user executing the search query). Available numerical operators include:

• Basic numerical operators (+, −, ∗, /, %)
• Parenthesized expressions ()
• Logarithm, exponential operators
• Comparison operators (>, >=, <, <=, ==)
• Conditional operators
• Euclidean distance
• Geographical distance between two coordinates (latitude, longitude)

Together, these counts, scores and numerical fields provide a very powerful tool for turning document numerical properties into a ranking score value, giving full control over the way values are combined with each other and their relative weight in the final score. They also provide the foundation for dynamic, OLAP-style reporting in SBAs. We know from our example above that numeric values like 'sales_volume' are already being calculated to produce and manipulate ranking scores, and that a host of numerical operators exist for manipulating these scores. So, if we take a simplified orders document like the one below, we know a value for total order amount has already been calculated in order to assign a base ranking score, which at query time would take in other factors like lexical proximity (closeness of match, phonetic or approximate spelling matches, synonyms...). In addition, at query time, semantic processors generate clusters (facets) for this data around all intrinsic attributes (customers, dates and date ranges, order amounts and ranges, products, etc.) as well as some external ones (like related terms).

Order_ID    Attributes
410         Product_ID/123; Customer_Name/ACME; Amount/€12.500; Date/06-10-2010
411         Product_ID/157; Customer_Name/ACME; Amount/€10.000; Date/31-10-2010
412         Product_ID/123; Customer_Name/ABC Corp; Amount/€12.500; Date/01-11-2010
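To see how the ranking expression above behaves, here is a direct transcription into code, followed by a query-time computation of average order amount per customer in the spirit of a 'virtual field'; all figures other than the formula's weights are invented:

import math
from collections import defaultdict

def relevancy(doc):
    # ln(sales_volume) * 100 - (price * 0.1) - (delivery * 10), as in the text
    return (math.log(doc["sales_volume"]) * 100
            - doc["price"] * 0.1
            - doc["delivery_days"] * 10)

products = [  # invented catalog entries
    {"name": "A", "sales_volume": 5000, "price": 120.0, "delivery_days": 2},
    {"name": "B", "sales_volume": 900, "price": 80.0, "delivery_days": 1},
]
for p in sorted(products, key=relevancy, reverse=True):
    print(p["name"], round(relevancy(p), 1))  # A 819.7, then B 662.2

# A query-time 'virtual field': average order amount per customer
orders = [("ACME", 12500), ("ACME", 10000), ("ABC Corp", 12500)]
totals, counts = defaultdict(float), defaultdict(int)
for customer, amount in orders:
    totals[customer] += amount
    counts[customer] += 1
for customer in totals:
    print(customer, "average order:", totals[customer] / counts[customer])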
With this rich set of clusters and key data calculations, developers can simply feed this data to any report/dashboard generator to produce ad hoc reporting on any and every data facet and/or calculation, in any form desired: tables, charts, graphs, maps, etc. Accordingly, these order documents could generate automatic clusters and charts representing:

• Order volume by customer, product, date or date range
• Product volume or sales by customer, date or date range
• Customer activity by sales volume, product, date or date range
• Total sales by customer, product, date or date range
• Order distribution by date or date range
• Etc., etc.

By leveraging existing numerical operators and dynamic virtual fields, one can extend the range of reporting even further, for example, calculating the average order amount by date, customer, product, etc. The range of existing data and available operations can in fact cover 95% of decision intelligence needs, omitting only highly complex analytical or historical functions. Keep in mind no SQL queries have been written to produce all this intelligence. It is available a priori in the search engine. There are no limits to the types of reports that can be presented, the number of facets that can be represented, or the level of detail a user can explore. All representations are generated on the fly from the result set, and that set evolves with each user click. For instance, if an attribute like Tax were added to the database, this data would be immediately available in representations of the orders. Even in large volume environments, it remains every bit as fast as simple keyword search.
12.2 SBAS FOR UNSTRUCTURED CONTENT

12.2.1 DATA COLLECTION

For SBAs aggregating unstructured content, data collection is handled by file connectors and HTTP crawlers. Most SBA engines can natively accommodate an extensive range of formats, with a push API covering rare or legacy formats.
12.2.2 DATA PROCESSING

The goal of processing in this context is to create structure where none exists. It is this transformation that makes unstructured data meaningful and exploitable for business use. SBA platforms use semantic technologies to provide this structure, enhancing content with metadata not present in its unprocessed format. This processing includes:

• Standard and Custom Entity Extraction
• Summarization
• Event and Fact Extraction
• Multimedia Analysis
• Sentiment Analysis
• Dynamic Categorization and Clustering
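As a flavor of what 'creating structure' means in practice, the toy extractor below pulls a price and a mileage out of free ad text with regular expressions. Real SBA platforms use full linguistic pipelines; the patterns here are simplifying assumptions:

import re

PRICE = re.compile(r"€\s?([\d.,]+)")
MILEAGE = re.compile(r"([\d.,]+)\s?km\b", re.IGNORECASE)

def structure(ad_text):
    # Extract structured facets (price, mileage) from raw ad text
    record = {"raw": ad_text}
    if m := PRICE.search(ad_text):
        record["price_eur"] = m.group(1)
    if m := MILEAGE.search(ad_text):
        record["mileage_km"] = m.group(1)
    return record

print(structure("Red Fiat 500, 2008, 62.000 km, very clean, €4.900 negotiable"))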
12.2.3 DATA UPDATES

Update processes for unstructured sources are also typically differential and incremental. With limited control over external sources like websites, a pull rather than push strategy is usually employed for these updates. The engine can accommodate any desired refresh rate. As with structured resources, when underlying data is updated, the facets are dynamically updated as well.
12.2.4 DATA RETRIEVAL & ANALYSIS

As with natively structured data, all facets, counts and statistical calculations for processed unstructured content are available for structured search, navigation and reporting. These facets and associated computations include those performed for numerical data that did not previously exist; for example, normalized physical measures and cost data extracted from Web content.
Figure 12.4: This search based application, an aggregator, uses semantic processing to extract and automatically dimension structured facets from the raw text of 20 million online ads. This structured content could be used for OLAP reporting in addition to faceted search and navigation, delivering, for example, real-time after-market intelligence to automobile manufacturers.
12.3 SBAS FOR HYBRID CONTENT

In developing an SBA aggregating unstructured and structured content, the engine gives unstructured data a structured format and captures the structural semantics of already structured data. From there, one can apply the same high level ontology strategies described under IR for structured data above, deciding what mix of dynamic (semantic) or directed (manual or learned) classification and clustering is desired. For example, Figure 12.5 shows a customer service SBA for a large telecommunications company. Prior to launching an SBA, contact center agents for this company had to consult up to two dozen systems (many text-only, 'dumb' terminal-type applications) to serve any one customer, with agents generating a flurry of Post-It notes for each service request to keep track of it all. To resolve these challenges, the company deployed an SBA providing index-based aggregation of hybrid data from sources including:

• Siebel CRM data (customer name, address, market segment, customer type, etc.)
• Provisioning information (type of equipment, cable length, line impediments, etc.)
• Network monitoring systems (status, performance, loads, etc.)
• Contract data (options, seniority, contract period, terms, etc.)
• Technical information (intervention history, technician issues on-site, pending appointments, etc.)

The new system places at agents' immediate disposal all the information they need to satisfy any customer demand—plus sub-second responsiveness and a TCO representing a mere fraction of the cost of maintaining the current architecture or revamping it using traditional approaches. In addition to index-level aggregation, some SBA engines further support content federation at query time through live mash-ups. Query-time mash-ups allow one to add context and relevance to data without placing a load on the index. For example, for each query to an index aggregating product data from multiple PLM systems, one could execute simultaneous (parallel) queries (or feeds) to pull in relevant data from external sources like online catalogs in order to facilitate the search for suitable replacements or alternate suppliers (see the sketch below). Each data feed can be configured to retrieve highly specific types of information, and queries to multiple sources can be nested within a single feed to successively enrich results. Online, these types of mash-ups are being used in industries like media, travel, ecommerce and publishing to produce engaging portals that merge content and functionality from sources such as databases, mapping services, business applications and the Web, including user-generated content (UGC) and social networks. We'll now look at three SBA case studies in detail to learn more about how SBAs work and how they are being used.
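The query-time federation described above is essentially a fan-out: the same query is dispatched to several sources in parallel and the answers are merged before rendering. A schematic sketch using a thread pool, where the source functions are stand-ins for real connectors or feeds:

from concurrent.futures import ThreadPoolExecutor

def query_plm_index(q):  # stand-in for the local PLM index
    return [{"source": "plm", "part": q, "status": "discontinued"}]

def query_catalog(q):  # stand-in for an external supplier catalog feed
    return [{"source": "catalog", "part": q, "price": 12.5}]

def mash_up(q, sources):
    # Fan the query out to all sources in parallel, then merge the answers
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda fetch: fetch(q), sources))
    return [hit for hits in result_lists for hit in hits]

print(mash_up("valve-2041", [query_plm_index, query_catalog]))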
Figure 12.5: This customer service SBA aggregates data from multiple structured and unstructured sources into a single dashboard.
Figure 12.6: This search based application uses semantic processing to create a one-stop travel site that aggregates unstructured and structured content from 10 internal and external sources in real time, with all data presented in context according to the company’s business rules.
CHAPTER 13

Case Study: GEFCO

13.1 BACKGROUND

GEFCO, with over 10,000 employees present in 100 countries, is one of the top ten logistics groups in Europe.1 The company provides multimodal transport and end-to-end supply chain services for industrial clients in the automotive, two-wheel vehicle, electronics, retail, and personal care sectors. GEFCO has been very innovative in its information systems, integrating best-of-breed logistics ERP, and is concerned with providing transparency, interactivity and excellence for its workforce and clients.
13.2 A TRACK & TRACE SOLUTION

GEFCO wished to rebuild the Track & Trace solution for their automotive transport service. This service entails transporting automotive vehicles from factories to dealers, with GEFCO being responsible for the whereabouts of 7 million vehicles on any given day. An average of 100,000 new events concerning these vehicles are logged each day. Their database system consolidates all logistical movements of the cars worldwide, and holds 3 terabytes of data.
13.3 EXISTING DRAWBACKS

Their existing Track & Trace system was built over an Oracle database system whose use was restricted to a limited number of authorized users. Even after 2 years of expensive optimization projects, the existing system exhibited a low level of performance: more than 1 minute response time per query, restricted access during work hours to accommodate offloading and to avoid conflicts between information requests and internal transaction processing, and data latency of up to 24 hours due to batch updating.
1 http://uk.gefco.net/; 3.5 billion euros in revenue in 2008
What is Track & Trace?
• Tracking: identifying an item’s current location (Where is Mr. Smith’s car?)
• Tracing: following the route an item has taken (Where has Mr. Smith’s car been?)
• Operational Reporting: real-time summary and analysis of activities (How many cars of type X are in transport?)

Applicable to any type of product and any industry supply chain, a superior track and trace system provides:
• A fully unified, organization-wide view of operations
• End-to-end pipeline visibility
• Real-time activity monitoring and reporting
• Workflow integration for just-in-time management

Collectively, these functions provide Operational Business Intelligence (OBI). With OBI, an organization can respond to risks and opportunities effectively and in real time.
GEFCO wished to extend access to a wider customer base, but knew that the existing performance problems would only worsen. The infrastructure's lack of agility was impeding business development.
13.4 OPTING FOR A SEARCH BASED APPLICATION
GEFCO decided to explore an alternative architecture for the company’s new Track & Trace service: a search based application. An internal study indicated that a Search Based Application would deliver better performance, agility, usability, and security—at a lower cost—than their current database-centered model, responding to these criteria:
• Global, real-time view of events (with differential and incremental indexing; see the sketch following this list)
• Simple, Google-style keyword search; natural language querying
• Rich reporting and drill-down on unlimited data facets
• Sub-second query processing
• Capacity to scale to any volume
• Security management at the search engine rather than the portal layer
• Rapid, non-intrusive deployment
• Reduced infrastructure cost
• Fast return on investment
• Agile, service-oriented architecture for fast adaptation to evolving needs
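As a rough sketch of the differential and incremental indexing named in the first criterion above, the idea is to push only events newer than a high-water mark into the index rather than re-extract the full 3-terabyte database. All names below are hypothetical stand-ins, not GEFCO's actual platform or schema.

```python
# Sketch of differential/incremental indexing: only Track & Trace events
# newer than the last indexed timestamp are pushed into the search index.

def fetch_events_since(ts):
    """Stand-in for: SELECT * FROM vehicle_events WHERE event_ts > :ts."""
    return []  # e.g. [{"vehicle_id": "VF1...", "event_ts": 1285.0, "status": "in transit"}]

class Indexer:
    """Stand-in for the SBA engine's document ingestion API."""
    def add_document(self, doc):
        pass

def refresh(index, high_water_mark):
    """One incremental pass; returns the new high-water mark."""
    for event in fetch_events_since(high_water_mark):
        index.add_document(event)                  # only the delta is indexed
        high_water_mark = max(high_water_mark, event["event_ts"])
    return high_water_mark

mark = 0.0
mark = refresh(Indexer(), mark)   # scheduled every few minutes in practice
```

Run on a short cycle, this kind of delta update is what later allowed the data refresh rate to be cut from 24 hours to 15 minutes.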
Figure 13.1: At GEFCO, before turning to a search based application, customers had to complete this form to perform a query. Some mandatory fields were potentially complex, and users often needed a printed sheet on hand with long codes (such as vehicle ID numbers) to re-type into the form—not a very usable or ergonomic interface.
13.5 FIRST PROTOTYPES
During the diagnostic phase, it was possible to produce an operational prototype, based on search engine technology, in only 10 days. From these search engine roots, it was immediately clear the platform would provide the Web-style simplicity GEFCO was seeking, with a single text box for launching complex queries, support for natural language processing, and ‘fuzzy’ results matching. After a trial period of a few weeks, GEFCO was also convinced that a search based application could efficiently address all the challenges linked to its complex database: maintaining high performance in the face of large data volumes and a large user base, furnishing real-time information, deploying rapidly, and supporting fast, non-intrusive modification. The GEFCO team was also pleasantly surprised by the platform's capacity to produce rich, on-the-fly operational reporting. They had not anticipated that a search engine could be so easily tuned to produce dynamic tables, charts, and graphs based on unlimited data facets. And the familiarity of the Web search engine interface meant users could simply sit down and start working with the tool, with no training needed.
Figure 13.2: GEFCO Track & Trace interface with operational reporting and search refinement.
Finally, the engine provided the strong, integrated security GEFCO required: security was enforced down to the metadata level, and managed at the search platform level rather than in the portal application layer.
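The ‘fuzzy’ results matching mentioned above can be illustrated with a toy edit-distance check over long codes such as vehicle IDs. A production engine would use index-side n-gram or automaton techniques rather than this brute-force scan, so treat this purely as a sketch of the idea.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def fuzzy_lookup(query, known_ids, max_typos=2):
    """Return every known code within max_typos edits of the query."""
    return [v for v in known_ids if edit_distance(query, v) <= max_typos]

ids = ["VF1KZ0C0648123456", "VF1KZ0C0648123465", "WVWZZZ1JZXW000001"]
print(fuzzy_lookup("VF1KZ0C0B48123456", ids, max_typos=2))
# -> ['VF1KZ0C0648123456']  (one mistyped character still matches)
```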
13.6 DEPLOYMENT
A production installation of the new portal serving several thousand users was deployed in just a few weeks, with a full version serving all users launched in 6 months. With this release, GEFCO was able to scale information access without having to scale the underlying database infrastructure at considerable cost. It was also able to guarantee the future scalability and adaptability of its applications while preserving its existing infrastructure investments: the new platform runs on a basic Linux server farm while the original database infrastructure remains on a high-capacity Unix cluster (IBM pSeries). The portal can thus be adapted and scaled as requirements evolve at a markedly reduced cost. Other benefits of the new installation include:
• An accelerated data refresh rate, cut from 24 hours to 15 minutes
• A 50% cut in the cost per user
• A maximum query processing time of 2 seconds
• A vast increase in users—with no end user training
• A 99.98% availability rate with a limited material investment
• A large improvement in information accessibility, with a simpler user interface, exposure of far more data facets, and the addition of quasi-real-time operational reporting
Figure 13.3: Perceived improvement of customer satisfaction using a search based solution over a database solution, displayed in a radar chart (0–10 scale, with one series for the classic database and one for the search engine) [Basu, 2004]. The seven axes of the GEFCO comparative study are: Functions (filtering capacity of data, management of profiles, tolerance of errors); Agility (rapidity of development, maintenance, ease of installation, setting up parameters, deployment, interoperability); Robustness-Security; Ergonomics (ease of use without preliminary training); Durability (capacity of framework, reusability of sockets and components, reliability of subcontractor, simplicity of architecture); Satisfying our needs (number of functions covered addressing our needs); and Opportunities (number of supplementary functions offered).
13.7 FUTURE
GEFCO continues to add new operational reporting functionalities, opening a wider window on production data. Performance has not been impacted as new functions are added, even though the client base continues to expand. The company is preparing to deploy the same model for automotive suppliers, helping them optimize the provisioning of parts to automobile factories, an effort that involves a database 10 times more voluminous than the vehicle Track & Trace system.
CHAPTER 14
Case Study: Urbanizer
Figure 14.1: Urbanizer combines database directory listings, Web and UGC content, sentiment analysis and social networking to create the first ‘mood-based’ local search application.
Urbanizer is a new iPhone application from Yellow Pages Group (YPG), Canada’s leading performance media and marketing solutions company. Urbanizer is the directory publishing industry’s first ‘mood-based’ local search application. The Search Based Application is fed by both database and Web content, and combines search, sentiment analysis and social networking to help consumers find the perfect local restaurant according to their mood (“Tonight, I’m in the mood for an authentic, cozy Italian restaurant.”).
14.1 BACKGROUND
Urbanizer was conceived after the YPG team viewed Restminer, a proof-of-concept (POC) that demonstrated how an SBA mash-up could transform a simple restaurant directory into a full-featured consumer guide—automatically and inexpensively (see Figure 14.2).
Figure 14.2: Restminer, a proof-of-concept for an SBA-powered restaurant directory, merges Web content, database listings and geodata, and applies sentiment analysis to the Web content.
Restminer uses semantic processing to collect, analyze and normalize large volumes of diverse content and to present it in a pertinent, consistent manner. Source data includes database content (restaurant listings in a directory database), website content (media reviews, photos, details like opening hours, prices, menus, payment options, etc.), user-generated content (opinions, ratings, blog posts, etc.) and geospatial data for mapping. The system further applies sentiment analysis to the aggregated content to summarize opinions about service, ambiance, and food quality in a tag cloud. The result is an ultra-rich (and search engine-friendly) directory that aids decision making and encourages discovery with 360-degree views of restaurants.
14.2 THE URBANIZER SOLUTION
While the Restminer POC functions as a general consumer guide, Urbanizer functions as a recommendation engine that makes suggestions based on a user’s mood, location, and the preferences of their own social network. Users can choose from a selection of pre-defined moods such as “romantic dinner” and “hipster snack,” or use Urbanizer’s equalizer function to create a custom mood based on combinations of cuisine, ambiance and service categories.
14.3 HOW URBANIZER WORKS
Urbanizer matches the comprehensive restaurant information from Yellow Pages Group with quantitative and qualitative data distilled from unstructured Web content, including consumer comments posted to Urbanizer. The SBA analyzes restaurant user reviews and editorial content to dynamically assign a “mood” to locations. It utilizes Facebook Connect to build the social graph into the app as well as to broadcast information back through the Facebook news feed. Users can share their experiences with friends, tag establishments they’ve visited and assign them a mood, or identify locations as ‘favorites’. As each Urbanizer member interacts with the database and their own social network to refine their search and share experiences, a color-coded “mood map” of an entire city is constructed for the benefit of the whole Urbanizer community.
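As a drastically simplified illustration of that mood-assignment step, the sketch below scores review text against a tiny hand-made ‘cozy’ lexicon and normalizes the result to a 0–1 slider value. The lexicon and weights are invented for illustration; Urbanizer’s actual semantic sentiment analysis is far richer.

```python
# Toy sketch of mood scoring: map review text onto a 0-1 "cozy" slider
# by counting weighted lexicon hits. The lexicon is invented.
import re

COZY_LEXICON = {"cozy": 1.0, "intimate": 0.9, "warm": 0.7,
                "noisy": -0.8, "crowded": -0.6, "sterile": -0.9}

def cozy_score(reviews):
    hits, total = 0.0, 0
    for text in reviews:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in COZY_LEXICON:
                hits += COZY_LEXICON[word]
                total += 1
    if total == 0:
        return 0.5                      # no evidence: neutral slider position
    return max(0.0, min(1.0, 0.5 + hits / (2 * total)))

reviews = ["Warm lighting and an intimate room", "A bit noisy on weekends"]
print(round(cozy_score(reviews), 2))    # -> 0.63
```

In practice such scores are aggregated over thousands of reviews per establishment, which is what keeps the sliders in Figure 14.3 stable.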
14.4 WHAT’S NEXT
Urbanizer is now available for download in the iTunes App Store™ as a beta version in English and French, and is focused on restaurants in Montreal, Toronto, Vancouver, Calgary and Ottawa. Future versions will see cities added based on user votes, as well as categories expanded to include nightlife, shopping, activities and events. This type of rich, emotive search grounded in social networking carries great potential for numerous sectors, including hospitality, travel, entertainment and personal services.
Figure 14.3: Sentiment analysis and dynamic classification and clustering distill thousands of comments and reviews into three simple mood-based sliders. Semantic technologies are also used to identify, extract and normalize entities embedded in unstructured content, in this case, cuisine type and pricing information.
Figure 14.4: A mash-up API enables the application to plug into mapping data and functionality at query time.
Figure 14.5: Each Urbanizer member interacts with the database and their own social network to refine their search and share experiences.
CHAPTER 15
Case Study: National Postal Agency

The versatility of Search Based Applications can be seen in their multi-purpose deployments within a single organization, in this case, a national postal agency. Over the past two years, this agency has deployed three different SBAs to meet several mission-critical needs, with additional projects under development. Current projects include:
• A Customer Service SBA: provides a single, unified information access point and operational reporting platform atop 10 heterogeneous databases. It offers a single solution for meeting the needs of three different audiences: 1) search, retrieval and updating for the Front Office (350 operators in 7 call centers), 2) complex queries and updates for the Back Office (25,000 agents in 3,300 mail facilities), and 3) operational reporting for Management.
• An Operational Business Intelligence SBA: monitors and reports on 62 billion intra- and inter-site mail traffic flow events annually involving 180,000 people—in real-time. The system has given the agency the ability to optimize processing and distribution in real-time, and is enabling it to develop new premium customer services.
• A Sales Support SBA: the Sales Information SBA for Telemarketing unifies access to all relevant customer information (from a Siebel CRM and other internal applications) while providing the Web-style usability and simplicity agents need. It also provides the sub-second responsiveness telemarketing requires, with the system response rate cut from 15 seconds to 1/2 second in migrating to an SBA framework.
15.1 CUSTOMER SERVICE SBA
15.1.1 BACKGROUND
The agency had established a strategic plan for 2008-2012 that included aggressive new customer service objectives stressing responsiveness, promptness and reliability. To meet these objectives, the agency needed a new Customer Service Information System (CSIS) that could replace multiple siloed support systems with a comprehensive solution for rapidly analyzing and responding to all customer inquiries and requests, whether received by mail, Internet or telephone. In 2008, the agency opted to deploy an SBA instead of a traditional database application for its new CSIS. The solution is built entirely around the concept of documents enriched over time, stored in a database and recovered through search. The search engine provides the core information access layer and is situated within a larger framework encompassing:
• An open source infrastructure (Linux, MySQL, PHP)
• Voice over Internet Protocol (VoIP)/SIP for call centers
• Standard Web formats & protocols (REST, Atom RSS)
• Web-based access for all users
15.1.2 DEPLOYMENT
Employing an SBA model enabled the agency to deploy an operational version serving seven call centers in only 90 days, followed by 4 iterative releases over the next 6 months. In addition to a rapid time to market, the SBA model enabled the agency to:
• Provide unified information access, delivered instantly. The SBA provides a single point of access to real-time data collected and normalized from 10 source databases, with an average query processing time of 500 milliseconds.
• Deliver Web-style simplicity of use. The application features a single text box for launching natural language queries against the unified database content, with user aids like approximate (fuzzy) matching and spelling suggestions. Refinement of results is achieved through faceted navigation. These facets and the native statistical computations executed by the search engine are used to generate reporting for Back Office staff and OLAP-style reporting on key indicators for Management (a sketch of this facet counting follows the list).
• Preserve existing investments while gaining agility. The SBA was deployed non-intrusively alongside existing systems. No modifications were required to existing databases; high-level schema and fuzzy ontology matching were used to normalize the data across systems. The flexible data model, independent data layer and SOA architecture support rapid, non-intrusive modification as well.
• Achieve high performance at a low cost. With a small footprint and the use of low-cost commodity hardware, the SBA has a low TCO. It can also scale linearly and cost-effectively by adding additional low-cost servers.
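Here is a minimal sketch of the facet counting referenced in the list above: the statistical primitive behind both faceted refinement and OLAP-style reporting. The records and field names are hypothetical.

```python
# Minimal sketch of facet counting over a set of matching records.
from collections import Counter

def facet_counts(hits, facet_fields):
    """Count, per field, how many matching records carry each value."""
    return {f: Counter(h[f] for h in hits if f in h) for f in facet_fields}

hits = [  # records already matched by the search engine
    {"channel": "telephone", "region": "north", "status": "open"},
    {"channel": "mail",      "region": "north", "status": "closed"},
    {"channel": "telephone", "region": "south", "status": "open"},
]
print(facet_counts(hits, ["channel", "status"]))
# {'channel': Counter({'telephone': 2, 'mail': 1}),
#  'status': Counter({'open': 2, 'closed': 1})}
```

The same counts that drive the refinement links in the user interface can be charted directly, which is why faceted engines get OLAP-style reporting nearly for free.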
Figure 15.1: This cost-effective SBA (Search platform + REST/Atom + LAMP) provided an agile framework for meeting the needs of Front Office agents, Back Office staff and Management. (Architecture & schema by ARMOSC; systems integration by Capgemini.)
15.2 OPERATIONAL BUSINESS INTELLIGENCE (OBI) SBA
15.2.1 BACKGROUND
After seeing the effectiveness of a search platform in providing IR against large volumes of data in real-time using minimal resources, the agency began to envision using an SBA to exploit data it had recently begun collecting from its mail sorting machines. This large and complex body of data encompasses 3 to 5 events per letter for the 55 million mail pieces treated daily. If the agency could find a way to efficiently access and report on this data, it could achieve full pipeline visibility into its sorting logistics, helping it to optimize processing and distribution in real-time.
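A quick back-of-the-envelope check shows what these figures imply for the ingest pipeline; the inputs come straight from the text, and the derived rates are approximate.

```python
# Rough scale check for the OBI SBA, using the figures quoted above.
pieces_per_day = 55_000_000
events_per_piece = (3, 5)   # low and high estimates

daily = tuple(pieces_per_day * e for e in events_per_piece)
yearly = tuple(d * 365 for d in daily)
per_second = tuple(d / 86_400 for d in daily)

print(f"events/day:    {daily[0]:,} - {daily[1]:,}")        # 165M - 275M
print(f"events/year:   {yearly[0]:,} - {yearly[1]:,}")      # brackets the 62 billion cited
print(f"events/second: {per_second[0]:,.0f} - {per_second[1]:,.0f}")  # ~1,900 - 3,200
```

A sustained average of a few thousand events per second, around the clock, is the real-time indexing load the platform must absorb before serving any query traffic.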
15.2.2 DEPLOYMENT
The SBA developed in response to this need incorporates more than just data flowing from production equipment like sorting machines. It aggregates production data with internal databases and reporting systems to deliver monitoring and reporting on 62 billion intra- and inter-site mail traffic flow events annually involving 180,000 people—in real-time. As is the hallmark of SBAs, the new system hides the complexity and scale of the underlying data beneath a simple-to-use Web interface (Figure 15.2). Key functions provided include:
• A unified, organization-wide view of operations
• End-to-end pipeline visibility
• Individual mail piece tracking
• Workflow support for just-in-time management

The SBA has enabled the agency to meet its prime objective—optimizing processing and distribution in real-time—as well as providing a platform capable of supporting new premium customer services. Initiatives under consideration include secure, virtual P.O. Boxes for receiving and storing important documents, SMS push messaging for deliveries, a track-and-trace service for letters, delivery of physical mail via email (mail2email), and, for high-volume commercial clients, complete mail campaign management services. The capacity to provide easy IR against large volumes of real-time data makes SBAs like this one well-suited to use by logistics operators outside the postal industry, including freight companies, retail distributors, and Supply Chain Management service providers.
15.3 SALES INFORMATION SBA FOR TELEMARKETING
15.3.1 BACKGROUND
When Front Office sales staff saw the improved access enjoyed by their support counterparts, they requested a similar improvement to the systems supporting their own inside sales mission. Like service representatives, sales staff often had to jump back and forth between multiple siloed applications over the course of a single customer interaction. And, while their support counterparts received answers in less than half a second, it could take up to 15 seconds for the sales staff’s Siebel CRM system to retrieve customer records—an unacceptable delay when a prospect is on the telephone.
15.3.2 DEPLOYMENT
Work began on the new Sales Information SBA in September of 2009, and the system launched in January of 2010. The platform indexes data from the Siebel CRM system and other internal applications, automatically normalizing, classifying and ranking data along the way.
Figure 15.2: This OBI SBA aggregates data across an entire postal organization, extracting data from business systems, data warehouses, and production equipment like sorting machines to produce a highly visual, simple-to-use Web application that monitors and reports on millions of daily events—in real-time and with sub-second responsiveness.
To speed development, the agency was able to adapt the CSIS interface to the needs and work habits of the telemarketing team. The system also provides agents with the text-box simplicity and sub-second responsiveness they have become accustomed to on the Web.
Figure 15.3: Navigation of database CRM systems can be complex and time-consuming, with an interface built around a system of tabs and forms that requires a working knowledge of the underlying database schema. Performance can also be sluggish in the case of heavy usage and large volumes of data.
Figure 15.4: The use of an SBA enabled the agency to break through usability and performance barriers associated with the primary CRM database as well as to add context and relevance by aggregating CRM data with data from other enterprise systems.
CHAPTER 16
Future Directions

Given the ubiquitous need for unified, understandable access to large volumes of heterogeneous data, and the rapidity and cost-effectiveness of Search Based Application platforms in meeting this need, one can anticipate a sharp increase in the number of search based applications deployed in the enterprise and over the Internet in the coming decade. It is a trend that dovetails not only with the ever-increasing volume, diversity and complexity of information stores, but also with the general shift toward task-based Information Retrieval. Workers and consumers trying to perform a task or make a decision need not only access to the full breadth of information available, but, by necessity, to have it sharply distilled and presented in a tightly focused context.
Figure 16.1: Qualities of a search based solution.
In addition to seeing more SBAs deployed in general, one can also count on a proliferation in the number of pre-packaged vertical SBA solutions available in the marketplace. Many purchasers and users will not know (or care) that these applications are built upon a search infrastructure.
If one canvasses current vertical SBA offerings, one can see that even vertical SBA vendors are fuzzy about what exactly is under the hood. Given the fast-moving (and sometimes confusing) convergence of search and database technologies, and business’s primary interest in results, this is understandable. One can also anticipate accelerated development of the underlying search technology evolutions that have given rise to SBAs. In particular, look for continual improvements in these areas:
1. The capacity to extract, retain and exploit the semantics of structured data
2. The ability to structure unstructured data
3. Access to multimedia content
4. Innovation in information visualization

As always in the search arena, these advances will be fueled by evolutions on the Internet. Some of these influences include the Deep Web, the Semantic Web and the Mobile Web.
16.1 THE INFLUENCE OF THE DEEP WEB
16.1.1 SURFACING STRUCTURED DATA
Search engine advances in structured data handling will continue to arise from efforts to surface and exploit more content from the Deep Web (also called the Hidden Web or Invisible Web). The Deep Web refers to the massive volumes of Web content that are available through Web interfaces but which are not indexed by search engines. This ‘hidden’ content largely resides in databases, and is only accessible in dynamic pages generated after a user inputs data or makes selections using a form.1 There are two basic approaches to surfacing this content: automatic harvesting and voluntary surfacing. In the first case, search engines continue to pursue new statistical algorithms and semantic strategies to surface more of this content by determining from context what inputs are appropriate for a given form (and how to avoid bogging down either the engine or the site in so doing), and indexing the resulting Web pages. In the case of voluntary surfacing, site creators, particularly those in media, publishing and ecommerce, devise new ways (including the use of SBAs) to generate search engine-accessible pages from database content, or conversely, as in the case of repositories such as public libraries or research organizations, to open access to their records via standard connectors or APIs.2

1 Some studies indicate the volume of Deep Web content is at least 1,000 times greater than the surface Web. See “Exploring a ‘Deep Web’ that Google Can’t Grasp,” New York Times, February 23, 2009, http://www.nytimes.com/2009/02/23/technology/internet/23search.html. There is also Web-accessible information that is not indexed because it sits in password-protected intranets, in uncrawled islands of sites not linked to sites already reached by Web crawlers, or behind paywalls.
2 See, for example, Google’s Fusion Tables project, which allows users to upload structured data and have that data automatically merged with that of other users and presented in the form of HTML tables or maps (http://tables.googlelabs.com/), or the collaborative engines listed at http://www.makeuseof.com/tag/10-search-engines-explore-deep-invisible-web/
Look for SBA engine platform enhancements derived from both types of Deep Web surfacing. While the high-level goal of automated harvesting may be simply to obtain HTML pages that can be inserted into the search engine index, the effort has been accompanied by a collateral drive to make better use of the semantics of this growing body of structured data. In the case of voluntary surfacing, look for improvements (including greater semantic awareness) in the connectors and APIs used by SBA platforms, technologies largely shaped by Web standards.
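As a highly simplified sketch of the automated-harvesting approach (the URL and parameter are hypothetical, and real crawlers add politeness delays, deduplication, and context-driven input selection), the core loop probes a database-backed form with candidate values and hands the generated pages to the indexer.

```python
# Simplified sketch of Deep Web surfacing: probe a database-backed form
# with candidate inputs and collect the generated pages for indexing.
import urllib.parse
import urllib.request

def surface(form_url, param, candidate_values, max_pages=50):
    pages = []
    for value in candidate_values[:max_pages]:
        url = f"{form_url}?{urllib.parse.urlencode({param: value})}"
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                pages.append((url, resp.read()))  # hand off to the indexer
        except OSError:
            continue  # skip unreachable or rejected probes
    return pages

# Candidate values might come from the site's own vocabulary, a seed
# taxonomy, or terms mined from already-indexed surface pages.
results = surface("https://library.example.org/search", "subject",
                  ["history", "chemistry", "cartography"])
```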
16.1.2 OPENING ACCESS TO MULTIMEDIA CONTENT
In addition to form-generated content, the Deep Web also encompasses content that is search engine-inaccessible for other reasons. One prime example is multimedia content. Though there has been a veritable explosion in the amount of digital multimedia content created and stored in both the enterprise and on the Internet, most of it can only be accessed if someone has taken the trouble to manually tag it with useful titles, descriptions and other metadata. As pointed out in Chapter 6, search engines are increasingly coupling semantic processing with technologies like auto-transcription, object recognition and OCR scanning to open up access to this type of content. As the volume and importance of multimedia content are increasing exponentially in both the enterprise and on the Internet, look for bi-directional advances in these technologies from both of these domains.
16.2 THE INFLUENCE OF THE SEMANTIC WEB
While Deep Web initiatives focus principally on exposing the structured data already present on the Web, the Semantic Web refers to efforts to create structure where none exists. As postulated by World Wide Web Consortium (W3C) director Tim Berners-Lee in 2001,3 the original vision for the Semantic Web was to evolve a body of standards (high-level ontologies, interchange protocols and representational formats) to lend meaning and relevance to data on the Web, to make it easier to reuse and share that data, and to automate more of the routine IR tasks carried out by users. Given that this original vision relied on the voluntary and consistent use of standards at a global level, no monolithic “Semantic Web” has surfaced (nor likely ever will). However, the technologies rooted in the W3C’s efforts to advance the Semantic Web (RDF, OWL, SPARQL, RDFa, SKOS, RDFS, GRDDL, POWDER, RIF, SAWSDL, etc.) have been influential in shaping semantic models and architectures both on the Web and in the enterprise.

Efforts to address the limitations of this top-down approach have produced even more semantic innovation. These efforts seek to construct the Semantic Web from the bottom up, focusing on how best to automatically relate high-volume corpora that use inconsistent or imprecise ontologies, or none at all. Attendant research in fuzzy ontology matching; semantic categorization and clustering; rule-based, Bayesian and SVM classification; and other areas related to the dynamic mapping of taxonomies and fields has shaped, and will continue to shape, both Web search engines and SBA platforms.

3 Berners-Lee, Tim; James Hendler and Ora Lassila, “The Semantic Web,” Scientific American Magazine, May 17, 2001, http://www.sciam.com/article.cfm?id=the-semantic-web&print=true.
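To make the standards stack a little more concrete, the sketch below stores a few RDF triples and queries them with SPARQL. It assumes the third-party rdflib Python library, and the ex: vocabulary is invented for illustration.

```python
# A minimal taste of Semantic Web standards in practice: a few RDF
# triples queried with SPARQL. Requires the third-party rdflib library;
# the ex: vocabulary is invented for illustration.
from rdflib import Graph

TRIPLES = """
@prefix ex: <http://example.org/schema#> .
ex:chez_marie a ex:Restaurant ; ex:city "Montreal" ; ex:cuisine "Italian" .
ex:le_nord   a ex:Restaurant ; ex:city "Quebec"   ; ex:cuisine "French"  .
"""

g = Graph()
g.parse(data=TRIPLES, format="turtle")

# SPARQL plays roughly the role for RDF graphs that SQL plays for tables.
q = """
PREFIX ex: <http://example.org/schema#>
SELECT ?r ?cuisine WHERE {
  ?r a ex:Restaurant ; ex:city "Montreal" ; ex:cuisine ?cuisine .
}
"""
for row in g.query(q):
    print(row.r, row.cuisine)
```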
16.3 THE INFLUENCE OF THE MOBILE WEB
16.3.1 MISSION-BASED IR
Another broad trend affecting the development and adoption of SBAs is the ascendancy of non-PC devices (phones, tablets, car electronics, gaming and entertainment devices, etc.) over PCs for global Internet consumption.4 As Steve Rubel pointed out in a recent Advertising Age article, “Mobile devices, by their nature, force users to become more mission-oriented. As more internet consumption shifts to gadgets, it’s increasingly becoming an app world (as opposed to a search world) and we just live in it...single-purpose utility will rule while grandiose design and complexity will fall by the wayside.”5 This task-based usage requires hiding away the complexity and scale of information systems and delivering to users exactly what they need, in a form they can immediately digest. This is the strength of SBAs, and the primary reason Internet SBAs such as Urbanizer will proliferate in the years to come. Look also for the highly personal, socially connected and at-your-fingertips style of mobile SBA apps to exert increasing influence over user expectations for conventional enterprise apps (CRM, ERP, SCM, etc.).
16.3.2 INNOVATION IN VISUALIZATION
These expectations will likely include increasingly visual forms of interacting with information. In particular, visual forms of summarizing information already employed by SBA engines, like dynamic charts and graphs, tag clouds, sliders and wheels, geospatial maps, relationship maps, semantic maps, etc., will become increasingly important, with new forms arising as search frontiers and mobile device usage expand.
16.4 ...AND CONTINUING DATABASE/SEARCH CONVERGENCE
Just as the boundaries between the Internet and the enterprise become more blurred by the day, one can also anticipate a continued convergence of the database and search domains. This includes both technical convergence and increased couplings in the field to fill in the gaps for convergences underway (or those that may never come to pass). Look for SBA engines to boost their OLAP capabilities and to incorporate more database-inspired storage and retrieval structures (including elements from both relational and NoSQL systems), if not to be deployed in tandem with auxiliary RDBMS systems like MySQL. Look for RDBMS to improve its scalability, but to see its role increasingly focused on OLTP rather than general information search, access and reporting. Look for NoSQL systems to gain ground as backbones for large-scale Web and Cloud services, but to remain infrequently used in rank-and-file business and Web applications, as their effectiveness in that context usually requires a coupling with a search engine and/or RDBMS to achieve the same results available out-of-the-box with a single SBA platform.

4 Morgan Stanley notes that the mobile Internet is ramping faster than desktop Internet did, and predicts more users may connect to the Internet via mobile devices than desktop PCs within 5 years. “The Mobile Internet Report,” Morgan Stanley Technology Research, December 2009, http://www.morganstanley.com/institutional/techresearch/mobile_internet_report122009.html.
5 Steve Rubel, “It’s Time to Prepare for the End of the Web as We Know It,” Advertising Age, July 12, 2010, http://adage.com/digital/article?article_id=144867.
Bibliography

Abney S. “Parsing by chunks.” In Berwick R, Abney S, and Tenny C, editors, Principle-based Parsing. Kluwer Academic Publishers, 1991. 36

Agrawal S, Chaudhuri S, Das G. “DBXplorer: A System for Keyword-Based Search over Relational Databases”, 18th International Conference on Data Engineering (ICDE’02), pp. 5–16, 2002. DOI: 10.1109/ICDE.2002.994693 49

Agrawal R. “Search and Data Management”. Summary presentation at the Claremont Database Research Self-Assessment Meeting, Berkeley, CA, May 29-30, 2008, http://db.cs.berkeley.edu/claremont/

Agrawal R et al. “The Claremont Report on Database Research”. Report issued by the Claremont Database Research Self-Assessment Meeting, Berkeley, CA, May 29-30, 2008, http://db.cs.berkeley.edu/claremont/ DOI: 10.1145/1462571.1462573 62

Andrews W. “Gartner MarketScope for Enterprise Search,” Gartner, Inc., ID Number: G00206087, November 2010. 3

Andrews W and Knox RE. “Gartner Magic Quadrant for Information Access Technology,” Gartner, Inc., ID Number: G00161178, 2008. 61, 62

Angles R and Gutierrez C. “Survey of graph database models.” ACM Computing Surveys, vol. 40, no. 1, pp. 1–39, 2008. DOI: 10.1145/1322432.1322433 27

Baeza-Yates R and Ribeiro-Neto B. Modern Information Retrieval. Second Edition. Addison Wesley Longman, 2010. 36

Basu R. Implementing quality: A practical guide to tools and techniques. Thomson, London, 2004. 87

Bernstein P and Newcomer E. Principles of transaction processing, Morgan Kaufmann, Chapter 6, 2009. 30

Bizer C, Heath T and Berners-Lee T. “Linked Data – The Story So Far”. In Heath T, Hepp M, Bizer C (eds) International Journal on Semantic Web and Information Systems (IJSWIS), vol. 5, no. 3, pp. 1–22, 2009. 41

Boag S, Chamberlin D, Fernandez MF, Florescu D, Robie J and Simeon J. “XQuery 1.0: An XML query language”. Technical report, World Wide Web Consortium, http://www.w3.org/TR/xquery/ January 2007. 49
Bonifati A, Cattaneo F, Ceri S, Fuggetta A and Paraboschi S. “Designing data marts for data warehouses”, ACM Transactions on Software Engineering Methodology, vol 4, pp. 452–483, 2001. DOI: 10.1145/384189.384190 13

Brewer E. “Towards Robust Distributed Systems”, Keynote Address, Annual ACM Symposium on Principles of Distributed Computing (PODC), July 19, 2000. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf DOI: 10.1145/343477.343502 24

Brin S and Page L. “Anatomy of a large-scale hypertextual web search engine”. In Proceedings of the 7th International World Wide Web Conference, pp. 107–117, Brisbane, Australia, Apr. 14–18, 1998. DOI: 10.1016/S0169-7552(98)00110-X 12, 43

Büttcher S, Clarke CLA and Cormack GV. Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010. 18

Cafarella MJ, Madhavan J and Halevy A. “Web-scale extraction of structured data.” SIGMOD Rec., vol. 37, no. 4, pp. 55–61, 2009. DOI: 10.1145/1519103.1519112 41

Carey MJ, Haas LM, Kleewein J and Reinwald B. “Data access interoperability in the IBM database family.” IEEE Quarterly Bulletin on Data Engineering; Special Issue on Interoperability, vol. 21, no. 3, pp. 4–11, 1998. 24

Caverlee J and Liu L. “Countering web spam with credibility-based link analysis.” In Proceedings of the Twenty-Sixth Annual ACM Symposium on Principles of Distributed Computing, Portland, Oregon, August 12-15, PODC ’07. ACM, New York, NY, pp. 157–166, 2007. DOI: 10.1145/1281100.1281124 37

Chang F, Dean J, Ghemawat S, Hsieh W, Wallach D, Burrows M, Chandra T, Fikes A, and Gruber R. “Bigtable: A distributed storage system for structured data.” In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, pp. 205–218, 2006. 26

Chaudhuri S and Dayal U. “An overview of data warehousing and OLAP technology.” SIGMOD Record, vol 26, 1997. DOI: 10.1145/248603.248616 13

Cheng J, Ke Y, and Ng W. “Efficient query processing on graph databases.” ACM Transactions on Database Systems (TODS), vol. 34, no. 1, April 2009. DOI: 10.1145/1508857.1508859 27

Clarke CLA and Cormack GV. “Shortest-substring retrieval and ranking.” ACM Trans. Inf. Syst., vol 18, no. 1, pp. 44–78, 2000. DOI: 10.1145/333135.333137 37

Claybrook B. On Line Transaction Processing Systems, John Wiley & Sons, 1992. 13

Cleverdon CW and Mills J. “The testing of index language devices”. Aslib Proceedings, vol 15, no. 4, pp. 106–130, 1963. DOI: 10.1108/eb049925 11
Codd EF. The Relational Model for Database Management, Addison-Wesley Publishing Company, 1990. 19

Comer D. “The ubiquitous B-tree.” ACM Computing Surveys, vol 11, no. 2, pp. 121–137, 1979. DOI: 10.1145/356770.356776 19

Copestake A and Sparck Jones K. “Natural Language Interfaces to Databases”, Knowledge Engineering Review, vol. 5, no. 4, pp. 225–249, 2005. DOI: 10.1017/S0269888900005476 10

Councill IG, Giles CL, Iorio ED, Gori M, Maggini M and Pucci A. “Towards next generation citeseer: A flexible architecture for digital library deployment”. In Research and Advanced Technology for Digital Libraries, ECDL 2006, pp. 111–122, 2006. DOI: 10.1007/11863878_10 19

DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P and Vogels W. “Dynamo: Amazon’s highly available key-value store.” In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles, ACM Press, New York, pp. 205–220, 2007. DOI: 10.1145/1294261.1294281 25

Dedrick J, Xu S and Zhu K. “How Does Information Technology Shape Supply-Chain Structure? Evidence on the Number of Suppliers.” Journal of Management Information Systems, vol 25, no. 2, pp. 41–72, 2008. DOI: 10.2753/MIS0742-1222250203 9

Deutsch P and Emtage A. “The archie System: An Internet Electronic Directory Service,” ConneXions, vol 6, no. 2, February 1992. 11

Fagin R, Kumar R, McCurley KS, Novak J, Sivakumar D, Tomlin JA and Williamson DP. “Searching the workplace web”, in Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, ACM, pp. 366–375, 2003. DOI: 10.1145/775152.775204 38

Feldman S and Reynolds H. “Worldwide Search and Discovery 2009 Vendor Shares and Update on Market Trends,” IDC #223926, July, 2010. 2, 3

Filo D and Yang J. Yahoo! Unplugged: Your Discovery Guide to the Web. Foster City: IDG Books Worldwide, 1995. 11

Gabrilovich E and Markovitch S. “Wikipedia-based semantic interpretation for natural language processing.” Journal of Artificial Intelligence Research, vol 34, pp. 443–498, 2009. DOI: 10.1613/jair.2669 38

Gantz J and Reinsel D. “The Digital Universe Decade – Are You Ready?”, IDC, May 2010. Sponsored by EMC Corporation. 9

Giannadakis N, Rowe A, Ghanem M and Guo Y. “InfoGrid: providing information integration for knowledge discovery,” Information Sciences, vol 155, no. 3–4, pp. 199–226, 2003. DOI: 10.1016/S0020-0255(03)00170-1 27
Gilbert S and Lynch N. “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services,” ACM SIGACT News, vol 33, no. 2, pp. 51–59, 2002. DOI: 10.1145/564585.564601 24

Grefenstette G. “Comparing two language identification schemes”. In Proceedings of the 3rd International Conference on Statistical Analysis of Textual Data (JADT’95), pp. 263–268, 1995. 35

Grefenstette G, Qu Y, Evans DA and Shanahan JG. “Validating the Coverage of Lexical Resources for Affect Analysis and Automatically Classifying New Words along Semantic Axes.” In Qu Y, Shanahan JG and Wiebe J (eds) Exploring Attitude and Affect in Text: Theories and Applications, AAAI-2004 Spring Symposium Series, pp. 1–78, 2004. DOI: 10.1007/1-4020-4102-0_9 41

Grover C, Matheson C, Mikheev A and Moens M. “LT TTT - A flexible tokenisation tool.” In LREC 2000: Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, pp. 1147–1154, 2000. 36

Guttman A. “R-Trees: A Dynamic Index Structure for Spatial Searching,” Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 47–54, 1984. DOI: 10.1145/971697.602266 19

Hasselbring W. “Information system integration,” Communications of the ACM, vol 43, no. 6, pp. 32–38, June, 2000. 15

Heydon A and Najork M. “Mercator: A scalable, extensible Web crawler.” World Wide Web, vol 2, no. 43, pp. 219–229, December 1999. DOI: 10.1023/A:1019213109274 29

Hull R. “Managing Semantic Heterogeneity in Databases: A Theoretical Perspective.” In SIGART Symposium on Principles of Database Systems, pp. 51–61, 1997. DOI: 10.1145/263661.263668 24

Ilyas IF, Aref WG and Elmagarmid AK. “Supporting top-k join queries in relational databases.” VLDB Journal, vol. 13, no. 3, pp. 207–221, 2004. DOI: 10.1007/s00778-004-0128-2 43

Ji H and Grishman R. “Refining Event Extraction through Cross-Document Inference.” In Proceedings of ACL-08: HLT, pp. 254–262, Columbus, OH, June, 2008. 40

Kazama J and Torisawa K. “Exploiting Wikipedia as External Knowledge for Named Entity Recognition”, in Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007. 39

Kofler M. The Definitive Guide to MySQL 5, 3rd edn. Apress, Berkeley, 2005. 38

Lamel L and Gauvain JL. “Speech processing for audio indexing,” GoTAL 2008 - Advances in NLP, LNCS 5221, pp. 4–15, Springer Verlag, 2008. 41

Laudon KC and Laudon JP. Essentials of Management Information Systems (9th ed). Englewood Cliffs, NJ: Prentice Hall, 2010.
Leavitt N. “Will NoSQL Databases Live Up to Their Promise?” Computer, vol 43, pp. 12–14, 2010. DOI: 10.1109/MC.2010.58 24, 32

Lew MS, Sebe N, Djeraba C and Jain R. “Content-based multimedia information retrieval: State of the art and challenges.” ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, no. 1, pp. 1–19, 2006. DOI: 10.1145/1126004.1126005 41

Mani I and Maybury MT (Eds.) Advances in Automatic Text Summarization, The MIT Press, 1999. 39

Manning C and Schütze H. Foundations of Statistical Natural Language Processing. MIT Press, 1999. 36, 38

Masterman M, Needham RM and Sparck Jones K. “The analogy between mechanical translation and library retrieval”, Proceedings of the International Conference on Scientific Information (1958), National Academy of Sciences - National Research Council, Washington, D.C., vol. 2, pp. 917–935, 1959. 11

McGuinness DL. “Ontologies come of age.” In Fensel D, Hendler J, Lieberman H, Wahlster W, editors, The Semantic Web: Why, What, and How. Cambridge: MIT Press, 2001.

Mitkov R. “Outstanding issues in anaphora resolution.” In Gelbukh A, ed., Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2001), Volume 2004 of Lecture Notes in Computer Science, Berlin, Springer, pp. 110–125, 2001. 40

Negash S and Gray P. Business Intelligence, Springer, Berlin, Heidelberg, 2008. 13

Nijssen G and Halpin T. Conceptual Schema and Relational Database Design: a fact oriented approach. Prentice-Hall, Sydney, Australia, 1989. 13

Page L, Brin S, Motwani R and Winograd T. “The PageRank citation ranking: bringing order to the Web.” Technical report, Stanford Digital Library Technologies Project, 1998. 37

Pinkerton B. “Finding What People Want: Experiences with the WebCrawler.” The Second International WWW Conference, Chicago, USA, October 17-20, 1994. 11

Qi X and Davison BD. “Web page classification: Features and algorithms.” ACM Comput. Surv., vol. 41, no. 2, 2009. DOI: 10.1145/1459352.1459357 41

Reese G. Database Programming with JDBC and Java. Sebastopol, CA: O’Reilly & Associates, 2000. 38

Russell SJ and Norvig P. Artificial Intelligence: A Modern Approach, 3rd edition, Prentice Hall, 2009. 41
Salton G. The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, Inc., 1971. 11

Salton G. Automatic Text Processing. Addison-Wesley, Reading, Mass., 1989. 36

Thomson A and Abadi D. “The Case for Determinism in Database Systems,” Proceedings of the VLDB Endowment, Vol. 3, No. 1, 2010. http://db.cs.yale.edu/determinism-vldb10.pdf 33

Tunkelang D. Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers, 2009. 14

Vicknair C, Macias M, Zhao Z, Nan X, Chen Y, Wilkins D. “A Comparison of a Graph Database and a Relational Database: A Data Provenance Perspective,” Proc. of the ACM Southeast Conference (ACMSE), Oxford, Mississippi, April 2010. 27

Wolf JL, Squillante MS, Yu PS, Sethuraman J, Ozsen L. “Optimal crawling strategies for web search engines.” In Proceedings of the 11th International World Wide Web Conference, pp. 136–147, 2002. DOI: 10.1145/511446.511465 29

Yoshikawa M, Amagasa T, Shimura T, Uemura S. “XRel: A path-based approach to storage and retrieval of XML documents using relational databases”, ACM Transactions on Internet Technology, vol 1, no. 1, pp. 110–141, August 2001. DOI: 10.1145/383034.383038
Authors’ Biographies

GREGORY GREFENSTETTE
Gregory Grefenstette is Chief Science Officer at Exalead. He received his B.S. from Stanford University in 1978, and a Ph.D. in Computer Science from the University of Pittsburgh in 1993. He has been Principal Scientist at the Xerox Research Centre (1993-2001), with Clairvoyance (2001-3) and at the French applied research centre, the CEA (2001-8). His research interests range from most subjects in Natural Language Processing to all aspects of Information Retrieval. He serves on the editorial board of the Journal for Natural Language Engineering, and he edited the first book on Cross Language Information Retrieval (Kluwer 1998).
LAURA WILBER
Laura Wilber has served as the CEO of a web development company specializing in online databases and as the VP of Marketing for a provider of SaaS software. She has also developed multimedia technology tutorials for intellectual property litigation, and worked in the federal systems engineering division of a telecommunications company. She earned her M.A. in English from the University of Maryland, where she studied medieval literature and linguistics in the PhD program. She now works as a writer and analyst at Exalead, S.A. Prior to joining Exalead, she taught business and technology courses at ISG (l’Institut Supérieur de Gestion). She lives in Paris with her husband, Joe Ross, and their children, Julien and Juliette.