<p>{0} ?o), zero-or-more paths (?s <p>* ?o), one-or-more paths (?s <p>+ ?o), and negated paths (?s !<p> ?o). Due to their complexity, we discuss only a subset of the expressivity of these path types. With respect to “relevant” data, one-or-more paths with simple predicates (those in which the + path operator applies to an IRI) can be reduced to triple pattern matching with predicate-bound triple patterns. For example, the path <s> <p>+ ?o has the same relevant data as the triple pattern ?s <p> ?o. Note that while the subject is bound to <s> in the path pattern, it must be unbound in the triple pattern equivalent, as the path may be affected by triples where the subject is not <s>. Zero-length paths and negated paths require special attention. The zero-length path connects a graph node (any subject or object in the graph) to itself. Therefore, any insertion (deletion) in a graph may affect the results to a zero-length
5 Some SPARQL systems allow empty named graphs to exist. Removing all triples from a named graph would not affect the set of graph names on such systems.
G.T. Williams and J. Weaver
path pattern by adding (removing) a node to the graph that didn’t exist before (doesn’t exist after) the update. Similarly, a negated path ?s !<p> ?o implies that any insertion or deletion not using the <p>
predicate may impact the results. While the relevant data for such a negated path is a subset of all the data, we assume that realistic datasets will contain a range of predicates and so the relevant data will be very large (in many cases approximating the size of the dataset itself). Therefore, we conservatively assume that the entire dataset is relevant to a negated path pattern.
DESCRIBE Queries. DESCRIBE queries present a challenge in determining relevant data. These queries involve matching a graph pattern just as SELECT queries do (a DESCRIBE query without a WHERE clause being semantically equivalent to one with an empty WHERE clause). However, the final results of a DESCRIBE query depend on the WHERE clause and the algorithm used for enumerating the RDF triples that comprise the description of a resource. A naïve DESCRIBE algorithm would be to return all the triples in which the resource appeared as the subject. For our purposes, this algorithm would make this query: DESCRIBE ?s WHERE { ?s a
Maintaining and Probing Cache Status
In this section we briefly describe the algorithms used during update operations to maintain the mtime field in the search tree. We then describe the probing algorithm used to determine the effective mtime of the relevant data for a specific query.
Enabling Fine-Grained HTTP Caching of SPARQL Query Results
Cache Maintenance. Maintaining the mtime field in the search tree is a simple process:
1. Before each tree node is written to disk (due to an insertion or deletion), update the node’s mtime to the current time.
2. For each node that is written to disk, write its parent to disk (thereby updating its parent’s mtime).
This process ensures our condition that every tree node’s mtime is greater than or equal to those of its descendants, so that it can be used as the effective mtime of descendant, relevant data. We distinguish between the effective mtime of data matching an access pattern and that data’s actual mtime. As discussed in Section 4.1, the specific data structure used for the search tree affects the granularity (and therefore the expected accuracy) of the effective mtime. Due to their design, tries yield effective mtimes that are exactly the same as the most recent mtime of data matching an access pattern. B+ trees yield effective mtimes of matching data that may be affected by any non-matching data that is co-located on a leaf node with matching data.
During the update process, we note that the parent node(s) may already need to be written to disk (in the case of a node split), so step 2 may already be required on any given update. Moreover, an update at a leaf node in append-only and counted B+ trees causes a cascade of writes up to the (possibly new) root. In these cases, all IO incurred by the cache update algorithm is already required by the update operation, and so the cost of maintaining the cache data is effectively free.
Cache Probing. The algorithm used for probing a database index to retrieve the effective mtime for a query is shown in Algorithm 1. Given a query and a set of available search tree indexes, for each access pattern in the query, the algorithm probes the index that will yield the most accurate effective mtime, and returns the most recent of the mtimes.
The index that will yield the most accurate effective mtime is the one with a key ordering that will allow descending as deep into the tree as there are bound terms in the access pattern. If no such index exists, a suitable replacement index is chosen that maximizes the possible depth into the tree that some subset of bound terms in the access pattern will allow. In the case of the completely unbound access pattern, the effective mtime is the same as the mtime of the entire dataset and so can be retrieved from the root node of any available index. While this algorithm describes how the effective mtime of a query may be computed, it is worth noting that the specific steps described may be implemented in more or less efficient ways. For example, the algorithm calls for finding the lowest common ancestor (LCA) of data matching the access pattern. For a system using B+ trees, a naïve implementation might traverse the tree to find the leaves with matching data and then walk up the tree to find the LCA. A more efficient implementation could avoid having to find all leaves with matching data by traversing tree edges until finding the LCA, using the bounds data contained in internal nodes.
Algorithm 1. Probe database for effective mtime of query results
Input: A SPARQL query graph pattern query, a set of available database indexes indexes
Output: effectiveMtime, the effective modification time of relevant data for the query
mtimes = ∅
foreach ap ∈ query do
    orderedIndexes = {i | i ∈ indexes, ∃ s ⊆ boundPositions(ap) s.t. the key order of i starts with s}
    if |orderedIndexes| > 0 then
        index = argmax_{i ∈ orderedIndexes} |s|
        n = LCA of data matching ap in index
        mtimes = mtimes ∪ {mtime(n)}
    else
        i = any index in indexes
        mtimes = mtimes ∪ {mtime(root(i))}
    end
end
effectiveMtime = Max(mtimes)
return effectiveMtime
As discussed above and in Section 4.1, the soundness of results is affected by the choice of the search tree data structure used. B+ trees produce less sound results as a result of maintaining less accurate effective mtimes. Tries produce more sound results as a result of being able to maintain accurate effective mtimes. Even though tries maintain accurate effective mtimes, their use does not guarantee perfectly sound cache validation, as updates that affect data relevant to a query may not change the results of that query. This can occur when the relevant updated data does not appear in the query results due to join conditions, filter expressions, or projection. In these cases, cache validation will fail and the query must be evaluated again, despite accurate results already being cached.
One final case that is worth noting is the special case of determining the effective mtime for an empty named graph pattern (GRAPH ?g ). As discussed in Section 4.2, this pattern returns the set of available named graphs. If the available indexes are all covering indexes (using key orders that are just permutations of subject, predicate, object, and graph), then there is no way to determine an accurate effective mtime for this pattern. However, if there is an available index over just G, an accurate effective mtime for the set of named graphs is stored in the G index root node.
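The maintenance and probing steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' C implementation: it uses an in-memory trie rather than an on-disk B+ tree, an integer logical clock instead of wall-clock time, and never removes nodes on deletion. The names `Node`, `MtimeTrie`, `touch`, `probe`, and `effective_mtime` are hypothetical, and access patterns are assumed to be dictionaries that map the bound positions (S, P, O, G) to terms.

```python
class Node:
    """Trie node holding the mtime of everything stored beneath it."""

    def __init__(self):
        self.children = {}
        self.mtime = 0  # logical clock value; 0 means "never modified"


class MtimeTrie:
    """Trie over quads in a fixed key order such as "SPOG" (assumed layout)."""

    def __init__(self, key_order):
        self.key_order = key_order
        self.root = Node()

    def touch(self, quad, now):
        """Record an insertion or deletion of `quad` at logical time `now`.

        Every node on the path from the root to the leaf receives the new
        mtime, so each node's mtime is >= those of its descendants.
        """
        node = self.root
        node.mtime = max(node.mtime, now)
        for position in self.key_order:
            node = node.children.setdefault(quad[position], Node())
            node.mtime = max(node.mtime, now)

    def bound_prefix_length(self, pattern):
        """Number of leading key positions that `pattern` binds."""
        depth = 0
        for position in self.key_order:
            if position not in pattern:
                break
            depth += 1
        return depth

    def probe(self, pattern):
        """Effective mtime for one access pattern."""
        node = self.root
        for position in self.key_order[: self.bound_prefix_length(pattern)]:
            child = node.children.get(pattern[position])
            if child is None:
                # Nothing was ever stored under this prefix: fall back to
                # the closest existing ancestor, which is conservative.
                break
            node = child
        return node.mtime


def effective_mtime(access_patterns, indexes):
    """Most recent effective mtime over all access patterns of a query."""
    mtimes = []
    for pattern in access_patterns:
        # Prefer the index that lets us descend deepest for this pattern;
        # with no bound prefix at all, probe() returns the root mtime.
        best = max(indexes, key=lambda ix: ix.bound_prefix_length(pattern))
        mtimes.append(best.probe(pattern))
    return max(mtimes, default=0)
```

With two indexes in SPOG and POSG order, a pattern that binds only the predicate is answered from the POSG index at depth one, while a fully unbound pattern falls back to the root mtime of whichever index is chosen.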
5
Evaluation
We evaluated the potential impact on performance of query result caching by implementing a simple SPARQL processor in C with B+ tree indexes. To index data, we use the six index orderings SPOG, SGOP, POGS, OGSP, OSPG, and GPSO. We evaluated our system using a slightly modified version of the Berlin SPARQL Benchmark⁶ (BSBM), using both the Explore (read-only) and the Explore and Update (read-write) use cases. All BSBM evaluation was performed on a dual Intel Xeon E5504 Quad Core 2.0GHz processor with 24GB of memory, with 5 warmup runs and 10 timed runs.
5.1 Modified Berlin SPARQL Benchmark
We believe the standard BSBM benchmark fails to account for the skewed distribution of real-world queries and so, following the work of Martin et al. [10], we modified the benchmark test driver to use a Pareto distribution for benchmark queries. The evaluation tests were performed with varying query repetition as represented by the α parameter. We also modified the benchmark test driver to support HTTP caching by storing query results when they are returned with caching headers, and validating existing cached query results using conditional requests.
5.2 Explore Use Case
The Explore use case of BSBM consists of a set (“query mix”) of read-only queries that simulate a consumer looking for product information in an e-commerce setting. In our evaluation, this use case tests performance gains from caching on a static dataset. No updates (neither relevant nor irrelevant) are performed and so once cached, query results are always valid. Figure 2 shows the performance improvement of our caching system on the BSBM Explore use case (as a percentage increase over the same tests run without the use of caching). The test was run with α values ranging from 0.1 to 4.0, and shows an increase of 35–650% in benchmark performance.
5.3 Explore and Update Use Case
The Explore and Update use case of BSBM consists of the same queries as in the Explore use case, with occasional updates to the dataset representing new products, reviews, and offers being added to, and old offers being removed from, the dataset. This use case tests performance gains from caching on both static datasets (intra-query mix) and updated datasets (inter-query mix, after an update set). The updates contain data both relevant and irrelevant to the queries in the Explore use case. Figure 3 shows the performance improvement of our caching system on the BSBM Explore and Update use case, again with α values ranging from 0.1 to 4.0, and shows an increase of 2–160% in benchmark performance.
5.4 Cost of Caching
To evaluate whether implementing our caching system increases overall processing cost, we evaluated the difference in performance between two versions of our system. In one version (“nocache”), we compile our system without any cache-supporting code. In the other (“cache-off”), caching support is included. However, both of these versions were tested against the Explore and Update use case with no caching support enabled in the test driver. In the cache-enabled version, this tests both the cost of cache probing to generate the Last-Modified response header during the explore phase and the cost of maintaining mtimes during the update phase. As can be seen in Figure 4, the cache-enabled system performs roughly the same as the version without caching (executing at times both slightly faster and slower than the baseline “nocache” system).
6 http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
[Figure 2: two panels (1M triples, 5M triples); x-axis: alpha (0.1 to 4.0); y-axis: Percent Change QMpH; series: cache-on]
Fig. 2. BSBM Explore Use Case
[Figure 3: two panels (1M triples, 5M triples); x-axis: alpha (0.1 to 4.0); y-axis: Percent Change QMpH; series: cache-on]
Fig. 3. BSBM Explore and Update Use Case
[Figure 4: x-axis: alpha (0.1 to 4.0); y-axis: QMpH; series: nocache, cache-off]
Fig. 4. Cost of Caching, BSBM Explore and Update Use Case
5.5 Discussion
The performance of this system is not competitive with existing SPARQL stores, which achieve dramatically higher scores on the Berlin SPARQL Benchmark. We attribute this to our system being a testing implementation meant to demonstrate our cache-supporting indexes and algorithms, not meant to compete head to head with production systems. Specifically, our system uses a very basic B+ tree implementation with no optimization to reduce disk IO, and lacks any query optimizer or memory management of database pages. We would have liked to evaluate our caching approach using a more efficient implementation, but found that modifying the low-level index structures with mtime fields and making those fields available through many layers of API abstractions was very difficult. Overall, we suggest attention should be paid to the large relative improvement of performance with query result caching, not to the specific QMpH figure of our implementation.
We believe our system’s lack of database page management may hurt overall caching performance. The ability to cache database pages in memory would not only improve overall performance, but in some cases would specifically improve performance of the cache query (probe) algorithm. Specifically, in cases where the upper levels of index nodes reside entirely in memory, accessing the LCA nodes that provide the mtime of relevant data may require no disk access whatsoever. Conversely, a highly optimized system tuned for very fast pattern matching and joins would narrow the performance gap between validating a conditional query with cache probing and evaluating the query in full. This situation would seem to provide less benefit from caching. However, this narrowing of the performance gap is only one aspect of the benefits from caching. Even on a very efficient implementation, caching would still reduce the network IO required to transfer the query results (which our evaluation does not address) and the memory usage on the server.
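The difference between validating a conditional request and evaluating a query in full can be illustrated with a small Python sketch of the server side. It is a sketch under assumptions rather than the evaluated system: `respond` and `evaluate_query` are hypothetical names, the effective mtime is assumed to be available as a timezone-aware UTC datetime obtained by cache probing, and only the Last-Modified and If-Modified-Since headers are handled.

```python
from datetime import timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, effective_mtime, evaluate_query):
    """Return (status, headers, body) for one query request.

    effective_mtime: timezone-aware UTC datetime from cache probing (assumed).
    evaluate_query:  callable that runs the full query (hypothetical).
    """
    # HTTP dates have one-second resolution, so sub-second precision is lost.
    last_modified = effective_mtime.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    since = request_headers.get("If-Modified-Since")
    if since is not None:
        try:
            client_time = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            client_time = None  # unparsable header: treat as unconditional
        if client_time is not None:
            if client_time.tzinfo is None:
                client_time = client_time.replace(tzinfo=timezone.utc)
            if last_modified <= client_time:
                # Relevant data unchanged: no query evaluation, empty body.
                return 304, headers, b""

    return 200, headers, evaluate_query()
```

In the 304 branch neither the query nor the result transfer is needed, which is the saving the discussion refers to; the 200 branch pays for probing in addition to full evaluation.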
6
Conclusion
Caching of SPARQL query results is a promising approach to improving scalability. In this paper we have shown that simple modification of the indexing structures commonly used in SPARQL processors can allow fine-grained caching of query results based on the freshness of data relevant to the query. We evaluated this system using the Berlin SPARQL Benchmark and found that caching can dramatically improve performance in the presence of repeated queries. Moreover, maintaining the data required for caching and using it to service conditional query requests has low cost compared to fully evaluating queries.
In the future we hope to apply the presented caching structures and algorithms to an existing, optimized SPARQL processor and evaluate it using much larger datasets and more expressive queries than those provided by BSBM. We also believe this work could be improved in many ways. The precision (and therefore the soundness) of the cache probing algorithm might be improved by taking into account the ways in which relevant data is combined and modified (e.g. using joins, filters, and projection). In many cases, typical queries use only a subset of available database indexes. Cache-enabled search trees for only the most frequent access patterns could be augmented with non-tree indexes, allowing the system to leverage the benefits of certain non-tree index structures while keeping the precision of caching for frequent queries. Finally, other indexing structures might also be modified to store similar fine-grained caching data, allowing informed indexing structure choices while maintaining the benefits of caching.
Acknowledgements. We thank Timothy Lebo and Lee Feigenbaum for their helpful comments and suggestions about this work.
References
1. Breslau, L., Cao, P., Fan, L., Phillips, G., Shenker, S.: Web caching and Zipf-like distributions: Evidence and implications. In: Proceedings of INFOCOM 1999, Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies (1999)
2. Gallego, M., Fernández, J., Martínez-Prieto, M., Fuente, P.: An empirical study of real-world SPARQL queries. In: USEWOD 2011 - 1st International Workshop on Usage Analysis and the Web of Data (2011)
3. Harth, A., Decker, S.: Optimized index structures for querying RDF from the web. In: Proceedings of the 3rd Latin American Web Congress (2005)
4. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. In: Proceedings of the VLDB Endowment Archive (2008)
5. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. In: Proceedings of the VLDB Endowment Archive (2008)
6. Harris, S., Lamb, N., Shadbolt, N.: 4store: The design and implementation of a clustered RDF store. In: Proceedings of the 5th International Workshop on Scalable Semantic Web Knowledge Base Systems, SSWS 2009 (2009)
7. Goldstein, J., Larson, P.: Optimizing queries using materialized views: a practical, scalable solution. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (2001)
8. Amiri, K., Park, S., Tewari, R., Padmanabhan, S.: DBProxy: A dynamic data cache for web applications. In: Proceedings of the 19th International Conference on Data Engineering, ICDE 2003 (2003)
9. Larson, P., Goldstein, J., Zhou, J.: MTCache: transparent mid-tier database caching in SQL Server. In: Proceedings of 20th International Conference on Data Engineering, pp. 177–188 (2004)
10. Martin, M., Unbehauen, J., Auer, S.: Improving the Performance of Semantic Web Applications with SPARQL Query Caching. In: Aroyo, L., Antoniou, G., Hyvönen, E., ten Teije, A., Stuckenschmidt, H., Cabral, L., Tudorache, T. (eds.) ESWC 2010. LNCS, vol. 6089, pp. 304–318. Springer, Heidelberg (2010)
11. Hartig, O.: How caching improves efficiency and result completeness for querying linked data. In: Proceedings of the 4th Linked Data on the Web (LDOW) Workshop (March 2011)
12. Guéret, C., Groth, P., Oren, E., Schlobach, S.: eRDF: A scalable framework for querying the web of data, pp. 1–17 (October 2010)
dipLODocus[RDF]—Short and Long-Tail RDF Analytics for Massive Webs of Data
Marcin Wylot, Jigé Pont, Mariusz Wisniewski, and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
{firstname.lastname}@unifr.ch
Abstract. The proliferation of semantic data on the Web requires RDF database systems to constantly improve their scalability and transactional efficiency. At the same time, users are increasingly interested in investigating or visualizing large collections of online data by performing complex analytic queries. This paper introduces a novel database system for RDF data management called dipLODocus[RDF], which supports both transactional and analytical queries efficiently. dipLODocus[RDF] takes advantage of a new hybrid storage model for RDF data based on recurring graph patterns. In this paper, we describe the general architecture of our system and compare its performance to state-of-the-art solutions for both transactional and analytic workloads.
1
Introduction
Despite many recent efforts, the lack of efficient infrastructures to manage RDF data is often cited as one of the key problems hindering the development of the Semantic Web. Last year at ISWC, for instance, the two industrial keynote speakers (from the New York Times and Facebook) pointed out that the lack of an open-source, efficient and scalable alternative to MySQL for RDF data was the number one problem of the Semantic Web. The Semantic Web community is not the only one suffering from a lack of efficient data infrastructures. Researchers and practitioners in many other fields, from business intelligence to life sciences or astronomy, are currently crumbling under gigantic piles of data they cannot manage or process. The current crisis in data management is from our perspective the result of three main factors: i) rapid advances in CPU and sensing technologies, resulting in very cheap and efficient processes to create data; ii) relatively slow advances in primary, secondary and tertiary storage (PCM memories and SSD disks are still expensive, while modern SATA disks are singularly slow, with seek times typically between 5ms and 10ms); and iii) the emergence of new data models and new query types (e.g., graph reachability queries, analytic queries) that cannot be handled properly by legacy systems. This situation resulted in a variety of novel approaches to solve specific problems, for large-scale batch-processing [10], data warehousing [20], or array processing [8].
L. Aroyo et al. (Eds.): ISWC 2011, Part I, LNCS 7031, pp. 778–793, 2011. © Springer-Verlag Berlin Heidelberg 2011
Nonetheless, we believe that the data infrastructure problem is particularly acute for the Semantic Web, because of its peculiar and complex data model (which can be modeled as a constrained graph, as a ternary or n-ary relation, or as an object-oriented model depending on the context) and of the very different types of queries a typical SPARQL end-point must support (from relatively simple transactional queries to elaborate business intelligence queries). The recent emergence of distributed Linked Open Data processing and visualization applications relying on complex analytic and aggregate queries is aggravating the problem even further. In this paper, we propose dipLODocus[RDF], a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a “vertical” analytics perspective (by storing compact lists of literal values for a given attribute). dipLODocus[RDF] trades insert complexity for analytics efficiency: isolated inserts and simple look-ups are relatively complex in our system due to our hybrid model, which on the other hand enables us to considerably speed up complex queries. The rest of this paper is structured as follows: we start by discussing related work in Section 2. Section 3 gives a high-level overview of our system and introduces our hybrid storage scheme. We give a more detailed description of the various data structures in dipLODocus[RDF] in Section 4. We describe how our system handles common operations such as bulk inserts, updates, and various types of queries in Section 5. Section 6 is devoted to a performance evaluation study, where we compare the performance of dipLODocus[RDF] to state-of-the-art systems both for a popular Semantic Web benchmark and for various analytic queries. Finally, we conclude in Section 7.
2
Related Work
Approaches for storing RDF data can be broadly categorized into three subcategories: triple-table approaches, property-table approaches, and graph-based approaches. Many approaches have been proposed to optimize RDF query processing; we list below some of the most popular approaches and systems. We refer the reader to recent surveys of the field (such as [15], [13], or [16]) for a more comprehensive coverage.
Triple-Table Storage: since RDF data can be seen as sets of subject-predicate-object triples, many early approaches used a giant triple table to store all data. Our GridVine [2,7] system, for instance, uses a triple-table storage approach to distribute RDF data over decentralized P2P networks using the P-Grid [1] distributed hash-table. More recently, Hexastore [21] suggests indexing RDF data using six possible indices, one for each permutation of the set of columns in the triple table, leading to shorter response times but also a worst-case five-fold increase in index space. Similarly, RDF-3X [17] creates various indices from a giant triple-table, including indices based on the six possible permutations of the
triple columns, and aggregate indices storing only two out of the three columns. All indices are heavily compressed using dictionary encoding and byte-wise compression mechanisms. The query executor of RDF-3X implements a dedicated cost-model to optimize join orderings and determine the cheapest query plan automatically.
Property-Table Storage: various approaches propose to speed up RDF query processing by considering structures clustering RDF data based on their properties. Wilkinson et al. [22] propose the use of two types of property tables: one containing clusters of values for properties that are often accessed together, and one exploiting the type property of subjects to cluster similar sets of subjects together in the same table. Chong et al. [6] also suggest the use of property tables as materialized views, complementing a primary storage using a triple-table. Going one step further, Abadi et al. suggest a fully-decomposed storage model for RDF: the triples are in that case rewritten into n two-column tables where n is the number of unique properties in the data. In each of these tables, the first column contains the subjects that define that property and the second column contains the object values for those subjects. The authors then advocate the use of a column-store to compactly store data and efficiently resolve queries.
Graph-Based Storage: a number of further approaches propose to store RDF data by taking advantage of its graph structure. Yan et al. [23] suggest dividing the RDF graph into subgraphs and building secondary indices (e.g., Bloom filters) to quickly detect whether some information can be found inside an RDF subgraph or not. BitMat [4] is an RDF data processing system storing the RDF graph as a compressed bit matrix structure in main memory.
gStore [24] is a recent system storing RDF data as a large, labeled, and directed multi-edge graph; SPARQL queries are then executed by being transformed into subgraph matching queries, which are efficiently matched to the graph using a novel indexing mechanism.
Several of the academic approaches listed above have also been fully implemented, open-sourced, and used in a number of projects (e.g., GridVine¹, Jena², and RDF-3X³). A number of more industry-oriented efforts have also been proposed to store and manage RDF data. Virtuoso⁴ is an object-relational database system offering bitmap indices to optimize the storage and processing of RDF data. Sesame⁵ [5] is an extensible architecture supporting various back-ends (such as PostgreSQL) to store RDF data using an object-relational schema. Garlik’s 4Store⁶ is a parallel RDF database distributing triples using a round-robin approach. It stores triples in triple-tables (or quadruple-tables more precisely). BigOWLIM⁷ is a scalable RDF database taking advantage of ordered indices and data statistics to optimize queries. AllegroGraph⁸, finally, is a native RDF database engine based on a quadruple storage.
1 http://lsirwww.epfl.ch/GridVine/
2 http://jena.sourceforge.net/
3 http://www.mpi-inf.mpg.de/neumann/rdf3x/
4 http://virtuoso.openlinksw.com/
5 http://www.openrdf.org/
6 http://4store.org/
7 http://www.ontotext.com/owlim/
3
System Rationale
Our own storage system in dipLODocus[RDF] can be seen as a hybrid structure extending several of the ideas from above. Our system is built on three main structures: RDF molecule clusters (which can be seen as hybrid structures borrowing both from property tables and RDF subgraphs), template lists (storing literals in compact lists as in a column-oriented database system) and an efficient hash-table indexing URIs and literals based on the clusters they belong to. Figure 1 gives a simple example of a few molecule clusters (storing information about students) and of a template list (compactly storing lists of student IDs). Molecules can be seen as horizontal structures storing information about a given object instance in the database (like rows in relational systems). Template lists, on the other hand, store vertical lists of values corresponding to one type of object (like columns in a relational system). Hence, we say that dipLODocus[RDF] is a hybrid system, following the terminology used for approaches such as Fractured Mirrors [19] or our own recent Hyrise system [12].
[Figure 1: diagram showing template list T01 and molecule clusters Cluster01, Cluster02, and Cluster03, each rooted at a student node with Is_a, StudentID, FirstName, LastName, and Takes edges]
Fig. 1. The two main data structures in dipLODocus[RDF]: molecule clusters, storing in this case RDF subgraphs about students, and a template list, storing a list of literal values corresponding to student IDs
8 http://www.franz.com/agraph/allegrograph/
Molecule clusters are used in two ways in our system: to logically group sets of related URIs and literals in the hash-table (thus pre-computing joins), and to physically co-locate information relating to a given object on disk and in main memory to reduce disk and CPU cache latencies. Template lists are mainly used for analytics and aggregate queries, as they allow long lists of literals to be processed efficiently. We give more detail about both structures below as we introduce the overall architecture of our system.
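The interplay of the three structures can be illustrated with a toy in-memory sketch in Python. It is a simplification under assumptions, not the system's on-disk layout: the class `HybridStore` and its methods are hypothetical names, template lists are keyed by predicate rather than by template ID, and the sample student data is invented for illustration.

```python
from collections import defaultdict


class HybridStore:
    """Toy model of molecule clusters, a term index, and template lists."""

    def __init__(self):
        self.clusters = defaultdict(list)        # root -> triples of the molecule
        self.term_index = defaultdict(set)       # URI or literal -> roots of clusters containing it
        self.template_lists = defaultdict(list)  # attribute -> literal values (vertical list)

    def add(self, root, triple, literal_object=False):
        """Store `triple` in the molecule rooted at `root`."""
        subject, predicate, obj = triple
        self.clusters[root].append(triple)
        self.term_index[subject].add(root)
        self.term_index[obj].add(root)
        if literal_object:
            self.template_lists[predicate].append(obj)

    def molecule(self, root):
        """Horizontal access: everything known about one object instance."""
        return list(self.clusters.get(root, []))

    def clusters_with(self, *terms):
        """Roots of clusters containing all given terms (a pre-computed join)."""
        sets = [self.term_index.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()

    def column(self, predicate):
        """Vertical access: all literal values of one attribute."""
        return list(self.template_lists.get(predicate, []))
```

In this sketch, `clusters_with("Course8", "Paul")` answers a two-pattern lookup by intersecting sets from the term index, without touching the triples, and `column("StudentID")` yields one contiguous list of literals suited to aggregation.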
4
Architecture
Figure 2 gives a simplified architecture of dipLODocus[RDF]. The Query Processor receives the query from the client, parses it, optimizes it, and creates a query plan to execute it. The hash-table uses a lexicographical tree to assign a unique numeric key to each URI, stores metadata associated with that key, and points to two further data structures: the molecule clusters, which are managed by the Cluster Manager and store RDF sub-graphs, and the template lists, managed by the Template Manager. All data structures are stored on disk and are retrieved using a page manager and buffered operations to amortize disk seeks. Those components are described in greater detail below.
[Figure 2: block diagram with the labels Queries & Inserts, Results, Workload, Query Processor, Query Optimizer, GetLists/GetClusters, Template Manager, Cluster Manager, Update Template, Update Cluster, URI, key, Hash-Table, Clusters, Template Lists, Buffered operations, disks]
Fig. 2. The architecture of dipLODocus[RDF]
4.1 Query Processor
The query processor receives inserts, updates, deletes and queries from the clients. It offers a SPARQL [18] interface and supports the most common features of the SPARQL query language, including conjunctions and disjunctions of triple patterns and aggregate operations. We use the RASQAL RDF Query Library⁹ to parse both incoming triples serialized in XML and SPARQL queries. New triples are then handed to the Template and Cluster managers to be inserted into the database. Incoming queries, after being parsed, are rewritten as query trees in order to be executed. The query trees are passed to the Query Optimizer, which rewrites the queries to optimize their execution plans (cf. Section 5 below). Finally, the queries are resolved bottom-up, by executing the leaf-operators first in the query tree. Examples of query processing are given below in Section 5.
4.2 Template Manager
One of the key innovations of dipLODocus[RDF] revolves around the use of declarative storage patterns [9] to efficiently co-locate large collections of related values on disk and in main memory. When setting up a new database, the database administrator may give dipLODocus[RDF] a few hints as to how to store the data on disk: the administrator can give a list of triple patterns to specify the root nodes, both for the template lists and the molecule clusters (see for instance Figure 1 above, where “Student” is the root node of the molecule, and “StudentID” is the root node for the template list). Cluster roots are used to determine which clusters to create: a new cluster is created for each instance of a root node in the database. The clusters contain all triples departing from the root node when traversing the graph, until another instance of a root node is crossed (thus, one can join clusters based on the root nodes). Template roots are used to determine which literals to store in template lists. In case the administrator gives no hint about the root nodes, the system inspects the templates created by the template manager (see below) and takes all classes as molecule roots and all literals as template roots (this is for example the case for the performance evaluation we describe in Section 6). Optimizing the automated selection of root nodes based on samples of the input data and an approximate query workload is a typical automated design problem [3] and is the subject of future work.
Based on the storage patterns, the template manager handles two main operations in our system: i) it maintains a schema of triple templates in main memory and ii) it manages template lists. Whenever a new triple enters the system, it is passed to the template manager, which associates template IDs corresponding to the triple by considering the type of the subject, the predicate, and the type of the object.
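The template assignment just described can be sketched as follows. This is a minimal illustration under assumptions: it hands out one sequential ID per distinct combination of subject type, predicate, and object type, and treats any term without a known class as a literal. The names `TemplateSchema`, `classify`, and `assign` are hypothetical, as is the sample type map.

```python
class TemplateSchema:
    """In-memory map from (subject type, predicate, object type) to a template ID."""

    def __init__(self):
        self._ids = {}

    def template_id(self, subject_type, predicate, object_type):
        key = (subject_type, predicate, object_type)
        if key not in self._ids:
            # A pattern not seen before: extend the schema with the next free ID.
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]


def classify(term, types):
    """Type of a term: its known class, or "Literal" if none (assumption)."""
    return types.get(term, "Literal")


def assign(schema, triple, types):
    """Template ID for one incoming triple."""
    subject, predicate, obj = triple
    return schema.template_id(
        classify(subject, types), predicate, classify(obj, types)
    )
```

Two triples such as (Student032, FirstName, "Joe") and (Student167, FirstName, "Paul") map to the same template ID here because they share subject type, predicate, and object type, while a triple with a different predicate receives a new ID.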
Each distinct list of “(subject-type, predicate, object-type)” defines a new triple template. The triple templates play the role of an instance-based RDF schema in our system. We don’t rely on the explicit RDF schema to define the templates, since a large proportion of constraints (e.g., domains, ranges) are often omitted in the schema (as is for example the case for the data we consider in our experiments, see Section 6). In case a new template is detected (e.g., a new predicate is used), then the template manager updates its
Footnote 9: http://librdf.org/rasqal/
M. Wylot et al.
[Figure 3 diagram, text labels only: incoming triple (Student032, FirstName, "Joe"); steps “match” and “insert”; hash-table entry hash("Joe") -> TID5: (cluster032); Cluster032 and Template List 5, each containing hash("Joe"); schema template with template IDs (TIDs) 1 to 6 over the nodes Student Class, Student Instance and Course Instance, the StudentID, FirstName and LastName literals, and the edges Is_a and Takes.]
Fig. 3. An insert using templates: an incoming triple (upper left) is matched to the current RDF template of the database (right), and inserted into the hash-table, a cluster, and a template list
in-memory triple template schema and inserts new template IDs to reflect the new pattern it discovered. Figure 3 gives an example of a template. In case of very inhomogeneous data sets containing millions of different triple templates, wildcards can be used to regroup similar templates (e.g., “Student - likes - *”). Note that this is very rare in practice, since all the datasets we have encountered so far (even those in the LOD cloud) typically contain a few thousand triple templates at most. The triple is then passed to the Cluster Manager, which inserts it in one or several molecules. If the triple’s object corresponds to a root template list, the object is also inserted into the template list corresponding to its template ID. Template lists store literal values along with the key of their corresponding cluster root. They are stored compactly and segmented in sublists, both on disk and in main-memory. Template lists are typically sorted by considering a lexical order on their literal values, though other orders can be specified by the database administrator when declaring the template roots. In that sense, template lists are reminiscent of segments in a column-oriented database system. Finally, the triple is inserted into the hash-table as well (see Figure 3 for an example).
4.3 Cluster Manager
The Cluster Manager takes care of updating and querying the molecule clusters. When receiving a new triple from the Template Manager, the cluster manager inserts it in the corresponding cluster(s) by interrogating the hash-table (see Figure 3). In case the corresponding cluster does not exist yet, the Cluster Manager creates a new molecule cluster, inserts the triple in the molecule, and inserts the cluster in the list of clusters it maintains. Similarly to the template lists, the molecule clusters are serialized in a very compact form, both on disk and in main-memory. Each cluster is composed of two parts: a list of offsets, containing for each template ID in the molecule the offset
dipLODocus[RDF]
at which the keys corresponding to the template ID are stored, and the list of keys themselves. Thus, the size of a molecule, both on disk and in main-memory, is #TEMPLATES + (KEY_SIZE ∗ #TRIPLES), where KEY_SIZE is the size of a key (in bytes), #TEMPLATES is the number of template IDs in the molecule, and #TRIPLES is the number of triples in the molecule (we note that this storage structure is much more compact than a standard list of triples). To retrieve a given piece of information in a molecule, the system first determines the position of the template ID corresponding to the information sought in the molecule (e.g., “FirstName” is the sixth template ID for the “Student” molecule in Figure 3 above). It then jumps to the offset corresponding to that position (e.g., the 6th offset in our example), reads that offset and the offset of the following template ID, and finally retrieves all the keys/values between those two offsets to get all the values corresponding to that template ID in the molecule.
4.4 Hash-Table
The hash-table is the central index in dipLODocus[RDF]; it uses a lexicographical tree to parse each incoming URI or literal and assign it a unique numeric key value. The hash-table then stores, for every key and every template ID, an ordered list of all the cluster IDs containing the key (e.g., “key 10011, corresponding to a Course object [template ID 17], appears in clusters 1011, 1100 and 1101”; see also Figure 3 for another example). This may sound like a rather peculiar way of indexing values, but we show below that it actually allows us to execute many queries very efficiently, simply by reading or intersecting such lists in the hash-table directly.
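The two storage structures of Sections 4.3 and 4.4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the system's code: the class names `Molecule` and `HashTable` are hypothetical, a Python dictionary stands in for the lexicographical tree that assigns keys, and each molecule carries its own template order for self-containment.

```python
from bisect import insort
from collections import defaultdict


class Molecule:
    """Hypothetical cluster layout: one offset per template ID, then a flat key list."""

    def __init__(self, template_ids, keys_by_template):
        self.template_ids = list(template_ids)  # order of template IDs in the molecule
        self.offsets = []                       # offsets[i] = start of keys for template i
        self.keys = []                          # all keys, grouped by template ID
        for tid in self.template_ids:
            self.offsets.append(len(self.keys))
            self.keys.extend(keys_by_template.get(tid, []))

    def values(self, tid):
        # Find the position of the template, then read between two adjacent offsets.
        pos = self.template_ids.index(tid)
        start = self.offsets[pos]
        end = self.offsets[pos + 1] if pos + 1 < len(self.offsets) else len(self.keys)
        return self.keys[start:end]


class HashTable:
    """Hypothetical central index: (key, template ID) -> ordered list of cluster IDs."""

    def __init__(self):
        self._key_of = {}                    # URI or literal -> numeric key (dict, by assumption)
        self._clusters = defaultdict(list)   # (key, tid) -> sorted cluster IDs

    def key(self, term):
        return self._key_of.setdefault(term, len(self._key_of) + 1)

    def add(self, term, tid, cluster_id):
        postings = self._clusters[(self.key(term), tid)]
        if cluster_id not in postings:
            insort(postings, cluster_id)     # keep the list ordered

    def clusters(self, term, tid):
        return list(self._clusters.get((self._key_of.get(term), tid), []))


molecule = Molecule([1, 5, 4], {1: [10], 5: [20, 21], 4: [30]})
first_names = molecule.values(5)

index = HashTable()
for cluster_id in (1101, 1011, 1100):
    index.add("Course0", 17, cluster_id)
course_clusters = index.clusters("Course0", 17)
```

The lookup in `Molecule.values` mirrors the two-offset read described in Section 4.3; for the last template in a molecule, the sketch uses the end of the key list in place of a following offset.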
5 Common Operations
Given the main components and data structures described above, we describe below how common operations such as inserts, updates, and triple pattern queries are handled by our system.
5.1 Bulk Inserts
Inserts are relatively complex and costly in dipLODocus[RDF], but can be executed in a fairly efficient manner when considered in bulk; this is a tradeoff we are willing to make in order to speed up complex queries using our various data structures (see below), especially in a Semantic Web or LOD context where isolated inserts or updates are, in our experience, rather infrequent. Bulk insert is an n-pass algorithm (where n is the deepest level of a molecule) in dipLODocus[RDF], since we need to construct the RDF molecules in the clusters (i.e., we need to materialize triple joins to form the clusters). In a first pass, we identify all root nodes and their corresponding template IDs, and create all clusters. The subsequent passes are used to join triples to the root nodes (hence, the student clusters depicted in Figure 1 above are built in two phases, one for the Student root node, and one for the triples directly connected to the Student).
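The n-pass construction can be approximated by the sketch below. It is only an in-memory illustration of the idea: the function name `build_clusters`, the `rdf:type` predicate string, and the representation of triples as Python tuples are assumptions, and template lists, the hash-table, and paging are left out.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed name of the typing predicate in this sketch


def build_clusters(triples, root_classes, passes):
    """Group triples into one cluster per root instance using `passes` scans."""
    # Pass 1: identify the root instances and create one empty cluster for each.
    roots = {s for s, p, o in triples if p == RDF_TYPE and o in root_classes}
    clusters = {root: [] for root in roots}
    # frontier maps a node to the roots whose molecule has reached that node.
    frontier = {root: {root} for root in roots}
    # Subsequent passes: join triples to the nodes reached in the previous pass.
    for _ in range(passes - 1):
        next_frontier = defaultdict(set)
        for s, p, o in triples:
            for root in frontier.get(s, ()):
                if (s, p, o) not in clusters[root]:
                    clusters[root].append((s, p, o))
                # Stop expanding once another root instance is crossed.
                if o not in roots:
                    next_frontier[o].add(root)
        frontier = next_frontier
    return clusters


triples = [
    ("Student032", RDF_TYPE, "Student"),
    ("Student032", "FirstName", "Joe"),
    ("Student032", "Takes", "Course0"),
    ("Student032", "Advisor", "Prof1"),
    ("Prof1", "LastName", "Doe"),
    ("Course0", RDF_TYPE, "Course"),
    ("Course0", "Title", "Databases"),
]
two_pass = build_clusters(triples, {"Student", "Course"}, passes=2)
three_pass = build_clusters(triples, {"Student", "Course"}, passes=3)
```

With two passes the Student032 cluster holds only the triples directly connected to the student; a third pass also pulls in the triple about Prof1, which is not a root instance, while Course0 is never expanded inside the student cluster because it is a root itself.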
During this operation, we also update the template lists and the hash-table incrementally. Bulk inserts have been highly optimized in dipLODocus[RDF], and use an efficient page-manager to execute inserts for large datasets that cannot be kept in main-memory.
5.2 Updates
As for other hybrid or analytic systems, updates can be relatively expensive in dipLODocus[RDF]. We distinguish between two kinds of updates: in-place and complex updates. In-place updates are punctual updates on literal values; they can be processed directly in our system by updating the hash-table, the corresponding cluster, and the template lists if necessary. Complex updates are updates modifying object properties in the molecules. They are more complex to handle than in-place updates, since they might require a rewrite of a list of clusters in the hash-table, and a rewrite of a list of keys in the molecule clusters. To allow for efficient operations, complex updates are treated like updates in a column-store (see [20]): the corresponding structures are flagged in the hash-table, and new structures are maintained in write-optimized structures in main-memory. Periodically, the write-optimized structures are merged with the hash-table and the clusters on disk.
5.3 Queries
Query processing in dipLODocus[RDF] is very different from previous approaches to executing queries on RDF data, because of the three peculiar data structures in our system: a hash-table associating URIs and literals to template IDs and cluster lists, clusters storing RDF molecules in a very compact fashion, and template lists storing compact lists of literals. We describe below how a few common queries are handled in dipLODocus[RDF].
Triple Patterns: Triple patterns are relatively simple in dipLODocus[RDF]: they are usually resolved by looking for a bound variable (URI) in the hash-table, retrieving the corresponding cluster numbers, and finally retrieving values from the clusters when necessary. Conjunctions and disjunctions of triple patterns can be resolved very efficiently in our system. Since the RDF nodes are logically grouped by clusters in the hash-table, it is typically sufficient to read the corresponding list of clusters in the hash-table (e.g., for “return all students following Course0”), or to intersect or take the union of several lists of clusters in the hash-table (e.g., for “return all students following Course0 whose last names are Doe”) to answer the queries. In most cases, no join operation is needed since joins are implicitly materialized in the hash-table and in the clusters. When more complex joins occur, dipLODocus[RDF] resolves them using standard hash-join operators.
Molecule Queries: Molecule queries, or queries retrieving many values/instances around a given instance (for example for visualization purposes), are also extremely efficient in our system. In most cases, the hash-table is invoked to find the corresponding cluster, which then contains all the corresponding values. For bigger scopes (such as the ones we consider in our experimental evaluation below), our system can efficiently join clusters based on the various root nodes they contain.
Aggregates and Analytics: Finally, aggregate and analytic queries can also be resolved very efficiently by our system. Many analytic queries can be solved by first intersecting lists of clusters in the hash-table, and then looking up values in the remaining molecule clusters. Large analytic or aggregate queries on literals (such as our third analytic query below, returning the names of all graduate students) can be resolved extremely efficiently by taking advantage of template lists (containing compact and sorted lists of literal values for a given template ID), or by filtering template lists based on lists of cluster IDs retrieved from the hash-table.
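Resolving conjunctions and disjunctions by intersecting or merging ordered cluster lists, as described for triple patterns above, can be sketched as follows. The function names and the sample cluster IDs are made up for illustration, and both inputs are assumed to be sorted and free of duplicates.

```python
def intersect_sorted(a, b):
    """Cluster IDs present in both ordered lists (conjunction of two patterns)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union_sorted(a, b):
    """Cluster IDs present in either ordered list (disjunction of two patterns)."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j >= len(b) or (i < len(a) and a[i] < b[j]):
            nxt = a[i]
            i += 1
        elif i >= len(a) or b[j] < a[i]:
            nxt = b[j]
            j += 1
        else:  # same cluster ID in both lists: emit it once
            nxt = a[i]
            i += 1
            j += 1
        out.append(nxt)
    return out


# Hypothetical posting lists read from the hash-table.
follows_course0 = [1011, 1100, 1101]
last_name_doe = [1007, 1100, 1101, 1200]

both = intersect_sorted(follows_course0, last_name_doe)
either = union_sorted(follows_course0, last_name_doe)
```

Because both lists are already ordered, each operation is a single linear merge, which is consistent with the claim that many queries need no explicit join.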
6 Performance Evaluation
To evaluate the performance of our system, we compared it to various RDF database systems. The details of the hardware platform, the data sets and the workloads we used are given below.
6.1 Hardware Platform
All experiments were run on an HP ProLiant DL360 G7 server with two quad-core Intel Xeon E5640 processors and 6GB of DDR3 RAM, running Linux Ubuntu 10.10 (Maverick Meerkat). All data were stored on a recent 1.4 TB Serial ATA disk.
6.2 Data Sets
The benchmark we used is the Lehigh University Benchmark (LUBM) [14], one of the oldest and most popular benchmarks for Semantic Web data. It provides an ontology describing universities together with a data generator and fourteen queries. We used two data sets, the first one consisting of ten LUBM universities (1’272’814 distinct triples, 315’003 distinct strings), and the second regrouping one hundred universities (13’876’209 distinct triples, 3’301’868 distinct strings).
6.3 Workload
We compared the execution times for the LUBM queries and for three analytic queries inspired by an RDF analytic benchmark we recently proposed (the BowlognaBench benchmark [11]). LUBM queries are criticized by some for their reasoning coverage; this was not an issue in our case, since we focused on RDF DB query processing rather than on reasoning capabilities. We keep an in-memory representation of subsumption trees in dipLODocus[RDF] and rewrite queries automatically to support subclass inference for the LUBM queries. We manually rewrote inference queries for the systems that do not support such functionalities (e.g., RDF-3X). The three additional analytic/aggregate queries that we considered are as follows: 1) a query returning the professor who supervises the most Ph.D. students, 2) a query returning a big molecule containing all triples within a scope of 2 of Student0, and 3) a query returning all graduate students.
6.4 Methodology
As for other benchmarks (e.g., TPC-x; see footnote 10), we include a warm-up phase before measuring the execution time of the queries. We first run all the queries in sequence once to warm up the systems, and then repeat the process ten times (i.e., we run in total 11 batches containing all the queries in sequence for each system). We report the mean values for each query and each system below, as well as a 95% confidence interval on run times. We set the maximum time for each query to two hours (we stopped the tests if one query took more than two hours to be executed). We compared the output of all queries running on all systems to ensure that all results were correct. We tried to do a reasonable optimization job for each system, by following the recommendations given in the installation guides for each system. We did not try to optimize the systems any further, however. We performed no fine-tuning or optimization for dipLODocus[RDF]. We avoided the artifact of connecting to the server, initializing the DB from files and printing results for all systems; we measured instead the query execution times only.
6.5 Systems
We compared our prototype implementation of dipLODocus[RDF] to six other well-known database systems: Postgres, AllegroGraph, BigOWLIM, Jena, Virtuoso, and RDF-3X. We chose those systems to have different comparison points using well-known systems, and because they were all freely available on the Web. We give a few details about each system below.
Postgres: We used Postgres 8.4 with Redland RDF Library 1.0.13. Postgres is a well-known relational database, but as the numbers below show, it is not optimized for RDF storage. We could not run our 100-universities data set on it because its load time took more than one week. It also had great difficulty coping with some of the queries for the 10-universities data set. Since query execution was particularly slow for this system, we ran each query five times only and simply report the best run below.
AllegroGraph: We used AllegroGraph RDFStore 4.2.1. AllegroGraph unfortunately limits the number of triples that can be stored in the free edition, such that we could not load the big data set. It also had difficulty dealing with one query. For AllegroGraph, we prepared a SPARQL Python script using libraries supported by the vendor.
Footnote 10: http://www.tpc.org/
Fig. 4. Runtime ratios for the 10 universities data set
Fig. 5. Runtime ratios for the 100 universities data set
BigOWLIM: We used BigOWLIM 3.5.3436. OWLIM provides a Java application to run the LUBM benchmark, so we used it directly for our tests.
Jena: We used Jena-2.6.4 and the TDB-0.8.10 storage component. We created the database by using the “tdbloader” provided by Jena. We created a Java application to run and measure the execution time of each query.
Virtuoso: We used Virtuoso Open-Source Edition 6.1.3. Virtuoso supports ODBC connections, and we prepared a Python script using the PyODBC library for our queries.
RDF-3X: We used RDF-3X 0.3.5. For this system, we converted our dataset to NTriples/Turtle. We also modified the system to measure the execution time of the queries only, without taking into account the initialization of the database and with the print-outs turned off.
6.6 Results
Relative execution times for all queries and all systems are given below, in Figure 4 for 10 universities and in Figure 5 for 100 universities. Results are
Fig. 6. Runtime ratios for 10 (left) and 100 (right) universities for the analytic/aggregate queries
10 UNI -- Query Execution Time [s]
Query  Stat  dipLODocus  AllegroGraph  BigOWLIM  Virtuoso  RDF-3X    Jena
q1     AVG   1.45E-05    1.09E-02      5.37E-02  1.29E-04  9.14E-04  1.20E-03
q1     CI    6.47E-08    4.81E-05      6.27E-05  1.00E-06  2.23E-07  7.93E-06
q2     AVG   1.21E-02    1.14E+01      5.19E-01  4.96E-02  1.40E+00  2.19E-01
q2     CI    5.63E-05    2.96E-03      9.04E-04  4.72E-05  1.85E-05  1.42E-04
q3     AVG   2.09E-05    3.78E-03      2.60E-03  1.10E-03  7.91E-04  1.10E-03
q3     CI    9.57E-08    8.77E-06      1.32E-05  4.00E-06  1.11E-06  5.95E-06
q4     AVG   7.95E-05    4.62E+00      3.63E-02  1.82E-03  1.89E-03  2.20E-03
q4     CI    3.17E-07    8.49E-04      1.18E-04  1.04E-06  2.22E-07  7.93E-06
q5     AVG   5.32E-05    4.74E+00      8.19E-01  2.08E-03  1.35E-03  2.90E-03
q5     CI    2.56E-07    1.11E-03      2.89E-04  1.71E-05  3.89E-08  1.07E-05
q6     AVG   1.65E-02    1.40E+00      2.23E-01  6.22E-01  2.51E-02  5.52E-02
q6     CI    8.65E-05    3.93E-04      8.00E-04  1.58E-03  8.81E-06  2.70E-04
q7     AVG   1.22E-03    7.03E+01      5.96E+00  1.55E-03  4.82E-03  7.14E-01
q7     CI    3.21E-07    1.12E-02      2.18E-03  2.10E-06  5.04E-05  2.25E-04
q8     AVG   6.54E-03    5.09E+01      1.74E+00  5.47E-01  8.94E-03  5.43E-01
q8     CI    2.21E-05    8.28E-03      7.10E-04  1.09E-03  1.57E-06  1.47E-03
q9     AVG   6.74E-02    > 2 h         > 2 h     1.14E+00  1.83E-01  > 2 h
q9     CI    1.98E-06    -             -         4.38E-03  9.06E-05  -
q10    AVG   2.17E-05    4.80E+00      3.70E-03  8.93E-03  1.40E-03  1.00E-03
q10    CI    7.32E-08    1.15E-03      9.09E-06  7.49E-06  1.80E-06  0.00E+00
q11    AVG   6.41E-05    6.04E-02      1.11E-02  2.54E-03  1.37E-03  1.60E-03
q11    CI    2.32E-07    5.08E-04      5.95E-06  1.76E-05  3.25E-07  1.32E-05
q12    AVG   1.74E-05    8.81E-02      1.09E-02  2.14E-03  1.43E-03  2.10E-03
q12    CI    5.19E-08    8.05E-05      5.95E-06  1.93E-06  2.70E-07  5.95E-06
q13    AVG   4.76E-05    2.85E-02      5.89E-02  3.82E-03  1.06E-03  1.00E-03
q13    CI    1.18E-07    8.19E-05      5.71E-05  8.27E-07  7.82E-08  0.00E+00
q14    AVG   1.29E-02    1.17E+00      1.90E-01  5.37E-01  2.28E-02  3.62E-02
q14    CI    6.00E-05    3.79E-04      2.17E-04  1.91E-03  9.93E-06  1.82E-04
a1     AVG   1.16E-03    not run       not run   7.50E-01  9.93E-01  1.07E+03
a1     CI    2.24E-06    -             -         4.90E-04  3.90E-03  3.42E-02
a2     AVG   5.07E-05    not run       not run   7.85E-04  1.28E-02  1.10E-03
a2     CI    1.72E-07    -             -         4.62E-06  1.25E-05  5.95E-06
a3     AVG   1.07E-02    not run       not run   4.13E-01  1.10E-02  8.01E-02
a3     CI    1.57E-07    -             -         2.30E-03  2.49E-06  7.00E-04
Load time    31s         13s           50s       88s       16s       98s
Size on disk 87MB        696MB         209MB     140MB     66MB      118MB
100 UNI -- Query Execution Time [s]
Query  Stat  dipLODocus  BigOWLIM  Virtuoso  RDF-3X    Jena
q1     AVG   1.73E-05    5.21E-02  4.38E-04  6.10E-04  1.30E-03
q1     CI    5.17E-08    5.35E-05  5.88E-07  3.53E-07  9.09E-06
q2     AVG   1.27E-01    5.94E+00  4.83E+00  1.55E+01  2.27E+00
q2     CI    6.41E-05    3.79E-03  8.82E-04  1.52E-04  4.09E-04
q3     AVG   3.14E-05    2.90E-03  3.75E-03  6.68E-04  1.00E-03
q3     CI    2.54E-08    1.07E-05  6.01E-07  1.56E-07  0.00E+00
q4     AVG   8.33E-05    3.20E-01  2.91E-03  1.90E-03  2.20E-03
q4     CI    7.15E-08    5.44E-04  3.32E-06  1.29E-07  7.93E-06
q5     AVG   5.34E-05    8.71E+00  5.65E-03  1.20E-03  3.10E-03
q5     CI    6.16E-08    2.28E-03  4.37E-06  1.61E-07  2.07E-05
q6     AVG   1.42E-01    1.77E+00  1.56E+01  3.77E-01  7.61E-01
q6     CI    1.64E-05    1.15E-03  1.25E-02  5.24E-05  7.28E-03
q7     AVG   2.63E-03    6.34E+01  3.11E-03  5.18E-02  7.46E+00
q7     CI    3.39E-05    1.55E-02  2.09E-06  9.53E-04  1.54E-03
q8     AVG   6.34E-03    1.49E+01  3.80E+00  2.36E-02  5.26E+00
q8     CI    1.71E-06    3.93E-03  7.60E-04  6.69E-06  5.19E-03
q9     AVG   2.61E-01    > 2 h     1.45E+01  3.01E+00  > 2 h
q9     CI    2.29E-05    -         1.92E-03  1.07E-03  -
q10    AVG   1.68E-05    2.80E-03  8.54E-02  1.22E-03  1.20E-03
q10    CI    1.19E-08    7.93E-06  3.59E-06  1.49E-07  7.93E-06
q11    AVG   6.00E-05    1.18E-02  1.83E-01  2.35E-03  1.50E-03
q11    CI    1.54E-08    7.93E-06  4.72E-05  7.76E-07  9.91E-06
q12    AVG   1.79E-05    1.25E-02  8.81E-02  1.06E-03  2.30E-03
q12    CI    5.95E-09    3.68E-05  8.23E-05  1.06E-06  9.09E-06
q13    AVG   5.62E-04    1.10E-01  2.91E-02  1.11E-03  1.30E-03
q13    CI    1.99E-08    2.39E-04  5.56E-05  1.01E-07  9.09E-06
q14    AVG   1.41E-01    6.68E-01  1.33E+01  3.27E-01  5.98E-01
q14    CI    2.18E-06    1.21E-03  6.99E-03  1.30E-04  6.59E-03
a1     AVG   1.04E-02    not run   1.45E+01  9.93E+01  > 2 h
a1     CI    1.22E-07    -         6.52E-04  1.05E-03  -
a2     AVG   6.50E-05    not run   3.19E-03  1.16E-01  1.40E-03
a2     CI    1.54E-08    -         2.35E-06  1.21E-04  9.71E-06
a3     AVG   1.55E-01    not run   6.72E+00  1.49E-01  7.25E-01
a3     CI    2.65E-07    -         1.94E-03  3.61E-05  2.88E-03
Load time    427s        748s      914s      214s      1146s
Size on disk 913MB       2012MB    772MB     694MB     1245MB
(> 2 h = longer than two hours; - = no value)
Fig. 7. Absolute query execution and load times [s], plus size of the databases on disk for both data sets
given as runtime ratios, with dipLODocus[RDF] taken as the basis for ratio 1.0 (i.e., a bar indicating 752.3 means that the execution of that query on that system took 752.3 times longer than the dipLODocus[RDF] execution). Figure 6 gives relative execution times for analytics executed on a selection of the fastest systems. Absolute times with confidence intervals at 95%, database sizes on disk and load times are given in Figure 7 for both datasets. We observe that dipLODocus[RDF] is generally speaking very fast for bulk inserts, for LUBM queries, and especially for analytic queries. dipLODocus[RDF] is not the fastest system for inserts, and produces slightly larger databases on disk than some other systems (like RDF-3X), but performs very well overall for all queries. Our system is on average 30 times faster than the fastest RDF data management system we have considered (i.e., RDF-3X) for LUBM queries, and on average 350 times faster than the fastest system (Virtuoso) on analytic queries. It is also very scalable (both bulk insert and query processing scale gracefully from 10 to 100 universities).
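As a small worked illustration of how the reported numbers relate, the sketch below derives runtime ratios from absolute times and summarizes ten measured runs by a mean and a 95% confidence half-width. The q1 times for the 10-universities data set are taken from Figure 7; the function names are hypothetical, and the use of a Student-t interval is an assumption, since the text does not state how the confidence intervals were computed.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262


def mean_and_ci95(run_times):
    """Mean and 95% confidence half-width of ten measured runs (assumed method)."""
    if len(run_times) != 10:
        raise ValueError("this sketch hard-codes the critical value for 10 runs")
    half_width = T_CRIT_95_DF9 * stdev(run_times) / sqrt(len(run_times))
    return mean(run_times), half_width


def runtime_ratios(avg_times, baseline="dipLODocus"):
    """Express each system's average time relative to the baseline system."""
    base = avg_times[baseline]
    return {system: t / base for system, t in avg_times.items()}


# Average q1 times [s] on the 10-universities data set, as listed in Figure 7.
q1_avg = {
    "dipLODocus": 1.45e-05,
    "Virtuoso": 1.29e-04,
    "RDF-3X": 9.14e-04,
    "Jena": 1.20e-03,
}
ratios = runtime_ratios(q1_avg)
constant_mean, constant_half_width = mean_and_ci95([1.0] * 10)
```

For q1, for instance, the RDF-3X entry comes out at roughly 63, meaning that RDF-3X took about 63 times longer than dipLODocus[RDF] on that query.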
7 Conclusions
In this paper, we have described dipLODocus[RDF], a new RDF management system based on a hybrid storage model and RDF templates to execute various kinds of queries very efficiently. In our performance evaluation, dipLODocus[RDF] is on average 30 times faster than the fastest RDF data management system we have considered (i.e., RDF-3X) on LUBM queries, and on average 350 times faster than the fastest system we have considered on analytic queries. More importantly, dipLODocus[RDF] is the only system to consistently show low processing times for all the queries we have considered (i.e., our system is the only system able to answer each of the queries we considered in less than one second), thus making it an extremely versatile RDF management system capable of efficiently supporting both short and long-tail queries in real deployments. This impressive performance can be explained by several salient features of our system, including: its extremely compact structures based on molecule templates to store data, its redundant structures to optimize different types of operations, its very efficient ways of coping with disk and memory reads (avoiding seeks and memory jumps as much as possible since they are extremely expensive on modern machines), and its way of materializing various joins in all its data structures. This performance is counterbalanced by relatively complex and expensive updates and inserts, which can however be optimized if considered in bulk. In the near future, we plan to work on cleaning, proof-testing, and extending our code base to deliver an open-source release of our system as soon as possible (see footnote 11). We also have longer-term research plans for dipLODocus[RDF]; our next research efforts will revolve around parallelizing many of the operations in the system, to take advantage of multi-core architectures on one hand, and large clusters of commodity machines on the other hand. Also, we plan to work on the automated database design problem in order to automatically suggest sets of optimal root nodes to the database administrator given some sample input data and an approximate query workload.
Acknowledgment. This work is supported by the Swiss National Science Foundation under grant number PP00P2 128459.
Footnote 11: Visit http://diuf.unifr.ch/xi/diplodocus for updates.
References
1. Aberer, K., Cudré-Mauroux, P., Datta, A., Despotovic, Z., Hauswirth, M., Punceva, M., Schmidt, R.: P-grid: A self-organizing structured p2p system. ACM SIGMOD Record 32(3) (2003)
2. Aberer, K., Cudré-Mauroux, P., Hauswirth, M., Van Pelt, T.: GridVine: Building Internet-Scale Semantic Overlay Networks. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 107–121. Springer, Heidelberg (2004)
3. Agrawal, S., Chaudhuri, S., Narasayya, V.: Automated selection of materialized views and indexes in SQL databases. In: International Conference on Very Large Data Bases, VLDB (2000)
4. Atre, M., Chaoji, V., Weaver, J., Williams, G.: Bitmat: An in-core rdf graph store for join query processing. In: Rensselaer Polytechnic Institute Technical Report (2009)
5. Broekstra, J., Kampman, A., Harmelen, F.V.: Sesame: An architecture for storing and querying rdf data and schema information. In: Semantics for the WWW. MIT Press (2001)
6. Chong, E.I., Das, S., Eadon, G., Srinivasan, J.: An efficient sql-based rdf querying scheme. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB 2005, pp. 1216–1227. VLDB Endowment (2005)
7. Cudré-Mauroux, P., Agarwal, S., Aberer, K.: Gridvine: An infrastructure for peer information management. IEEE Internet Computing 11(5) (2007)
8. Cudré-Mauroux, P., Lim, K., Simakov, R., Soroush, E., Velikhov, P., Wang, D.L., Balazinska, M., Becla, J., DeWitt, D., Heath, B., Maier, D., Madden, S., Patel, J.M., Stonebraker, M., Zdonik, S.: A Demonstration of SciDB: A Science-Oriented DBMS. Proceedings of the VLDB Endowment (PVLDB) 2(2), 1534–1537 (2009)
9. Cudré-Mauroux, P., Wu, E., Madden, S.: The Case for RodentStore, an Adaptive, Declarative Storage System. In: Biennial Conference on Innovative Data Systems Research, CIDR (2009)
10. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
11. Demartini, G., Enchev, I., Gapany, J., Cudré-Mauroux, P.: BowlognaBench: Benchmarking RDF Analytics. In: SIMPDA 2011: First International Symposium on Process Data (2011)
12. Grund, M., Krüger, J., Plattner, H., Zeier, A., Cudré-Mauroux, P., Madden, S.: Hyrise - a main memory hybrid storage engine. PVLDB 4(2), 105–116 (2010)
13. Guo, Y., Pan, Z., Heflin, J.: An Evaluation of Knowledge Base Systems for Large OWL Datasets. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 274–288. Springer, Heidelberg (2004)
14. Guo, Y., Pan, Z., Heflin, J.: Lubm: A benchmark for owl knowledge base systems. Web Semant. 3, 158–182 (2005)
15. Haslhofer, B., Roochi, E.M., Schandl, B., Zander, S.: Europeana RDF Store Report. University of Vienna, Technical Report (2011), http://eprints.cs.univie.ac.at/2833/1/europeana_ts_report.pdf
16. Liu, B., Hu, B.: An evaluation of rdf storage systems for large data applications. In: First International Conference on Semantics, Knowledge and Grid, SKG 2005, p. 59 (November 2005)
17. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment (PVLDB) 1(1), 647–659 (2008)
18. Prud’hommeaux, E., Seaborne, A. (eds.): SPARQL Query Language for RDF. W3C Candidate Recommendation (April 2006), http://www.w3.org/TR/rdf-sparql-query/
19. Ramamurthy, R., DeWitt, D.J., Su, Q.: A case for fractured mirrors. In: CAiSE 2002 and VLDB 2002. VLDB Endowment, pp. 430–441 (2002)
20. Stonebraker, M., Abadi, D.J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S.R., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., Zdonik, S.: C-Store: A Column Oriented DBMS. In: International Conference on Very Large Data Bases, VLDB (2005)
21. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment (PVLDB) 1(1), 1008–1019 (2008)
22. Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D.: Efficient rdf storage and retrieval in jena2. In: SWDB 2003, pp. 131–150 (2003)
23. Yan, Y., Wang, C., Zhou, A., Qian, W., Ma, L., Pan, Y.: Efficient indices using graph partitioning in rdf triple stores. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, pp. 1263–1266. IEEE Computer Society, Washington, DC, USA (2009)
24. Zou, L., Mo, J., Chen, L., Oezsu, M.T., Zhao, D.: gstore: Answering sparql queries via subgraph matching. PVLDB 4(8) (2011)
Extending Functional Dependency to Detect Abnormal Data in RDF Graphs
Yang Yu and Jeff Heflin
Department of Computer Science and Engineering, Lehigh University, 19 Memorial Drive West, Bethlehem, PA 18015
{yay208,heflin}@cse.lehigh.edu
Abstract. Data quality issues arise in the Semantic Web because data is created by diverse people and/or automated tools. In particular, erroneous triples may occur due to factual errors in the original data source, the acquisition tools employed, misuse of ontologies, or errors in ontology alignment. We propose that the degree to which a triple deviates from similar triples can be an important heuristic for identifying errors. Inspired by functional dependency, which has shown promise in database data quality research, we introduce value-clustered graph functional dependency to detect abnormal data in RDF graphs. To better deal with Semantic Web data, it extends the concept of functional dependency in several respects. First, there is the issue of scale, since we must consider the whole data schema instead of being restricted to one database relation. Second, it deals with multi-valued properties without the explicit value correlations that tuples specify in databases. Third, it uses clustering to consider classes of values. Focusing on these characteristics, we propose a number of heuristics and algorithms to efficiently discover the extended dependencies and use them to detect abnormal data. Experiments have shown that the system is efficient on multiple data sets and also detects many quality problems in real-world data.
Keywords: value-clustered graph functional dependency, abnormal data in RDF graphs.
1 Introduction
Data quality (DQ) research has been intensively applied to traditional forms of data, e.g. databases and web pages. The data are deemed of high quality if they correctly represent the real-world construct to which they refer. In the last decade, data dependencies, e.g. functional dependency (FD) [1] and conditional functional dependency (CFD) [2, 3], have been used in promising DQ research efforts on databases. Data quality is also critically important for Semantic Web data. A large amount of heterogeneous data is converted into RDF/OWL format by a variety of tools and then made available as Linked Data (see footnote 1). During the creation or conversion of this data, numerous data quality problems can arise.
Footnote 1: http://linkeddata.org
L. Aroyo et al. (Eds.): ISWC 2011, Part I, LNCS 7031, pp. 794–809, 2011. © Springer-Verlag Berlin Heidelberg 2011
Some works [4–6] began to focus on the quality of Semantic Web data, but such research is still in its very early stages. No previous work has utilized the fact that RDF data can be viewed as a graph database, therefore we can benefit from traditional database approaches, but we must make special considerations for RDF’s unique features. Since the Semantic Web represents many points of view, there is no objective measure of correctness for all Semantic Web data. Therefore, we focus on the detection of abnormal triples, i.e., triples that violate certain data dependencies. This in turn is used as a heuristic of a potential data quality problem. We recognize that not all abnormal data is incorrect (in fact, in some scenarios the abnormal data may be the most interesting data) and thus leave it up to the application to determine how to use the heuristic. A typical data dependency in databases is functional dependency [7]. Given a relation R, a set of attributes X in R is said to functionally determine another attribute Y , also in R, (written X → Y ), if and only if each X value is associated with precisely one Y value. An example FD zipCode → state means, for any tuple, the value of zipCode determines the value of state. RDF data also has various dependencies. But RDF data has a very different organization and FD cannot be directly applied because RDF data is not organized into relations with a fixed set of attributes. We propose value-clustered graph functional dependency (VGFD) based on the following thoughts. First, FD is formally defined over one entire relation. However RDF data can be seen as extremely decomposed tables where each table is a set of triples for a single property. Thus we must look for dependencies that cross these extremely decomposed tables and extend the concept of dependency from a single database relation to a whole data set. Second, the correlation between values is trivially determined in a database of relational tuples. 
But in RDF data, it is non-trivial to determine the correlation, especially for multi-valued properties. For example, in DBPedia, the properties city and province do not have cardinality restrictions, and thus instances can have multiple values for each property. This makes sense, considering that some organizations can have multiple places. Yet finding the correlation between the different values of city and province becomes non-trivial. Third, traditionally value equality is used to determine FD. However, this is not appropriate for real world, distributed data. Consider the following: (1) For floating point numbers, rounding and measurement errors must be considered. (2) Sometimes dependencies are probabilistic in nature, and one-to-one value correspondences are inappropriate. For example, the days needed for processing an order before shipping for a certain product is usually limited to a small range but not an exact value. (3) Sometimes certain values can be grouped to form a more abstract value. In sum, our work makes the following contributions.
– we automatically find optimal clusters of values
– we efficiently discover VGFDs over clustered values
– we use the clusters and VGFDs to detect outliers and abnormal data
– we conduct experiments on three data sets that validate the system
Y. Yu and J. Heflin
The rest of the paper is as follows. Section 2 discusses related work. Section 3 describes how to efficiently discover VGFDs while Section 4 discusses categorizing property values for their use. Sections 5 and 6 give the experimental results and the conclusion.
2 Related Work
Functional dependencies are by far the most common integrity constraints for databases in the real world. They are very important when designing or analyzing relational databases. Most approaches to find FD [8–10] are mainly based on the concept of an agree set [11]. Given a pair of tuples, the agree set is all the attributes for which these tuples have the same values. Since the search for FDs occurs over a given relation and each tuple has at most one value for each attribute, each tuple can be placed into exactly one cluster where all tuples in the cluster have the same agree set with all other tuples. Agree sets are not very useful when applied to the extensions of RDF properties, which are equivalent to relations with only two attributes (i.e. the subject and object of the triple). Furthermore, many properties in RDF data are multi-valued and so the correlation between values of different properties becomes more complex. Finally, since most RDF properties are designed just for a subset of instances in the data set, an agree set-based approach will partition many instances based on null values, which are common. RDF graphs are more like graph database models. The value functional dependency (VFD) [12] defined for the object-oriented data model can have multi-valued properties on the right-hand side, e.g. title → authors. However the dependencies we envision can have multi-valued properties on both sides and our system can determine the correlation between each value in both sets. The path functional dependency (PFD) [13] defined for semantic data models considered multiple attributes on a path, however the PFD did not consider multi-valued attributes. FDXML is the FD’s counterpart in XML [14] where its left-hand side is a path starting from the XML document root which essentially is another form of a record in a database. Hartmann et al. [15] generalized the definitions of several previous FDs in XML from a graph matching perspective.
As mentioned previously, the basic equality comparison of values used in FD is limited in many situations. Algebraic constraints [16, 17] in database relations are about the algebraic relation between two columns in a database and are often used for query optimization. The algebraic relation can be +, −, ×, /. However these works are limited to numerical attribute values and the mapping function can only be defined using several algebraic operators. The reason is that numerical columns are more often indexed and queried over as selective conditions in databases than strings. In contrast, we try to find a general mapping function between the values of different properties, both numbers and strings. Additionally, for the purpose of query optimization, they focus on pairs of columns with top ranked relational significance, the major parts in each of these pairs and the data related to dependencies that is often queried over, rather than all possible pairs of properties and all pairs of values existing in the data set.
Data dependencies have recently shown promise for data quality management in databases. Bohannon et al. [1] focus on repairing inconsistencies based on standard FDs and inclusion dependencies, that is, to edit the instance via minimal value modification such that the updated instance satisfies the constraints. A CFD [2, 3] is more expressive than an FD because it can describe a dependency that only holds for a subset of the tuples in a relation, i.e., those that satisfy some condition. Fan et al. [2] gave a theoretical analysis and algorithms for computing implications and minimal cover of CFDs; Cong et al. [3], similar to Bohannon et al., focused on repairing inconsistent data. The CFD discovery problem has high complexity; it is known to be more complex than the implication problem, which is already coNP-complete [2]. In contrast to them, we are trying to both automatically find fuzzy constraints, i.e. those that hold for most of the data, and report on exceptional data for applications. Our work incorporates advantages from both FD and CFD, i.e. fast execution and the ability to tolerate exceptions. With respect to data quality on the Semantic Web, Sabou et al. [4] evaluate semantic relations between concepts in ontologies by counting the similar axioms (both explicit and entailed) in online ontologies and their derivation length. For instance data, previous evaluations mainly focused on two types of errors: explicit inconsistency with the syntax of the ontologies and logical inconsistency that can be checked by DL reasoners. However, many Linked Data ontologies do not fully specify the semantics of the terms defined, and OWL cannot specify axioms that only hold part of the time. Our work focuses on detecting abnormal semantic data by automatically discovering probabilistic integrity constraints (IC). Tao et al. [6] proposed an IC semantics for ontology modelers and suggested that it is useful for data validation purposes.
But the precursor problem of how to discover these ICs is not addressed. Furber et al. [5] also noticed that using FD could be helpful for data quality management on the Semantic Web, but do not give an automatic algorithm to find such FDs and, more importantly, direct application of FD to RDF data may not capture the unique characteristics of RDF data.
3 Discovering VGFDs
We begin with some definitions.

Definition 1. An RDF graph is $G := \langle I, L, R, E \rangle$, where the three sets $I$, $L$ and $R$ are instance, literal and relation identifiers and the set of directional edges is $E \subseteq I \times R \times (I \cup L)$. Let $\mathcal{G}$ be the set of all possible graphs and $G \in \mathcal{G}$. Let $R^- = \{r^- \mid r \in R\}$.

Definition 2. A Path $c$ in graph $G$ is a tuple $\langle I_0, r_1, I_1, ..., r_n, I_n \rangle$ where $I_i \in I$, $r_i \in R \cup R^-$, and $\forall i, 0 \le i < n$, if $r_{i+1} \in R$ then $(I_i, r_{i+1}, I_{i+1}) \in E$ or if $r_{i+1} \in R^-$ then $(I_{i+1}, r_{i+1}, I_i) \in E$; $\forall i, j$, if $i \ne j$ then $I_i \ne I_j$. Paths are acyclic and directional, but can include inverted relations of the form $r^-$.
Definition 3. A Composite Property (Pcomp) $r^\circ$ in graph $G$ is $r_1 \circ r_2 ... r_n$, where $\exists I_0, ..., I_n$ and $\langle I_0, r_1, I_1, ..., r_n, I_n \rangle$ is a Path in $G$. Let $R^\circ$ be all possible Pcomps. Given $r^\circ \in R^\circ$, $Inst(r^\circ, G) = \{\langle I_0, r^\circ, I_n \rangle \mid \langle I_0, r_1, I_1, r_2, I_2, ..., r_n, I_n \rangle$ is a Path in $G\}$. $Length(r^\circ) = n$. $\forall r \in R$, $r \in R^\circ$ and $Length(r) = 1$.

Definition 4. A Conjunctive Property (Pconj) $r^+$ in graph $G$ is a set $\{r_1, r_2, ..., r_n\}$ (written $r_1 + r_2 + ... + r_n$), where $\forall i, r_i \in R^\circ$ and $\exists I'$ s.t. $\forall 1 \le i \le n$, $\langle I', r_i, I_i \rangle \in Inst(r_i, G)$. Let $R^+$ be all possible Pconjs. $Size(r^+) = \sum_{r_i \in r^+} Length(r_i)$.

A Composite Property is a sequence of edges on a Path. The subject and object of a Pcomp are the first and last objects on the Paths consisting of this sequence of edges. Every property is a special case of Pcomp. A Conjunctive Property groups a set of Pcomps that have a common subject $I'$. Note, each original $r \in R$ is also $r \in R^\circ$ and each $r^\circ \in R^\circ$ is also $r^\circ \in R^+$.

Definition 5. Given $i \in I$ and $r^\circ \in R^\circ$, $V^\circ(i, r^\circ) = \{i' \mid \exists \langle i, r^\circ, i' \rangle \in Inst(r^\circ, G)\}$. Given $r^+ \in R^+$, $V^+(i, r^+)$ is a tuple $\langle V^\circ(i, r_1), ..., V^\circ(i, r_n) \rangle$ where $\forall j, r_j \in r^+$.

Given a Pcomp, the value function $V^\circ$ returns the objects connected with a subject through Pcomp, and given a Pconj, the value function $V^+$ returns the set of objects connected with a subject through Pconj.

Definition 6. Given $i, j \in I$ and $r^\circ \in R^\circ$, Dependency Equality (DE) between $i$ and $j$ on $r^\circ$ is: $V^\circ(i, r^\circ) \doteq V^\circ(j, r^\circ) \iff (\forall x \in V^\circ(i, r^\circ) \iff \exists y \in V^\circ(j, r^\circ), C(x) = C(y))$, where $C(x)$ is the dependency cluster of $x$ (discussed in Section 4). With a slight abuse of notation for DE, given $r^+ \in R^+$, $V^+(i, r^+) \doteq V^+(j, r^+) \iff \forall r_k \in r^+, V^\circ(i, r_k) \doteq V^\circ(j, r_k)$.

Definition 7. A value-clustered graph functional dependency (VGFD) $s$ in graph $G$ is $X \to Y$, where $X \in R^+$, $Y \in R^\circ$ and $\forall i, j \in I$, if $V^+(i, X) \doteq V^+(j, X)$ then $V^\circ(i, Y) \doteq V^\circ(j, Y)$.
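To make Definitions 3, 5 and 6 more concrete, the following Python sketch evaluates the value function for a composite property over a small edge list and compares two value sets by cluster. It is only an illustration under stated assumptions: the function names, the convention that a relation name ending in "-" denotes an inverted relation, the reading of Dependency Equality as equality of the two sets of clusters, and the toy triples are all hypothetical and not taken from the authors' system.

```python
from collections import defaultdict


def build_index(edges):
    """Index (subject, relation, object) triples for forward and inverse lookups."""
    forward, backward = defaultdict(set), defaultdict(set)
    for s, r, o in edges:
        forward[(s, r)].add(o)
        backward[(o, r)].add(s)
    return forward, backward


def values(edges, subject, comp):
    """Sketch of the value function for a composite property.

    `comp` is a list of relation names; a name ending in "-" is treated as an
    inverted relation (an assumption of this sketch). Only paths that never
    revisit a node are followed, mirroring the acyclic Paths of Definition 2.
    """
    forward, backward = build_index(edges)
    reached = set()

    def walk(node, depth, visited):
        if depth == len(comp):
            reached.add(node)
            return
        rel = comp[depth]
        if rel.endswith("-"):
            successors = backward.get((node, rel[:-1]), set())
        else:
            successors = forward.get((node, rel), set())
        for nxt in successors:
            if nxt not in visited:
                walk(nxt, depth + 1, visited | {nxt})

    walk(subject, 0, {subject})
    return reached


def dependency_equal(values_i, values_j, cluster=lambda v: v):
    """One reading of Dependency Equality: both value sets cover the same clusters.

    With the default identity `cluster`, this reduces to plain set equality.
    """
    return {cluster(x) for x in values_i} == {cluster(y) for y in values_j}


# Hypothetical toy graph.
edges = [
    ("book1", "publisher", "pubA"),
    ("book2", "publisher", "pubB"),
    ("pubA", "country", "US"),
    ("pubB", "country", "US"),
]
v1 = values(edges, "book1", ["publisher", "country"])
v2 = values(edges, "book2", ["publisher", "country"])
```

Here both books reach the same country through publisher ∘ country, so `dependency_equal(v1, v2)` holds even though their publishers differ.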
These definitions state that for all instances, if the values of the left-hand side (LHS) Pconj of a given VGFD satisfy Dependency Equality (DE), then there is a DE on the right-hand side (RHS) Pcomp. Note, due to the union rule of Armstrong’s axioms used to infer all the functional dependencies, if α → β and α → γ hold, then α → βγ holds. Therefore, it is enough to define the VGFD whose right-hand side (RHS) is each single element of a set of properties. In this work, DE includes basic equality for both object and datatype property values, transitivity of the sameAs relation for instances and clustering for datatype values. As shown in Algorithm 1, this section introduces the VGFD search (lines 8-15) and the next section introduces value clustering (lines 2-5), which is used to detect dependencies. To efficiently discover a minimum set of VGFDs which is a cover of the whole set of VGFDs, our approach proceeds level-wise. Each level $L_i$ consists of VGFDs with LHS of size i (Fig. 1 gives an example). The computation of VGFDs with smaller sets of LHS properties can be used when computing children VGFDs that have a superset of those LHS properties. A similar level-wise search was proposed for the Tane algorithm [9], but each node in Tane corresponds to a subset of our nodes whose LHS is based on single properties instead of Pcomps. In contrast, our nodes are finer grained, which leads to more opportunities for pruning.

Algorithm 1. Search_VGFDs(G, α, β, γ), G = (I, L, R, E) is a graph; α is the confidence threshold for a VGFD; β is the sampling size; γ is the threshold for pre-clustering.
 1: S ← ∅, C ← ∅
 2: for each r ∈ R s.t. r is a datatype property do
 3:     groups ← Preclustering(Range(r), γ)
 4:     C_r ← Optimal_Kmeans(Range(r), groups)
 5:     C ← C ∪ C_r
 6: i = 0
 7: L_i ← ∅
 8: repeat
 9:     i = i + 1
10:     L_i ← Generate_Level_with_Static_Pruning(L_{i−1}, E)
11:     for each s ∈ L_i do
12:         if Runtime_Pruning(s, α, β, E, C) = FALSE then
13:             if (M ← Compute_VGFD(s, α, E, C)) ≠ ∅ then
14:                 S ← S ∪ (s, M)  // M is the set of value mappings of s.
15: until L_i = ∅ or i >= DEPTH_LIMIT
16: return S

Our algorithm starts with level 0. On each new level, it first generates possible VGFDs on this level based on the results of previous levels and it also eliminates many potential VGFDs from further consideration based on some easily computed heuristics (Section 3.1). Then, a runtime pruning (Section 3.3) and a detailed computation (Section 3.2) are conducted on each candidate VGFD. The whole process can terminate at a specified level, or after all levels, although the latter is usually unnecessary and unrealistic. The process returns each VGFD and its value mappings, which are used for detecting violations.

3.1 Heuristics for Static Pruning
We first define the discriminability for a property as the number of distinct values divided by the sum of the property extension, and when it is compared between properties, it is over the instances they have in common. Then, static pruning heuristics used to eliminate potential VGFDs from further consideration are: 1. insufficient subject overlap between the LHS and the RHS, 2. the LHS or RHS has too high a discriminability, 3. the discriminability of the LHS is less than that of the RHS. The information for rule 1 can be acquired from an ontology (e.g. using domain and range information) or a simple relational join on data. Here insufficient
overlap means too few common subjects, e.g. 20. For rule 2, if the discriminability is close to one, e.g. 0.95 which means 95%, the property functions like a superkey in a database. Since such keys identify an individual, they are not useful for detecting abnormal data. For rule 3, if there is a mapping between two such properties, some values of the property with smaller discriminability must be mapped to different values on the RHS which would not be a functional mapping. In order to apply these heuristics, we make the additional observations: 1. The discriminability of a Pcomp (Pconj resp.) is generally no greater than (no less than resp.) that of each property involved. 2. A Pconj (Pcomp resp.) cannot be based on two properties that have few common subjects (objects and subjects resp.). 3. All children of a true VGFD on the level-wise search graph are also true VGFDs, but are not minimal. For example, given a Pcomp A ◦ B, its values all come from the values of B and its extension is a subset of the Cartesian product between objects of A and subjects of B, then its discriminability, i.e. the distinct values divided by the usages, should be no greater than that of each component. A similar explanation applies for Pconj in observation 1. An extension of the observation 2 is that a Pconj cannot be followed by other properties in a property chain, e.g. (A+B)◦C, since its values are tuples (e.g. the values of A + B) as opposed to the instances and literals that are the subjects of another property (e.g. subjects of C).

Fig. 1. An example of level-wise discovering process. We suppose that (1) property A and B have few common subjects, (2) the discriminability of B is less than that of C and (3) D has a high discriminability.
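The three static pruning rules can be sketched for a pair of simple properties as follows. This is an illustrative Python sketch, not the authors' implementation: the function names and the representation of a property as a list of (subject, value) pairs are hypothetical, the thresholds default to the example figures mentioned in the text (20 common subjects, discriminability 0.95), and computing all discriminabilities over the common subjects is an assumption of the sketch.

```python
def discriminability(pairs):
    """Distinct values divided by the number of (subject, value) usages."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return len({value for _, value in pairs}) / len(pairs)


def static_prune(lhs, rhs, min_overlap=20, max_disc=0.95):
    """Return the name of the rule that prunes the candidate lhs -> rhs, or None.

    `lhs` and `rhs` are lists of (subject, value) pairs (hypothetical layout).
    """
    common = {s for s, _ in lhs} & {s for s, _ in rhs}
    # Rule 1: insufficient subject overlap between the LHS and the RHS.
    if len(common) < min_overlap:
        return "rule 1"
    # Assumption: both sides are measured over the subjects they share.
    d_lhs = discriminability([(s, v) for s, v in lhs if s in common])
    d_rhs = discriminability([(s, v) for s, v in rhs if s in common])
    # Rule 2: a side that behaves almost like a key is not useful.
    if d_lhs >= max_disc or d_rhs >= max_disc:
        return "rule 2"
    # Rule 3: the LHS must discriminate at least as finely as the RHS.
    if d_lhs < d_rhs:
        return "rule 3"
    return None


# Hypothetical toy properties over eight subjects.
zip_code = [(f"p{i}", i % 4) for i in range(8)]       # discriminability 0.5
state = [(f"p{i}", (i % 4) // 2) for i in range(8)]   # discriminability 0.25
key = [(f"p{i}", i) for i in range(8)]                # discriminability 1.0
```

With a lowered overlap threshold, `static_prune(zip_code, state, min_overlap=5)` keeps the candidate, while the reversed candidate is pruned by rule 3 and any candidate involving `key` is pruned by rule 2.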
Fig.1 is an example showing how these heuristics are useful in the level-wise searching. Each edge is from a VGFD to a VGFD with an LHS that is a superset of the parent LHS and the two VGFDs have the same RHS. Note, our current algorithm does not support the use of composite properties on the RHS. The VGFDs pruned by the above heuristics are in dotted boxes and dotted lines pointing to the children pruned. For this example, we make assumptions typical of real RDF data. For instance, in DBPedia less than 2% of all possible pairs of properties share sufficient common instances. So following our heuristics, four VGFDs on level 1 are pruned: A → B is due to heuristic rule 1, B → C is due to rule 3 and the other two are due to rule 2. Then the children of A → B and
A → D are pruned due to the same reason as their parents. A + B → C on level 2 and (A ◦ D) + B → C on level 3 are pruned due to the first assumption plus the observation 2. Finally, A + D → C on level 2 and (A ◦ B) + D → C on level 3 are pruned due to the observation 1 and heuristic rule 2. From this example, we can see simple conditions can reduce the level-wise search space greatly based on these heuristics.

3.2 Handling Multi-valued Properties
The fundamental difference between VGFD and FD when computing VGFD is that we consider multi-valued properties. When finding FDs in databases, the multi-valued attributes either are not considered (if they are not in the same relation), or the correlation of their values is given by having separate tuples for each value. RDF frequently has multi-valued properties without any explicit correlation of values, e.g. in DBPedia, more than 60% of properties are multi-valued. When computing a VGFD, we try to find a functional mapping from each LHS value to an RHS value such that this mapping maximizes the number of correlations. We consider any two values for a pair of multi-valued properties that share a subject to be a possible mapping. Then we greedily select the LHS value that has the most such mappings and remove all other possible mappings for this LHS value. If multiple RHS values are tied for the largest number of mappings, then we pick the one that appears in the fewest mappings so far.

Table 1. The left table is the triple list. The right table is mapping count.

deptNo              deptName
subject  object     subject  object
A        1          A        CS
A        2          A        EE
B        1          B        EE
C        2          C        CS
D        2          D        EE

Candidate Value Mapping   Count
1 → EE                    2
2 → EE                    2
2 → CS                    2
1 → CS                    1

Consider Table 1 which analyzes the dependency deptNo → deptName. The triples are given to the left and each possible value mapping and its maximal possible count are listed in descending order to the right. The maximal count of 1 → EE is 2, because these two values co-occur for instances A and B once for each. We first greedily choose mapping 1 → EE, because it has the largest count among all mappings for deptNo = 1. After this selection, the mapping 1 → CS is removed since deptNo = 1 has been mapped. Then for deptNo = 2, to maximize the number of distinct values being matched on both sides, we choose (2, CS) since CS has been mapped to by fewer LHS values than EE. Note the basic equality used here is a special case of cluster-based Dependency Equality. For example, if CS and EE are clustered together, then the mappings will be 1 → EECS and 2 → EECS, where EECS is the cluster.
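The greedy selection on the Table 1 data can be traced with a short Python sketch. It is a simplified illustration rather than the authors' Compute_VGFD routine: the function names are hypothetical, the order in which LHS values are processed (largest best count first, ties broken by value) is an assumption, and clusters are ignored so that plain equality is used. The sketch also reports the share of LHS triples that agree with the selected mapping.

```python
from collections import Counter, defaultdict


def greedy_value_mapping(lhs_triples, rhs_triples):
    """Pick one RHS value per LHS value from co-occurrence counts."""
    rhs_by_subject = defaultdict(set)
    for subject, value in rhs_triples:
        rhs_by_subject[subject].add(value)

    # Count every (LHS value, RHS value) pair that shares a subject.
    counts = defaultdict(Counter)
    for subject, x in lhs_triples:
        for y in rhs_by_subject.get(subject, ()):
            counts[x][y] += 1

    mapping, times_chosen = {}, Counter()
    # Assumed processing order: LHS values with the largest best count first.
    for x in sorted(counts, key=lambda v: (-max(counts[v].values()), str(v))):
        best = max(counts[x].values())
        tied = [y for y, c in counts[x].items() if c == best]
        # Tie-break: prefer the RHS value used in the fewest mappings so far.
        y = min(tied, key=lambda v: (times_chosen[v], str(v)))
        mapping[x] = y
        times_chosen[y] += 1
    return mapping


def agreement(lhs_triples, rhs_triples, mapping):
    """Share of LHS triples whose subject also carries the mapped RHS value."""
    rhs_by_subject = defaultdict(set)
    for subject, value in rhs_triples:
        rhs_by_subject[subject].add(value)
    matches = sum(1 for s, x in lhs_triples if mapping.get(x) in rhs_by_subject[s])
    return matches / len(lhs_triples)


dept_no = [("A", 1), ("A", 2), ("B", 1), ("C", 2), ("D", 2)]
dept_name = [("A", "CS"), ("A", "EE"), ("B", "EE"), ("C", "CS"), ("D", "EE")]

mapping = greedy_value_mapping(dept_no, dept_name)   # {1: 'EE', 2: 'CS'}
score = agreement(dept_no, dept_name, mapping)       # 0.8
```

Only the triple (D, 2) disagrees with the mapping, since D has deptName EE rather than CS, which gives 4 matches out of 5 LHS triples.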
Our confidence in a VGFD depends on how often the data agree with it, i.e., the total matches divided by the size of the LHS’s extension, e.g. the VGFD above has the confidence of 4/5 = 0.8. In this work, we set the confidence threshold α = 0.9 to ensure that patterns are significant, while allowing for some variation due to noise, input errors, and exceptions.

3.3 Run-time Pruning

In the worst case, the expensive full scan of value pairs must occur $|R^+| \cdot |R^\circ|$ times. So we propose to use mutual information (MI) computed over sampled value pairs for estimating the degree of dependency. In Algorithm 2, given a candidate VGFD s: X → Y, we start by randomly selecting a percentage β of the instances. In line 2, for each instance i, we randomly pick a pair of values from $V^+(i, X)$ and $V^\circ(i, Y)$. Distribution() also applies the clusters $C_X$ and $C_Y$ and returns these pairs in lieu of the actual values. In information theory, the MI $I_{XY}$ of two random variables X and Y is formally defined as $I_{XY} = \sum_i \sum_j p_{i,j} \log (p_{i,j} / (p_i p_j))$, where $p_i$ and $p_j$ are the marginal probability distribution functions of X and Y respectively, and $p_{i,j}$ is their joint probability distribution function. Intuitively, MI measures how much knowing one of these variables reduces our uncertainty about the other. Furthermore, the entropy coefficient (EC), using MI, measures the percentage reduction in uncertainty in predicting the dependent variable based on knowledge of the independent variable. When it is zero, the independent variable is of no help in predicting the dependent variable; and when it is one, there is a full dependency. The EC is directional and EC(X|Y) for predicting the variable X with respect to variable Y is defined as $I_{XY} / E_Y$, where $E_Y$ is the entropy of variable Y, formally $\sum_j p_j \log (1/p_j)$. $I_{XY}$ can also be expressed as $E_X + E_Y - E_{XY}$, which has an easier form to compute.
Algorithm 2. Runtime_Pruning(s, α, β, E, C), s is a candidate VGFD X → Y; α is the confidence threshold for a VGFD; β is the sampling size as a percentage; E is a set of triples; C is a set of cluster sets for each property.
1: I ← Sampling_Subjects(s, β, E)  // Sampled subjects shared by the LHS and RHS.
2: {(X_i, Y_i)} ← Distribution(s, I, E, C)  // A list of value pairs where each pair consists of two single sampled values of LHS and RHS for the same subject.
3: $E_X = -\sum_{\text{distinct } x \in \{X_i\}} \frac{|\{X_i \mid X_i = x\}|}{|\{X_i\}|} \cdot \log \frac{|\{X_i \mid X_i = x\}|}{|\{X_i\}|}$
4: $E_Y = -\sum_{\text{distinct } y \in \{Y_i\}} \frac{|\{Y_i \mid Y_i = y\}|}{|\{Y_i\}|} \cdot \log \frac{|\{Y_i \mid Y_i = y\}|}{|\{Y_i\}|}$
5: $E_{XY} = -\sum_{\text{distinct } (x,y) \in \{(X_i, Y_i)\}} \frac{|\{(X_i, Y_i) \mid X_i = x \wedge Y_i = y\}|}{|\{(X_i, Y_i)\}|} \cdot \log \frac{|\{(X_i, Y_i) \mid X_i = x \wedge Y_i = y\}|}{|\{(X_i, Y_i)\}|}$
6: if $(E_X + E_Y - E_{XY}) / E_X < \alpha - 0.2$ then
7:     return TRUE
8: return FALSE
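Lines 3-6 of Algorithm 2 translate fairly directly into code once the sampled value pairs are available. The Python sketch below assumes the sampling and clustering of lines 1-2 have already produced a list of (x, y) pairs; the function names are hypothetical, natural logarithms are used (the ratio does not depend on the base), and the handling of a constant LHS sample, which the text does not specify, is an assumption.

```python
import math
from collections import Counter


def entropy(samples):
    """Empirical entropy of a list of hashable samples (natural log)."""
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in Counter(samples).values())


def entropy_coefficient(pairs):
    """(E_X + E_Y - E_XY) / E_X over sampled (x, y) pairs, as in line 6."""
    e_x = entropy([x for x, _ in pairs])
    e_y = entropy([y for _, y in pairs])
    e_xy = entropy(list(pairs))
    if e_x == 0:
        return None  # constant LHS sample: the ratio is undefined
    return (e_x + e_y - e_xy) / e_x


def runtime_pruning(pairs, alpha=0.9, margin=0.2):
    """Return True if the candidate should be pruned (lines 6-8)."""
    ec = entropy_coefficient(pairs)
    if ec is None:
        # Assumption of this sketch: keep the candidate and let the full
        # computation decide when the sample carries no LHS variation.
        return False
    return ec < alpha - margin


# Hypothetical samples: y fully determined by x, and y independent of x.
dependent = [(1, "a"), (2, "b")] * 10
independent = [(1, "a"), (1, "b"), (2, "a"), (2, "b")] * 5
```

For the fully dependent sample the coefficient is 1 and the candidate survives; for the independent sample it is approximately 0, well below α − 0.2 = 0.7, so the candidate is pruned.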
We note that Paradies et al. [18] also used entropy to estimate the dependency between two columns in databases. Since they want to determine attribute pairs
that can be estimated with high certainty, i.e. focusing on precision of the positives, they need a complex statistical estimator. In contrast, our aim is a fast filter that is good enough to remove most negatives, i.e. independent pairs, thus a statistical estimator is not necessary. We can avoid missing positives by setting a low enough threshold. In our experiments, the difference between EC for a 20% sample and EC of full data is less than 0.15 on average and the estimated values typically have higher ECs. For example, it is very rare that a VGFD estimated lower than 0.7 has an actual value above 0.9. Therefore, a threshold of 0.2 less than α (line 6) is a reasonable lower bound for filtering out independent pairs.
4 Clustering Property Values

As introduced in Section 1, we must cluster property values in order to discover dependencies that allow for rounding errors, measurement errors, and distributions of values. For object property values, clustering groups all identifiers that stand for the same real world object by computing the transitive closure of sameAs. The rest of this section discusses clustering the values for each datatype property. This is used to determine Dependency Equality (Definition 6) between two objects.

4.1 Pre-clustering
The pre-clustering process is a light-weight computation that provides two benefits for the finer clustering later: it gives the minimum number of clusters, and it reserves expensive distance calculations for pairs of points within the same pre-cluster. Since the pre-clustering is used for VGFD discovery, we make the following considerations. First, the values to be clustered are from various properties and have very different features. So the clustering process needs to be generic in two aspects: (1) a pair-wise distance metric that is suitable for different types of values and multiple feature dimensions, and (2) suitable for the most common distribution in the real world, i.e. the normal distribution. Second, we prefer a comparatively larger number of clusters where elements are really close (if not, they may not be clustered). The reason is that the clusters will be used as class types for detecting dependencies. Larger values of k generate finer-grained class types, which in turn allow us to generate more precise VGFDs, albeit at the risk of blurring boundaries between classes and making it harder to discover some dependencies. This point also makes our approach different from many other pre-clustering approaches, e.g. [19], because their results of pre-clustering can be overlapped and rigid clustering later could merge these groups into fewer clusters. Based on the above thoughts, specifically, given a list of values, the process first selects a value that is closest to the center (we choose the mean for numeric values and discuss strings in the next paragraph), and then moves it from the list to be the centroid of a new group. Second, for each value on the list, if the distance to this centroid is within the threshold (we use the standard deviation), it will be moved from the list to this new group. Finally, the above process is repeated
if the list is not empty. Thus the process generally finds the cluster around the original center first, and then the clusters further away from the center. This is much better than random selection, because if an outlier is selected, then most instances remain on the list for clustering after this round of computation. To compute the center and distance of string values, we compute the weight of each token in a string according to its frequency in values for the property. Then we pick the string that has the largest sum of weights divided by the number of tokens in it as the center and the distance between two strings is the sum of weights of the different tokens in them. The intuition is that by taking these strings as a class, the most representative one is the one with the most common words. For example, the property color in DBPedia has values “light green”, “lime green”, etc. Then, the representative of these two strings is the common word “green”. For “light green”, the distance to “lime green” will be less than that to “light red”, since “red” and “green” are more common and have larger weights.

4.2 Optimal k-Means Clustering
There are several popular clustering methods, e.g. k-Means, Greedy Agglomerative Clustering, etc. However most of them need a parameter for the number of resulting clusters. To automatically find the most natural/best clusters, we designed the following unsupervised method of finding optimal clusters. The approach is inspired by the gap statistic [20] which is used to cluster numeric values with a gradually increasing number of clusters. The idea is that when we increase k to above the optimum, e.g. adding a cluster center in the middle of an already ideal cluster, the pooled within-cluster sum of squares around the cluster mean decreases more slowly than its expected rate. Thus the gap between the expectation and actual improvement over different k will be in a shape with an inflexion which indicates the best k. Our approach improves upon this idea in three aspects: we start at the number of pre-clusters instead of 1; in each round of k-Means, the initial centroids are selected according to pre-clusters; and the distance computation is only made among points within the same pre-cluster. Our Optimal_kMeans algorithm is presented as Algorithm 3. At first, k is set to the number of pre-clusters. At each iteration, we increment k and select k random estimated centroids $m_i$, each of which starts a new cluster $c_i$. Init() selects the centroids from the pre-clusters in proportion to their sizes. In each inner loop (lines 8-13), every value is labeled as a member of the cluster whose centroid has the shortest distance to this instance among all centroids that are within the same pre-cluster as that value (line 10). Then each centroid is recomputed based on the cheap distance metric until the centroid does not change. After each round of modified k-Means clustering, we compute the difference on Gap(k) and stop the process if it is an inflexion point.
Since the clustering is used to detect abnormal data, in which abnormal string values are expected to be caused by accidental input or data conversion, we use edit distance as the distance metric for string values in this clustering, as opposed to the token-based metric of the pre-clustering above.
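The restricted assignment step of the inner loop (lines 8-13 of Algorithm 3) can be sketched for one-dimensional numeric values as follows. This Python sketch covers only that inner loop: the gap statistic, the Init() procedure and string values are left out, and the function name, the data layout (each centroid tagged with the id of the pre-cluster it belongs to) and the iteration cap are assumptions made for illustration.

```python
def restricted_kmeans(values, group_of, centroids, max_iter=100):
    """k-Means inner loop where a value only competes among centroids
    that belong to the same pre-cluster as the value.

    values    : list of floats
    group_of  : list of pre-cluster ids, parallel to `values`
    centroids : list of (pre_cluster_id, position) pairs; every pre-cluster
                is assumed to own at least one centroid
    """
    centroids = list(centroids)
    members = [[] for _ in centroids]
    for _ in range(max_iter):
        members = [[] for _ in centroids]
        for v, g in zip(values, group_of):
            # Only centroids from the value's own pre-cluster are candidates.
            candidates = [i for i, (cg, _) in enumerate(centroids) if cg == g]
            best = min(candidates, key=lambda i: abs(v - centroids[i][1]))
            members[best].append(v)
        # Recompute each centroid as the mean of its members (kept if empty).
        updated = [
            (g, sum(m) / len(m)) if m else (g, c)
            for (g, c), m in zip(centroids, members)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, members


# Hypothetical data: two pre-clusters, the first split into two clusters.
data = [1.0, 2.0, 9.0, 10.0, 50.0, 52.0]
pre_cluster = [0, 0, 0, 0, 1, 1]
start = [(0, 0.0), (0, 12.0), (1, 40.0)]
final_centroids, final_members = restricted_kmeans(data, pre_cluster, start)
```

Because candidates are filtered by pre-cluster, the values 50.0 and 52.0 are never compared against the centroids of pre-cluster 0, which is where the saving in distance computations comes from.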
Algorithm 3. Optimal_kMeans(L, groups), L is a set of literal values; groups is a set of pre-clustered groups of L.
 1: k = |groups|
 2: Gap(k) = Gap_Statistic(groups)
 3: tmpC ← groups
 4: repeat
 5:     k = k + 1, C ← tmpC, tmpC ← ∅  // tmpC is the set of k clusters
 6:     for each i ≤ k do
 7:         Init(m_i), c_i ← c_i ∪ m_i, tmpC ← tmpC ∪ c_i  // m_i is the center of each cluster
 8:     repeat
 9:         for each x ∈ L do
10:             c_i ← c_i ∪ arg min_{m_i ∈ Group(x)} Distance(x, m_i)
11:         for each i ≤ k do
12:             m_i = Mean(c_i)
13:     until ∀i ≤ k, m_i converges
14:     Gap(k) = Gap_Statistic(tmpC)
15: until Gap(k) < Gap(k − 1)
16: return C

5 Experimental Results
For our experiments, we selected the Semantic Web Research Corpus2 (SWRC), DBPedia and RKB3 data sets. All of them are widely used subsets of Linked Data that cover different domains. Experiments were conducted on a Sun workstation with 8 Xeon 2.8G cores and 6G memory. We observed that there are few dependencies with an LHS size larger than four and that such dependencies tend to have less plausible meanings. For this reason, we set the maximal size of a VGFD to four in our experiments. Based on clusters and VGFDs, abnormal data has two types: one is far away from other clusters and the other is a violation of VGFDs. Specifically, in this work, a triple is reported as an outlier if its value is the only element of some cluster whose distance to the nearest cluster centroid is above twice of the average distances between all clusters for this property. A triple is reported as abnormal due to violation of VGFDs only when its value conflicts with a value mapping determined by some VGFD and this value mapping is confirmed by other triple usages more than twice. In our first experiment, we compared the overall performance of the system on three data sets. The sampling size β used in runtime pruning is 20%. In Table 2, we can see that the running time appears to be more heavily influenced by the number of properties than the data set size. Note that RKB has more triples but fewer properties than DBPedia, and thus has more triples per property. This leads to a longer clustering time, but thanks to static and runtime pruning, the total time to find VGFDs is less. Table 3 gives some VGFDs from the three data sets and their short descriptions. In DBPedia, among 200 samples out of 2868 abnormal triples, 173 of them 2 3
http://data.semanticweb.org/ http://www.rkbexplorer.com/data/
806
Y. Yu and J. Heflin Table 2. System overall performance on SWRC, DBPedia and RKB data sets SWRC DBPedia RKB Number of Triples (M) / Properties 0.07 / 112 10 / 1114 38 / 54 Discovered VGFDs on Level 1 12 228 6 Discovered VGFDs on Level 2 37 304 3 Discovered VGFDs on Level 3 2 126 0 Discovered VGFDs on Level 4 0 53 0 Time for Clustering (s) 18 114 396 Time for Level 1 (s) 11 172 67 Time for Level 2 (s) 20 246 44 Time for Level 3 (s) 4 108 0 Time for Level 4 (s) 1 47 0 Total Time (s) / Discovered VGFDs 54 / 51 687 / 721 507 / 9 Reported Abnormal Triples 75 2868 227
Table 3. Some VGFDs from the three data sets. The first and second group of VGFDs are of size 1 and 2. The third group is a set of VGFDs with clustered values. VGFD genus→family writer→genre teamOwner→chairman composer→mediaType militaryRank→title location→nearestCity topic→primaryTopic manufacturer+oilSystem →compressionRatio publisher ◦ country →language article-of-journal+has-volume →has date faculty→budget militaryRank→salary occupation→salary type→upperAge
Description Organisms in the same genus also have the same family. A work’s writer determines the work’s genre. The teams with the same owner also have the same chairman. The works by the same composer have the same media type. The people of the same military rank also have the same title. The things on the same location have the same nearest city. The papers with the same topic have the same primary topic. The manufacturer and oil system determine the engine’s compression ratio. The publisher’s country determines the language of that published work. The volume number of a journal where an article is published determines the published date of this article. The size of the faculty determines the budget range. The military rank determines the range of salary. The occupation determines the range of salary. A school’s type determines the range of upper age.
(86.5%) are confirmed to be true errors in the original data. The correctness of 10 of the remaining triples was difficult to judge. SWRC and RKB have 51% and 62% precision respectively. We believe the lower precision for SWRC is because it has a higher initial data quality and its properties have a much smaller set of possible values than those of DBPedia. We list a number of confirmed erroneous triples in Table 4. For example, the first triple is reported as an outlier after automatic clustering. The second triple violates the VGFD that a school’s type determines the cluster of its upper age, because the triple’s subject is a certain type of school while its value is not in the cluster of values for the same type of schools. Next, to check the impact of our pruning algorithms, we performed an ablation study using DBPedia that removes these steps. The left part of Table 5 shows that using static and runtime pruning respectively saves over 62% and
Extending Functional Dependency to Detect Abnormal Data in RDF Graphs
Table 4. Some confirmed erroneous triples in the three data sets, where r, o, i, p, s are prefixes for http://www.dbpedia.org/resource/, http://www.dbpedia.org/ontology/, http://acm.rkbexplorer.com/id/, http://www.aktors.org/ontology/portal/ and http://data.semanticweb.org/, respectively.
55% of the time compared to using neither. Because they utilize different characteristics, using them together saves 85% over neither. When we do not prune, the few additional VGFDs discovered lead to fewer abnormal triples than those discovered with pruning (on average 2.2 per VGFD vs. 3.97 per VGFD). Thus the pruning techniques not only save time but also have little effect on abnormality detection.

Table 5. The left table shows the impact of our pruning techniques. The right table compares our pre-clustering with an alternative called SortSeq, in terms of the VGFDs that use the clusters and the abnormal data found based on these VGFDs.

             None    Static   Runtime   Both
Time (s)     4047    1529     1817      687
VGFDs        746     741      729       721
Abnormal     2923    2915     2887      2868

             Preclustering   SortSeq
Time (s)     114             83
VGFDs        42              23
Abnormal     625             391
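The percentages and per-VGFD averages quoted for the pruning ablation can be re-derived from the left part of Table 5. The following is a minimal sketch of that arithmetic; the dictionary layout and the function names are illustrative assumptions, and only the numbers come from the table.

```python
# Figures copied from the left part of Table 5 (DBPedia ablation study).
# The structure and the function names below are hypothetical helpers.
RUNS = {
    "None":    {"time_s": 4047, "vgfds": 746, "abnormal": 2923},
    "Static":  {"time_s": 1529, "vgfds": 741, "abnormal": 2915},
    "Runtime": {"time_s": 1817, "vgfds": 729, "abnormal": 2887},
    "Both":    {"time_s": 687,  "vgfds": 721, "abnormal": 2868},
}


def time_saving(config: str) -> float:
    """Fraction of running time saved relative to the unpruned run."""
    base = RUNS["None"]["time_s"]
    return (base - RUNS[config]["time_s"]) / base


def abnormal_per_vgfd(config: str) -> float:
    """Average number of abnormal triples reported per discovered VGFD."""
    return RUNS[config]["abnormal"] / RUNS[config]["vgfds"]


def marginal_abnormal_per_vgfd() -> float:
    """Abnormal triples per VGFD for the VGFDs found only without pruning."""
    extra_vgfds = RUNS["None"]["vgfds"] - RUNS["Both"]["vgfds"]
    extra_abnormal = RUNS["None"]["abnormal"] - RUNS["Both"]["abnormal"]
    return extra_abnormal / extra_vgfds


if __name__ == "__main__":
    for name in ("Static", "Runtime", "Both"):
        print(f"{name}: {time_saving(name):.1%} of the time saved")
    print(f"extra VGFDs without pruning: {marginal_abnormal_per_vgfd():.2f} abnormal triples each")
    print(f"VGFDs with both prunings:    {abnormal_per_vgfd('Both'):.2f} abnormal triples each")
```

With the tabulated times, static pruning alone saves about 62% and runtime pruning alone about 55%. Applied to the combined run, the same formula gives about 83%. The 25 extra VGFDs found without pruning account for 55 extra abnormal triples, which is the 2.2 per VGFD quoted above.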
Besides pruning, we also checked the impact of our pre-clustering. Because our approach is based on a generic pair-wise distance, we wanted to compare it with a simpler one based on the linear ordering of values, where the distance is just the difference between numbers. After each iteration of clustering around the mean, this alternative, referred to as SortSeq, recursively clusters the two remaining value sets: the values above the mean and the values below it. To handle strings in this approach, we sort them alphabetically and assign each a sequence number. The right of Table 5 shows that this baseline clustering yields fewer VGFDs and less abnormal data than our approach. Among
the VGFDs not found by SortSeq, most are for string values. SortSeq finds fewer VGFDs and less abnormal data because it naively assumes that the more leading characters two strings share, the more similar they are. Thus, our pre-clustering, which uses cheap and generic computation, captures the characteristics of different property values.
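To make the SortSeq baseline more concrete, here is a rough sketch of the recursion described above. The text does not say how values are grouped "around the mean", so the fixed `radius` parameter is an assumption of this sketch, as are the function names `sort_seq` and `sequence_numbers`.

```python
from statistics import mean


def sort_seq(values, radius):
    """Cluster numbers by recursively grouping around the mean.

    Assumption: a value joins the central cluster when it lies within
    `radius` of the mean. The remaining values below and above the mean
    are then clustered recursively, as in the SortSeq description.
    """
    if not values:
        return []
    centre_point = mean(values)
    below, centre, above = [], [], []
    for v in values:
        delta = v - centre_point
        if abs(delta) <= radius:
            centre.append(v)
        elif delta < 0:
            below.append(v)
        else:
            above.append(v)
    clusters = sort_seq(below, radius)
    if centre:
        clusters.append(centre)
    clusters.extend(sort_seq(above, radius))
    return clusters


def sequence_numbers(strings):
    """Map each distinct string to its rank in alphabetical order."""
    return {s: i for i, s in enumerate(sorted(set(strings)))}


if __name__ == "__main__":
    print(sort_seq([1, 2, 3, 50, 51, 100], radius=5))
    print(sequence_numbers(["Boston", "Austin", "Berlin"]))
```

In the string example, "Berlin" and "Boston" receive adjacent sequence numbers only because they share a leading character, which illustrates the naive similarity assumption discussed above.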
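[Figure 2. Left panel: running time (sec) against number of properties, with one curve per level (levels 1 to 4). Right panel: running time (sec) and discovered VGFDs against sampling size in runtime pruning (1% to 50%).]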
Fig. 2. The left is the effect of number of properties on the VGFD searching time. The right is the effect of sampling size in runtime pruning on the VGFD searching time.
Knowing that pre-clustering and pruning are useful for the system, we examined how system performance, especially running time, scales when these techniques are used. To be comparable on data set size, we picked subsets of properties from DBPedia. For each subset size, we randomly drew 10 different groups of properties of this size and averaged the time over the 10 runs. The left of Fig. 2 shows that the time for every level follows an almost linear trend. The right of Fig. 2 shows the effect of the sampling size β used in runtime pruning on the system. We see that the running time grows linearly with the sampling size. As the VGFD curve shows, β = 0.2 is sufficient to find most dependencies for DBPedia.
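The two abnormality criteria applied throughout these experiments (outliers relative to the clusters, and violations of a VGFD) can be sketched as follows. This is an illustration under stated assumptions rather than the system's implementation: values are numeric, distance is the absolute difference, a centroid is the cluster mean, "average distance between all clusters" is read as the mean pairwise centroid distance, and the function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from statistics import mean


def outlier_values(clusters):
    """Return values that form a singleton cluster lying far from the rest.

    A singleton is reported when its distance to the nearest other centroid
    exceeds twice the mean pairwise distance between all centroids.
    """
    centroids = [mean(c) for c in clusters]
    if len(centroids) < 2:
        return []
    avg_gap = mean(abs(a - b) for a, b in combinations(centroids, 2))
    outliers = []
    for i, cluster in enumerate(clusters):
        if len(cluster) != 1:
            continue
        nearest = min(abs(centroids[i] - c)
                      for j, c in enumerate(centroids) if j != i)
        if nearest > 2 * avg_gap:
            outliers.append(cluster[0])
    return outliers


def vgfd_violations(rows, min_support=3):
    """Return rows whose right-hand value conflicts with a confirmed mapping.

    `rows` holds (subject, lhs_value, rhs_value) tuples for one VGFD.
    A mapping counts as confirmed when at least `min_support` rows agree
    on it, i.e. it is used more than twice by other triples.
    """
    by_lhs = defaultdict(Counter)
    for _, lhs, rhs in rows:
        by_lhs[lhs][rhs] += 1
    violations = []
    for subject, lhs, rhs in rows:
        expected, support = by_lhs[lhs].most_common(1)[0]
        if rhs != expected and support >= min_support:
            violations.append((subject, lhs, rhs, expected))
    return violations
```

For a VGFD such as genus→family, a row is reported only when at least three other subjects with the same genus agree on a different family. A disagreement against a mapping seen only twice is left unreported.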
6 Conclusion
We have presented a system to detect Semantic Web data that are abnormal and thus likely to be incorrect. Inspired by functional dependency in databases, we introduce value-clustered graph functional dependency (VGFD), which differs from functional dependency in three fundamental ways in order to better deal with Semantic Web data. First, the properties involved in a VGFD span the whole data set schema instead of a single relation. Second, property value correlations, especially for multi-valued properties, are not explicitly given in RDF data. Third, using clusters for values greatly extends the detection of dependencies. Focusing on these unique characteristics, our system efficiently discovers VGFDs and effectively detects abnormal data, as shown in experiments on three Linked Data sets. In the future we plan to use subclass relationships to further generalize object property values. We would also like to take other features into account when clustering, for example string patterns.