Fig. 3. Branch Classifications (left: a sample XML document with branch instances B1-B3; right: the ordered extended DataGuides of branch classes C1 and C2, where C1 = (a, a/b{@key}, a/b/c, a/b/c/e, a/b/c/e/f) and C2 = (a, a/j, a/j/k, a/j/k/m, a/j/k/m/n))
Fig. 3 depicts a sample XML document showing three branch instances, B1-B3 (left), and the extended DataGuides associated with two branch classes, C1 and C2 (right). Note that the order of the extended DataGuides associated with each branch class is important. After classification, if B1 and B3 have an identical set of descendant branch instances, they will be instances of the C1 class, while branch B2 is an instance of the C2 class. Finally, the process that maintains parent-child relationships between branch instances (discussed earlier) must be replaced with one that maintains parent-child relationships between branch classes. The ancestor-descendant relationships are then generated for branch classes in the same manner as they were for branch instances.
4 Index Deployment and Query Processing
In this section, we describe the indexing constructs resulting from the indexing process in §3. Following this, we give an overview of our query processing approach and continue with a worked example to illustrate how query optimization is achieved. Using the sample XML document in Fig. 4, Tables 1-3 illustrate the NODE, NCL (Name/Class/Level), and CLASS indexes respectively. The NODE index contains an entry for each node in the XML document. The NCL is generated by selecting each distinct name, class, level and type from the NODE index. The CLASS index contains ancestor-descendant mappings between branch classes, where the attributes ac and dc are the ancestor-or-self classes and descendant-or-self classes respectively. The NCL index allows us to bypass, i.e. avoid processing, large numbers of nodes (discussed shortly). In the traditional approach to XPath query processing, there is a two-step process: (1) retrieve nodes (based on the XPath axis and NodeTest), (2) input these nodes to the subsequent step (i.e. context nodes), or return them as the result set (if the current step is the rightmost step in the path expression). In partitioning approaches, a third step is added. Thus, the query process is performed in the following steps:
1. Identify the relevant partitions, i.e. prune the search space.
2. Retrieve the target nodes from these partitions, i.e. by checking their labels (e.g. pre/post, dewey).
3. Input these nodes to the subsequent step, or return them as the result set.

The NODE and CLASS indexes are sufficient to satisfy all three steps, where the CLASS index prunes the search space (step 1), thus optimizing step 2. However, ultimately we are only concerned with the nodes that are output from the rightmost step in an XPath expression, as these will form the result set for the query. Nodes that are processed as part of the preceding steps are only used to navigate to these result nodes. Using the NCL index instead of the NODE index (where possible) enables us to bypass (or avoid processing) many of these nodes that are only used to navigate to the result set, thus step 2 is optimized further.
Fig. 4. XML Snippet taken from the DBLP Dataset

Table 1. Node Index
label            name    type level class value
(0,13)|0         dblp    1    0     n/a
(1,4)|0.0        article 1    1     5
(2,0)|0.0.0      author  1    2     1
(3,3)|0.0.1      title   1    2     4
(4,1)|0.0.1.0    sub     1    3     2     2
(5,2)|0.0.1.1    i       1    3     3
(6,9)|0.1        article 1    1     5
(7,5)|0.1.1      author  1    2     1
(8,8)|0.1.2      title   1    2     4
(9,6)|0.1.2.1    sub     1    3     2     4
(10,7)|0.1.2.2   i       1    3     3
(11,12)|0.2      article 1    1     7
(12,10)|0.2.1    author  1    2     8
(13,11)|0.2.2    title   1    2     6     -

Table 2. NCL Index
NAME    CLASS LEVEL TYPE
author  1     2     1
sub     2     3     1
i       3     3     1
title   4     2     1
article 5     1     1
title   6     2     1
article 7     1     1
author  8     2     1

Table 3. CLASS Index
ac dc
1  1
2  2
3  3
4  2
4  3
4  4
5  1
5  2
5  3
5  4
5  5
6  6
7  6
7  7
7  8
8  8
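The paper does not give DDL for these structures. For orientation only, a minimal relational sketch consistent with the columns shown in Tables 1-3 might look as follows; the concrete data types, the decomposition of the label into pre/post/dewey columns, and the CTAS-style population of the NCL table are our assumptions rather than part of the original design.

```sql
-- Minimal sketch of the three indexing constructs, inferred from Tables 1-3.
-- Data types and the label decomposition (pre, post, dewey) are assumptions.
CREATE TABLE node (
  pre    INTEGER PRIMARY KEY,  -- pre-order rank (part of the label)
  post   INTEGER,              -- post-order rank (part of the label)
  dewey  VARCHAR(128),         -- dewey label
  name   VARCHAR(64),
  type   INTEGER,
  level  INTEGER,
  class  INTEGER,              -- branch class (referred to as BRANCH in Example 1 below)
  value  VARCHAR(256)          -- character content, where present
);

-- NCL: one row per distinct (name, class, level, type) combination in NODE.
CREATE TABLE ncl AS
  SELECT DISTINCT name, class, level, type FROM node;

-- CLASS: ancestor-or-self / descendant-or-self mappings between branch classes.
CREATE TABLE class (
  ac INTEGER,   -- ancestor-or-self class
  dc INTEGER    -- descendant-or-self class
);
```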
Bypassing is not possible across all steps in an XPath expression. Therefore, a selection process is required to choose which steps must access the NODE index, and which steps can access the (much smaller) NCL index instead. We are currently in the process of formally defining this process across all steps. Thus, in this paper we present the rules for the selection process that we have currently defined:

1. The NODE index must be used at the rightmost step in the path expression, i.e. to retrieve the actual result nodes. For example, see the rightmost step (/education) in Q1 (Fig. 5).
Fig. 5. Index Selection (per-step choice of NCL vs. NODE index for Q1: //people//person[.//address//zipcode]//profile/education and Q2: //people//person[.//address//zipcode = '17']//profile/education)
2. If the query does not evaluate a text node, the NCL index can be used in all but the rightmost step. For example, Q1 does not evaluate a text node, thus only the rightmost step accesses the NODE index as required by rule 1.
3. All steps that evaluate a text node must use the NODE index, e.g. //zipcode = '17' (Q2).
4. A step that contains a predicate filter that subsequently accesses a text node must use the NODE index, e.g. step two in Q2.

NODE index accesses are required to filter nodes based on the character content of text nodes, i.e. the VALUE attribute (Table 1), or to retrieve the result set for the rightmost step. The character content of text nodes was not considered during the branch classification process (§3) in order to keep the number of branch classes, and therefore the size of the CLASS index, small. However, NODE index accesses (based on the character content of text nodes) are efficient as they usually have high selectivity. In fact, where the character content of text nodes that do not have high selectivity can be identified, e.g. gender has only 2 possible values, they can be included as part of the classification process, ensuring high selectivity for all remaining NODE accesses. However, we are currently examining the cost/benefit aspects of including text nodes in our classification process.

Example 1. //people//person
1. SELECT * FROM NODE SRM WHERE SRM.TYPE = 1 AND SRM.NAME = 'person'
2. AND SRM.BRANCH IN (
3.   SELECT C1.DC FROM NCL N1, CLASS C1
4.   WHERE N1.NAME = 'people'
5.   AND N1.CLASS = C1.AC
6.   AND SRM.LEVEL > N1.LEVEL
7. )
8. ORDER BY SRM.PRE
In Example 1, notice that the NODE index is only accessed in the rightmost step (line 1). The layout of the final branch partitions (see Fig. 2) enables us to evaluate the ancestor (or self), descendant (or self), parent or child axis by checking the LEVEL attribute (line 6). Note, this would not be possible if we allowed overlap between branches (discussed in §3). Similar approaches must evaluate unique node labels, e.g. pre/post or dewey. An additional benefit of the fact that we do not allow overlap between branch classes is that the inefficient DISTINCT clause that is required by related approaches [11, 5] to remove duplicates from the result set can be omitted. Also, as large numbers of nodes are bypassed, the IN clause is efficient as the sub-query usually returns a small number of branch classes.
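For illustration, a child-axis variant of Example 1 would differ only in the LEVEL predicate on line 6. The query below is our sketch of //people/person in the same style; it is not taken from the paper, and the exact predicate the authors generate for the child axis is not shown in this section.

```sql
-- Hypothetical child-axis variant of Example 1 (//people/person);
-- only the LEVEL comparison changes with respect to the descendant axis.
SELECT *
FROM   node srm
WHERE  srm.type = 1
AND    srm.name = 'person'
AND    srm.branch IN (
         SELECT c1.dc
         FROM   ncl n1, class c1
         WHERE  n1.name  = 'people'
         AND    n1.class = c1.ac
         AND    srm.level = n1.level + 1   -- child instead of descendant
       )
ORDER BY srm.pre;
```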
5 Experiments
In this section, we compare our branch based approach to similar (lab-based) approaches. Following this, we evaluate how our approach performs against vendor systems. Experiments were run on identical servers with a 2.66GHz Intel(R) Core(TM)2 Duo CPU and 4GB of RAM. For each query, the time shown includes the time taken for: (1) the XPath-to-SQL transformation, (2) the SQL query execution, and (3) the execution of the SQL count() function on the PRE column of the result set. The latter is necessary as some SQL queries took longer to return all rows than others. Each query was executed 11 times, ignoring the first execution to ensure hot cache result times across all queries. The 10 remaining response times were then averaged to produce the final time in milliseconds. Finally, we placed a 10 minute timeout on queries.

Table 4. XPath Queries (XMark)
Q01 /site/regions/africa
Q02 /site/people/person[@id = 'person0']
Q03 //regions/africa//item/name
Q04 //person[profile/@income]/name
Q05 //people/person[profile/gender][profile/age]/name
Q06 /site/keyword/ancestor::listitem/text/keyword
Q07 /site/closed_auctions/closed_auction//keyword
Q08 /site/closed_auctions/closed_auction[./descendant::keyword]/date
Q09 /site/closed_auctions/closed_auction/annotation/description/text/keyword
Q10 /site/closed_auctions/closed_auction[annotation/description/text/keyword]/date
5.1 Comparison Tests with Lab-Based Systems
In this section, we will evaluate the performance of a traditional node based approach to XPath: Grust07 [6], and the partitioning approach most similar to ours: Luoma07 [11]. For Grust07, we built the suggested partitioned B-trees: Node(level,pre), Node(type,name,pre) and Node(type,name,level,pre). Additionally, we built indexes on size, name, level, value and type. For Luoma07, we used partitioning factors 20, 40, 60, 80, and 100. As suggested in this work, Node(pre) is a primary key, Node(part) is a foreign key reference to the primary key Part(part), and indexes were built on Node(post), Node(name), Node(part), Part(pre), and Part(post). Our overall finding for both approaches is that they do not scale well even for relatively small XML documents. As such, we had to evaluate these approaches using a relatively small dataset. Later in this section, we evaluate large XML datasets across vendor systems. For the following experiments, we generated an XMark dataset of just 115 MB in size and tested both approaches against queries from the XPathMark [13] benchmark and Grust07 (Table 4).
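The text names only the key columns of the supporting indexes; expressed as DDL, the setup would look roughly as follows. The index names, and the existence of a Part table with these columns, are our reading of the description rather than DDL from the paper, and the node/part tables here are the standard pre/post/size encodings used by the comparison systems, not the NODE index of §4 (each approach was of course set up in its own database).

```sql
-- Grust07: partitioned B-trees plus single-column indexes, as listed above.
CREATE INDEX node_level_pre     ON node (level, pre);
CREATE INDEX node_type_name_pre ON node (type, name, pre);
CREATE INDEX node_tnl_pre       ON node (type, name, level, pre);
CREATE INDEX node_size  ON node (size);
CREATE INDEX node_name  ON node (name);
CREATE INDEX node_level ON node (level);
CREATE INDEX node_value ON node (value);
CREATE INDEX node_type  ON node (type);

-- Luoma07: partition table, keys and supporting indexes, as listed above.
ALTER TABLE node ADD CONSTRAINT node_pk PRIMARY KEY (pre);
ALTER TABLE node ADD CONSTRAINT node_part_fk
      FOREIGN KEY (part) REFERENCES part (part);
CREATE INDEX node_post ON node (post);
CREATE INDEX node_part ON node (part);
CREATE INDEX part_pre  ON part (pre);
CREATE INDEX part_post ON part (post);
```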
Fig. 6. Query Response Times for Luoma07, Grust07 and BranchIndex (execution time in ms for queries Q1-Q10; Luoma07 shown for partitioning factors 20, 40, 60, 80 and 100)
In Fig. 6, the query response time for each of these queries is shown. These results show the following:

– Grust07 timed out on all but Q01, Q02, Q07.
– In Luoma07, a partitioning factor of 100 returned results for the greatest number of queries: Q01, Q02, Q04, Q05, Q07, Q08, Q09, Q10. Q07 shows an increase in processing times as the partitioning factor increased, whereas Q09 showed a decrease. The remaining queries do not provide such a pattern.
– Luoma07 returned results for a greater number of queries than Grust07 across all partitioning factors.
– The BranchIndex is orders of magnitude faster across all queries.

Queries Q01 and Q02 have high selectivity as they return a single result node. Also, the first two steps in Q7, i.e. /site and /closed_auctions, both access a single node. We attribute the fact that Grust07 returned results for queries Q01, Q02 and Q07 to the high selectivity of these queries. As the second bullet point indicates that there is no consistent pattern between the incrementing partitioning factors, we suggest that a single partitioning factor per dataset is not ideal. Luoma07 provides better results than Grust07, both in terms of the query response times and the number of queries that returned a result within 10 minutes. However, the exhaustive experimentation required to identify suitable partition factors is infeasible. Neither approach scales well for queries that have low selectivity: even for relatively small XML datasets, e.g. 115 MB, the query response times are relatively large.

5.2 Comparison Tests with Vendor Systems
In this section, we will evaluate the branch index against a leading commercial XML database solution (Microsoft SQL Server 2008) and a leading open source XML database (MonetDB/XQuery) [2] using the XPathMark [13] benchmark.

The XPathMark Benchmark. The standard XPath benchmark (XPathMark [13]) consists of a number of categories of queries across the synthetic XMark dataset.
Fig. 7. XMark Query Response Times (execution time in ms for queries Q1-Q10, comparing BranchIndex, MonetDB/XQuery and SQL Server)
In this paper, we are examining the performance of the ancestor, ancestor-or-self, descendant, descendant-or-self, parent and child axes. The queries in Table 4 were chosen for this purpose. Fig. 7 shows the following:

– SQL Server threw an exception on Q6 as it contains the ancestor axis.
– Q1 and Q2 have high selectivity (discussed earlier), thus all three systems took a small amount of time to return the result.
– In queries Q3, Q4, Q6, Q7, and Q9 the BranchIndex shows orders of magnitude improvements over the times returned by SQL Server and MonetDB/XQuery.
– In query Q5, the branch index is almost twice as efficient as MonetDB/XQuery and three times as efficient as SQL Server.
– In Q8 and Q10, the BranchIndex and SQL Server returned similar times, and MonetDB/XQuery took twice as long.

The branch index is the preferred option across all queries except Q2, in which case the time difference is negligible. SQL Server performs well across queries that have multiple parent-child edges, e.g. Q8, Q9 and Q10, which we attribute to the secondary PATH index we built. In contrast, SQL Server performs very poorly in Q3, which has an ancestor-descendant join on the third step. MonetDB/XQuery is quite consistent across all queries, i.e. taking around 10-11 seconds across all low selectivity queries. However, it performs particularly poorly in Q6, which could indicate that it does not evaluate the ancestor axis efficiently.
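The exact DDL used for SQL Server is not given in the paper; for reference, a primary XML index plus a secondary PATH index of the kind referred to above would typically be created along these lines. The table and column names are our placeholders.

```sql
-- Assumed table holding the XMark document; a clustered primary key on the
-- table is required before XML indexes can be created in SQL Server 2008.
CREATE TABLE xmark_docs (
    id  INT IDENTITY PRIMARY KEY,
    doc XML NOT NULL
);

CREATE PRIMARY XML INDEX pxi_xmark
    ON xmark_docs (doc);

CREATE XML INDEX sxi_xmark_path
    ON xmark_docs (doc)
    USING XML INDEX pxi_xmark
    FOR PATH;
```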
6 Conclusions
In this paper, we presented a partitioning approach for XML documents. These partitions are used to create an index that optimizes XPath’s hierarchical axes. Our approach differs from the only major effort in this area in that we do not need to analyze the document in advance to determine efficient partition sizes. Instead, our algorithms are dynamic, thus they create partitions based on document characteristics, e.g. structure and node layout. This provides for a fully automated process for creating the partition index. We obtain further optimization by compacting the partition index using a classification process. As each
identical partition will generate identical results in query processing, we need only a representative partition (a branch class) for all partitions of equivalent structure. We then demonstrated the overall optimization gains through experimentation. Our current work focuses on evaluating non-hierarchical XPath axes, e.g. following, preceding, and on using real-world datasets (sensor-based XML output) to test different XML document formats and to utilize real world queries to understand the broader impact of our work.
References
1. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Technol. 1(1), 110–141 (2001)
2. Boncz, P., Grust, T., van Keulen, M., Manegold, S., Rittinger, J., Teubner, J.: MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In: SIGMOD 2006: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pp. 479–490. ACM Press, New York (2006)
3. Marks, G., Roantree, M., Murphy, J.: Classification of Index Partitions. Technical report, Dublin City University (2010), http://www.computing.dcu.ie/~isg/publications/ISG-10-03.pdf
4. Goldman, R., Widom, J.: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In: Jarke, M., Carey, M.J., Dittrich, K.R., Lochovsky, F.H., Loucopoulos, P., Jeusfeld, M.A. (eds.) Proceedings of 23rd International Conference on Very Large Data Bases, VLDB 1997, Athens, Greece, August 25-29, pp. 436–445. Morgan Kaufmann, San Francisco (1997)
5. Grust, T.: Accelerating XPath Location Steps. In: SIGMOD 2002: Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pp. 109–120. ACM, New York (2002)
6. Grust, T., Rittinger, J., Teubner, J.: Why off-the-shelf RDBMSs are better at XPath than you might expect. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 949–958. ACM Press, New York (2007)
7. Grust, T., van Keulen, M., Teubner, J.: Staircase Join: Teach a Relational DBMS to Watch Its (axis) Steps. In: VLDB 2003: Proceedings of the 29th international conference on Very large data bases, pp. 524–535. VLDB Endowment (2003)
8. Lu, J., Chen, T., Ling, T.W.: Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach. In: CIKM 2004: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 533–542. ACM, New York (2004)
9. Lu, J., Chen, T., Ling, T.W.: TJFast: Effective Processing of XML Twig Pattern Matching. In: WWW 2005: Special interest tracks and posters of the 14th international conference on World Wide Web, pp. 1118–1119. ACM, New York (2005)
10. Luoma, O.: Xeek: An efficient method for supporting XPath evaluation with relational databases. In: ADBIS Research Communications (2006)
11. Luoma, O.: Efficient Queries on XML Data through Partitioning. In: WEBIST (Selected Papers), pp. 98–108 (2007)
12. O'Neil, P., O'Neil, E., Pal, S., Cseri, I., Schaller, G., Westbury, N.: ORDPATHs: Insert-Friendly XML Node Labels. In: SIGMOD 2004: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pp. 903–908. ACM, New York (2004)
13. XPathMark Benchmark. Online Resource, http://sole.dimi.uniud.it/~massimo.franceschet/xpathmark/
Specifying Aggregation Functions in Multidimensional Models with OCL

Jordi Cabot1, Jose-Norberto Mazón2, Jesús Pardillo2, and Juan Trujillo2

1 INRIA - École des Mines de Nantes (France)
jordi.cabot@inria.fr
2 Universidad de Alicante (Spain)
{jnmazon,jesuspv,jtrujillo}@dlsi.ua.es
Abstract. Multidimensional models are at the core of data warehouse systems, since they allow decision makers to early define the relevant information and queries that are required to satisfy their information needs. The use of aggregation functions is a cornerstone in the definition of these multidimensional queries. However, current proposals for multidimensional modeling lack the mechanisms to define aggregation functions at the conceptual level: multidimensional queries can only be defined once the rest of the system has already been implemented, which requires much effort and expertise. In this sense, the goal of this paper is to extend the Object Constraint Language (OCL) with a predefined set of aggregation functions. Our extension facilitates the definition of platform-independent queries as part of the specification of the conceptual multidimensional model of the data warehouse. These queries are automatically implemented with the rest of the data warehouse during the code-generation phase. The OCL extensions proposed in this paper have been validated by using the USE tool.
1 Introduction
Data warehouse systems support decision makers in analyzing large amounts of data integrated from heterogeneous sources into a multidimensional model. Several authors [1,2,3,4] and benchmarks for decision support systems (e.g., TPC-H or TPC-DS [5]) have highlighted the great importance of aggregation functions during this analysis to compute and return a unique summarized value that represents all the set, such as sum, average or variance. Although it is widely accepted that multidimensional structures should be represented in an implementation-independent conceptual model in order to reflect real-world situations as accurately as possible [6], multidimensional queries that satisfy information needs of decision makers are not currently expressed at the conceptual level but only after the rest of the data warehouse system has been developed. Therefore, the definition of these queries is implementation-dependent, which requires a lot of effort and expertise in the target implementation platform. The main drawback of this traditional way of proceeding is that it prevents designers from properly validating whether the conceptual schema meets the requirements of decision makers before the final implementation. Therefore, if any change is
found out after the implementation, designers must start the whole process from the early stages, thereby dramatically increasing the overall cost of data warehouse projects. As stated by Olivé [7], this main drawback comes from the little importance given to the informative function of the information system, that is, to the definition of queries at the conceptual level that must be provided to the users in order to satisfy their information needs. To overcome this drawback in the data warehouse scenario, multidimensional queries must be defined at the conceptual level. The main restriction for defining multidimensional queries at the conceptual level is the rather limited support offered by current conceptual modeling languages [8,9,10,11], which exhibit a lack of rich constructs for the specification of aggregation functions. So far, researchers have focused on using a small subset of them, namely sum, max, min, avg and count [12] (and most modeling languages do not even cover all these basic ones). However, data warehouse systems require aggregation functions for a richer data analysis [6,4]. Therefore, we believe that it is highly important to be able to provide a wide set of aggregation functions as predefined constructs offered by the modeling language used in the specification of the data warehouse so that the definition of multidimensional queries can be carried out at the conceptual level. This way, designers can define and validate them regardless of the final technology platform chosen to implement the data warehouse. To this aim, in this paper, the standard Object Constraint Language (OCL [13]) library is extended with a new set of aggregation functions in order to facilitate the specification of multidimensional queries as part of the definition of UML conceptual schemas. In our work, we will use the operations in combination with our UML profile for multidimensional modeling [14]. Nevertheless, our OCL extension is independent of the UML profile and could be used in the definition of any standard UML model. Our new OCL operations have been tested and implemented in the USE tool [15] in order to ensure their well-formedness and to validate them on sample data from our running example (see Sect. 2). Our work is aligned with current Model-Driven Development (MDD) approaches, such as those of [16,17], where the implementation of the system is supposed to be (semi)automatically generated from its high-level models. The definition of all multidimensional queries at the conceptual level permits a more complete code-generation phase, including the automatic translation of these queries from their initial platform-independent definition to the final (platform-dependent) implementation, as we describe later in the paper. Therefore, code can be easily generated for implementing multidimensional queries in several languages, such as MDX or SQL. The remainder of this paper is structured as follows: a motivating example is presented in the next section to illustrate the benefits of our proposal throughout the paper. Our OCL extension to model this kind of queries at the conceptual level is presented in Sect. 3, while its validation is carried out in Sect. 4. Sect. 5 defines how to automatically implement it. Finally, Sect. 6 comments on the related work and Sect. 7 presents the main conclusions and sketches future work.
2 Motivating Example
To motivate the importance of our approach and illustrate its benefits, consider the following example, which is inspired by one of the scenarios described in [18]: an airline's marketing department wants to analyze the flight activity of each member of its frequent flyer program. The department is interested in seeing what flights the company's frequent flyers take, which planes they travel with, what fare basis they pay, how often they upgrade, and how they earn their frequent flyer miles¹. A possible conceptual model for this example is shown in Fig. 1 as a class diagram annotated and displayed using the multidimensional UML profile presented in [14]. The figure represents a multidimensional model of the flight legs taken by frequent flyers in the FrequentFlyerLegs Fact class. This class contains several FactAttribute properties: Fare, Miles and MinutesLate. These properties are measures that can be analyzed according to several aspects such as the origin and destination airport (Dimension class Airport), the Customer, FareClass, Flight and Date (these two last Dimension classes are not detailed in the diagram).

¹ Note that, in this case study, the interest is in actual flight activity, but not in reservation or ticketing activity.
Fig. 1. Conceptual multidimensional model for our frequent flyer scenario
Given this conceptual multidimensional model, decision makers can request a set of queries to retrieve useful information from the system. For instance, they are probably interested in knowing the miles earned by a frequent flyer in his/her trips from a given airport (e.g., airports located in Denver) in a given fare class. Many other multidimensional queries can be similarly defined. These kinds of queries are usually of particular interest for the decision makers because they (i) aggregate the data (e.g., the earned miles in the previous example) and (ii) summarize values by means of different aggregation functions. For example, it is likely that decision makers will be interested in knowing the total number of miles earned by the frequent flyer, a ranking of frequent flyers per number of miles earned, the average number of earned miles, several percentiles on the number of miles and so forth. Interestingly, these multidimensional queries are related to several concepts [19]:

– Phenomenon of interest, which is the measure or set of measures to analyze (FactAttribute properties in Fig. 1). Miles are the phenomenon of interest in the previously defined query.
– Category attributes, which are the context for analyzing the phenomenon of interest (Dimension and Base classes in Fig. 1). E.g., FareClass and Airport are category attributes.
– Aggregation sets, which are subsets of the phenomenon of interest according to several category attributes. In our sample query, the aggregation set only contains miles obtained by frequent flyers that depart from Denver.
– Aggregation functions, which are predefined operators that can be applied on the aggregation sets to summarize or analyze their factual data. E.g., the sum, avg or percentile operators mentioned above.

The first two aspects (i.e., the definition of the category attributes and the phenomenon of interest) can be easily modeled in UML (as we have already accomplished in Fig. 1). Furthermore, a method for defining aggregation sets in OCL has been proposed in [16]. With regard to aggregation functions, so far, researchers and practitioners have focused on using a small subset of them, namely sum, max, min, avg and count [12]. Moreover, query-intensive applications, such as data warehouses or OLAP systems, require other kinds of statistical functions for a richer data analysis (e.g., see [4]). However, support for statistical functions is very limited (e.g., OCL does not even support all of the basic aggregation functions), which hinders designers wanting to directly implement the kind of queries presented above and prevents them from easily satisfying the user requirements. Therefore, we believe that it is highly important to be able to provide all kinds of aggregation functions as predefined constructs offered by the modeling language (UML and OCL in our case) so that the definition of multidimensional queries can be carried out at the conceptual level in order to define and validate them regardless of the final technology platform chosen to implement the data warehouse. In the rest of the paper, we propose an extension for the OCL language to solve this issue.
3 Extending OCL with Aggregation Functions
Conceptual modeling languages based on visual formalisms are commonly managed together with textual formalisms, since some model elements are not easily or properly mapped into the graphical constructs provided by the modeling language [20]. For UML schemas, OCL [13] is typically used for this purpose. The goal of this section is to extend the OCL with a new set of predefined aggregation functions to facilitate the definition of multidimensional queries on UML schemas.
The set of core aggregation functions included in our study are those among the most used in data analysis [4]. To simplify their presentation, we classify these functions in three different groups, following [21,3]:

– Distributive functions, which can be defined by structural recursion, i.e., the input collection can be partitioned into subcollections that can be individually aggregated and combined.
– Algebraic functions, which are expressed as finite algebraic expressions over distributive functions, e.g., average is computed using count and sum.
– Holistic functions, which are all other functions that are neither distributive nor algebraic.

These functions can be combined to provide many other advanced operators. An example of such an operator is top(x), which uses the rank operation to return a subset of the x highest values within a collection.

3.1 Preliminary OCL Concepts
OCL is a rich language that offers predefined mechanisms for retrieving the values of the attributes of an object, for navigating through a set of related objects, for iterating through collections of objects (e.g., by means of the forAll, exists and select iterators) and so forth. As part of the language, a standard library including a predefined set of types and a list of predefined operations that can be applied on those types is also provided. The types can be primitive (Integer, Real, Boolean and String) or collection types (Set, Bag, OrderedSet and Sequence). Some examples of operations provided for those types are: and, or, not (Boolean), +, -, *, >, < (Real and Integer), union, size, includes, count and sum (Set). All these constructs can be used in the definition of OCL constraints, derivation rules, queries and pre/post-conditions. In particular, the definition of queries follows the template:

context Class::Q(p1:T1, ..., pn:Tn): Tresult
body: Query-ocl-expression

where the query Q returns the result of evaluating the Query-ocl-expression by using the arguments passed as parameters in its invocation on an object of the context type Class. Apart from the parameters p1...pn, in Query-ocl-expression designers may use the implicit parameter self (of type Class) representing the object on which the operation has been invoked. As an example, the previous query total miles earned by a frequent flyer in his/her trips from Denver in a given fare can be defined as follows:

context Customer::sumMiles(FareClass fc)
body: self.frequentFlyerLegs->select(f | f.fareClass = fc and
      f.origin.city.name = 'Denver')->sum()
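For contrast, the same business question posed directly against one possible relational implementation of Fig. 1 could look as follows. The table and column names (frequent_flyer_legs, origin_airport_id, and so on) are our assumptions rather than part of the paper; the point is simply that this formulation already depends on the physical schema rather than on the conceptual model.

```sql
-- Hypothetical platform-specific version of sumMiles for one customer and
-- one fare class; all schema names are illustrative assumptions.
SELECT SUM(l.miles) AS total_miles
FROM   frequent_flyer_legs l
       JOIN airport a ON a.airport_id = l.origin_airport_id
       JOIN city    c ON c.city_id    = a.city_id
WHERE  l.customer_id   = :customer_id
AND    l.fare_class_id = :fare_class_id
AND    c.name          = 'Denver';
```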
Unfortunately, many other interesting queries cannot be similarly defined since the operators required to define such queries are not part of the standard library (e.g. the average number of miles earned by a customer in each flight leg, since the average operation is not defined in OCL). In the next section, we present our extension to the OCL standard library to include them as predefined operators available to all users of this language.

3.2 Extending the OCL Standard Library
Multidimensional queries cannot be easily defined in OCL since the aggregation functions required to specify them are not part of the standard library and thus, they must be manually defined by the designer every time they are needed, which is an error-prone and time-consuming activity (due to the complexity of some aggregation functions). To solve this problem, we propose in this section an extension to the OCL Standard Library by predefining a list of new aggregation functions that can be reused by designers in the definition of their OCL expressions. The new operations are formally defined in OCL by specifying their operation contract, exactly in the same style that existing operations in the library are defined in the OCL official specification document. Our extension does not change the OCL metamodel and thus, it does not risk the standard level of UML/OCL models using it. In fact, our operations could be regarded as new user-defined operations, a possibility which is supported by most current OCL tools. Therefore, our extension could be easily integrated in those tools. Each operation is attached to the most appropriate (primitive or collection) type. As usual, functions defined on a supertype can be applied on instances of the subtypes. For each operation we indicate the context type, the signature and the postcondition that defines the result computed by it. When required, preconditions restricting the operation application are also provided. Note that some aggregation functions may have several slightly different alternative definitions in the literature. Due to space limitations we stick to just one of them. These functions can be called within OCL expressions in the same way as any other standard OCL operation. See an example in Sect. 3.3.

Distributive Functions

– MAX: Returns the element in a non-empty collection of objects of type T with the highest value. T must support the >= operation. If several elements share the highest value, one of them is randomly selected.

context Collection::max():T
pre: self->notEmpty()
post: result = self->any(e | self->forAll(e2 | e >= e2))
– MIN: Returns the element with the lowest value in the collection of objects of type T. T must support the <= operation. If several elements share the lowest value, one of them is randomly selected.

context Collection::min():T
pre: self->notEmpty()
post: result = self->any(e | self->forAll(e2 | e <= e2))
– SUM: Returns the sum value of the elements in the collection. Already part of the OCL Standard Library, and thus, we do not need to redefine it.

– COUNT: Returns the number of elements in a collection. Equivalent to the existing OCL size operation.

– COUNT DISTINCT: Returns the number of different elements in a collection. To implement this operation we convert the collection to a set (to remove repeated elements) and apply the OCL size operation to the resulting set.

context Collection::countDistinct(): Integer
post: result = self->asSet()->size()
Algebraic Functions

– AVG: Returns the arithmetic average value of the elements in the non-empty collection. The type of the elements in the collection must support the + and / operations.

context Collection::avg():Real
pre: self->notEmpty()
post: result = self->sum() / self->size()
– VARIANCE: Returns the variance of the elements in the collection. The type of the elements in the collection must support the +, -, * and / operations. The function accumulates the deviation of each element regarding the average collection value (this is computed by using the iterate operator: for each element e in the collection, the acc variable is incremented with the square result of subtracting the average value from e). Note that this function uses the previously defined avg function.

context Collection::variance():Real
pre: self->notEmpty()
post: result = (1/(self->size()-1)) * self->iterate(e; acc:Real = 0 |
      acc + (e - self->avg()) * (e - self->avg()))
– STDDEV: Returns the standard deviation of the elements in the collection.

context Collection::stddev():Real
pre: self->notEmpty()
post: result = self->variance().sqrt()
– COVARIANCE: Returns the covariance value between two ordered sets (or sequences). We present the version for OrderedSets. The version for the Sequence type is exactly the same, only the context type changes. The standard at operation returns the element at a given position in the ordered set. As guaranteed by the operation precondition, both input collections must have the same number of elements.

context OrderedSet::covariance(Y: OrderedSet):Real
pre: self->size() = Y->size() and self->notEmpty()
post: let avgY:Real = Y->avg() in
      let avgSelf:Real = self->avg() in
      result = (1/self->size()) * self->iterate(e; acc:Real=0 |
      acc + ((e - avgSelf) * (Y->at(self->indexOf(e)) - avgY)))
Holistic Functions

– MODE: Returns the most frequent value in a collection.

context Collection::mode(): T
pre: self->notEmpty()
post: result = self->any(e | self->forAll(e2 | self->count(e) >= self->count(e2)))
– DESCENDING RANK: Returns the position (i.e., ranking) of an element within a Collection. We assume that the order is given by the >= relation among the elements (the type T of the elements in the collection must support this operator). The input element must be part of the collection. Repeated values are assigned the same rank value. Subsequent elements have a rank increased by the number of elements in the upper level. As mentioned above, this is just one of the possible existing interpretations for the rank function. Others would be similarly defined.

context Collection::rankDescending(e: T): Integer
pre: self->includes(e)
post: result = self->size() - self->select(e2 | e >= e2)->size() + 1
– ASCENDING RANK: Inverse of the previous one. The order is now given by the <= relation.

context Collection::rankAscending(e: T): Integer
pre: self->includes(e)
post: result = self->size() - self->select(e2 | e <= e2)->size() + 1
– PERCENTILE: Returns the value of the percentile p, i.e., the value below which a certain percent p of elements fall.

context Collection::percentile(p: Integer): T
pre: p >= 0 and p <= 100 and self->notEmpty()
post: let n: Real = (self->size()-1) * p / 100 + 1 in
      let k: Integer = n.floor() in
      let d: Real = n - k in
      let s: Sequence(Integer) = self->sortedBy(e | e) in
      if k = 0 then s->first() * 1.0
      else if k = s->size() then s->last() * 1.0
      else s->at(k) + d * (s->at(k+1) - s->at(k))
      endif endif
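As a worked check of this definition (using the Business collection {61, 61, 61, 1634, 1634, 1906} from Table 1 in Sect. 4), percentile(25) gives n = (6-1)*25/100 + 1 = 2.25, so k = 2 and d = 0.25; the sorted sequence has s->at(2) = 61 and s->at(3) = 61, hence the result is 61 + 0.25*(61 - 61) = 61, which matches the value reported in Table 3.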
– MEDIAN: Returns the value separating the higher half of a collection from the lower half, i.e., the value of the percentile 50.

context Collection::median(): T
pre: self->notEmpty()
post: result = self->percentile(50)
3.3 Applying the Operations
As commented above, these operations can be used exactly in the same way as any other standard OCL function. As an example, we show the use of our avg function to compute the average number of miles earned by a customer in each flight leg.

context Customer::avgMilesPerFlightLeg():Real
body: self->frequentFlyerLegs.Miles->avg()
4 Validation
Our OCL extension has been validated by using the UML Specification Environment (USE) tool [15]. As a first step, we have implemented our aggregation operations as new user-defined functions in USE. Thanks to the syntactic analysis performed by USE, the syntactic correctness of our functions has been proved in this step. Additionally, in order to also prove that our functions behave as expected (i.e. to check that they are also semantically correct), we have evaluated them over sample scenarios and evaluated the correctness of the results (i.e., we have compared the result returned by USE when executing queries including our operations with the expected result as computed by ourselves). Fig. 2 shows more details of the process. In the background of the USE environment we can see the implementation of the multidimensional conceptual schema of Fig. 1 in USE (left-hand side) and the script that loads the data provided in [18] (objects and links, which have been obtained by using the operations described in [16]) into the corresponding classes and associations (right-hand side). In the foreground we show one of the queries we have used to test our functions (in this case the query is used to check our avg function) together with the resulting collection of data returned by the query. Interested readers can download² the scripts and data of our running example together with the definition of our library of aggregation functions. It is worth noting that during the validation process we have overcome some limitations of the USE tool, since it provides neither the indexOf nor the Cartesian product functions. Therefore, functions that make use of these OCL operators needed to be slightly redefined for their implementation in USE, e.g., the covariance function. To create the queries to test our operations we have used as a base query the query defined in Sect. 2 (miles earned by a frequent flyer in his/her trips from Denver according to their fare). Test queries have been created by applying a different aggregation function on this base query every time.

² http://www.lucentia.es/index.php/OCL_Statistics_Library
Fig. 2. Conceptual querying of frequent flyer legs implemented in USE

Table 1. Collections of miles by fare class when the departure's city is Denver
City    FareClass  Miles
Denver  Economy    ∅
Denver  Business   {61, 61, 61, 1634, 1634, 1906}
Denver  First      {977, 977, 1385}
Denver  Discount   {992, 1432}

Table 2. Results for distributive and algebraic statistical functions
miles     sum   max   min  avg       var          stddev    covar.
Economy   0     N/A   N/A  N/A       N/A          N/A       N/A
Business  5357  1906  61   892,8333  840200,5667  916,6246  248379,4444
First     2406  977   452  802       91875        303,1089  20650
Discount  2424  1432  992  1212      96800        311,1270  9240
The results returned by the base query are shown in Table 1. Then, Tables 2 and 3 show the results of applying our aggregation functions³ on the collections of values of Table 1. The results returned by our functions were the ones expected (according to the underlying data) in all cases.

³ Fare per frequent flyer is used as an additional collection to compute the covariance.
Table 3. Results for holistic statistical functions
miles     mode  perc.(25)  median
Economy   N/A   N/A        N/A
Business  61    61         847,5
First     977   714,5      977
Discount  992   1102       1212
mode perc.(25) median N/A N/A N/A 61 61 847,5 977 714,5 977 992 1102 1212
Automatic Code Generation
This section shows how our "enriched" schema can be used in the context of an MDD process. In fact, conceptual schemas containing queries defined using our aggregation functions can be directly implemented in any final technology platform by using exactly the same existing MDD methods and tools able to generate code from UML/OCL schemas. These methods do not need to be extended to cope with our aggregation functions. Automatic code generation is possible thanks to the fact that (i) our library is defined at the model level and thus it is technologically independent, and (ii) aggregation functions are specified in terms of standard OCL operations. More specifically, given a query operation q including an OCL aggregation operation s, q can be directly implemented in a technology platform p (for instance a relational database or an object-oriented Java program) if p offers native support for s. In that case, we just need to replace the call to s with the call to the corresponding operation in p as part of the usual translation process followed to generate the code for implementing OCL queries in that platform. Otherwise, i.e., if p does not support s, we need to first unfold s in q by replacing the call to s with the body condition of s. After the unfolding, q only contains standard OCL functions and therefore can be implemented in p as explained in the former case. As an example we show in Fig. 3 the implementation of the query average miles per flight leg specified in OCL in Sect. 3.3. Fig. 3 (a) shows the implementation for a relational database, while Fig. 3 (b) shows it for a Java program. In the database implementation, queries could be translated as views. The relational tables (for the classes and associations in the conceptual schema) and the views for the query operations can be generated with the DresdenOCL tool [22] (among others). Since database management systems usually offer statistical packages for all of our functions, the avg operation in the query is directly translated by calling the predefined SQL AVG function in the database (see Fig. 3 (a)). For the Java example, queries are translated as methods in the class owning the query. Java classes and methods can be generated from a UML/OCL specification using the same DresdenOCL tool or other OCL-to-Java tools (see a list in [23]). However, in this case we need to first unfold the definition of avg in the query since Java does not directly support aggregation operations. The new OCL query body becomes:
context Customer::avgMilesPerFlightLeg():Real
post: result = self->frequentFlyerLegs.Miles->sum() /
      self->frequentFlyerLegs.Miles->size()
This new body is the one passed over to the Java code-generation tool to obtain the corresponding Java method, as can be seen in Fig. 3 (b). All non-standard Java operations (e.g., sumMiles) are implemented by the OCL-to-Java tool itself during the translation (basically they traverse the AST of the OCL expression and generate a new auxiliary method for each node in the tree without an exact mapping to one of the predefined methods in the Java API). Obviously, different tools will generate different Java code excerpts.

Fig. 3. Code excerpts for an OCL query using the avg function
(a) DBMS code:
create view AvgMilesFlight as {
  select avg(l.miles)
  from customer c, frequentflyerlegs l
  where c.id = l.customer
}
(b) Java code: class Customer { int id; String name; Vector ...
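When the target platform lacks a native counterpart for one of the holistic functions, the same unfold-then-translate route applies. Purely as an illustration (not generated by the tools cited above, and reusing the table names of Fig. 3(a)), a median over the miles of one customer could be rendered in SQL with window functions as follows; note this uses the common midpoint-average definition of the median rather than the percentile-based one of Sect. 3.2.

```sql
-- Sketch of a median(Miles) query for one customer when the DBMS has no
-- native median; :customer_id is a placeholder parameter.
SELECT AVG(miles) AS median_miles
FROM (
    SELECT l.miles,
           ROW_NUMBER() OVER (ORDER BY l.miles) AS rn,
           COUNT(*)     OVER ()                 AS cnt
    FROM   frequentflyerlegs l
    WHERE  l.customer = :customer_id
) ranked
WHERE rn IN (FLOOR((cnt + 1) / 2.0), CEIL((cnt + 1) / 2.0));
```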
6 Related Work
Multidimensional modeling languages (and modeling languages in general) offer limited support for the definition of aggregation operations at the conceptual level. Early approaches [9,10,24] are only concerned with static aspects and lack mechanisms to properly model multidimensional query behavior. At most, these approaches suggest a limited set of predefined aggregation functions but without providing a formal definition. Recently, other approaches have been trying to use more expressive constructs to model aggregation functions at the conceptual level by extending the UML [8,14,11]. They all propose to use OCL to complete the multidimensional model with information about the applicable aggregation functions in order to define multidimensional queries in a proper manner. They also suggest that aggregation functions should be defined in the UML schema, but unfortunately, they do not provide any mechanisms to carry it out. Therefore, to overcome this drawback, we define in this paper how to extend OCL with new aggregation functions in order to query multidimensional schemas at the conceptual level. A subset of these functions was presented in a preliminary short paper [25].
7 Conclusions and Future Work
Aggregation functions should be part of the predefined constructs provided by existing languages for multidimensional modeling to allow designers to specify
queries at the conceptual level. However, due to the current lack of support in modeling languages, queries are not currently defined as part of the conceptual schema but added only after the schema has been implemented in the final platform. In this paper, we address this issue by providing an OCL extension that predefines a set of aggregation functions that facilitate the definition of platform-independent queries as part of the specification of the multidimensional conceptual schema of the data warehouse. These queries can then be animated and validated at design time and automatically implemented along with the rest of the system during the code-generation phase. Our short-term future work is to better integrate these aggregation functions with the OLAP operations already presented in [16] to provide a more complete definition of the conceptual schema. Furthermore, the definition of multidimensional queries at the conceptual level opens the door to the development of systematic techniques for the treatment of aggregation problems in data analysis at the conceptual level, as a way to evaluate the overall quality of the data warehouse at design time. Finally, we are also interested in developing mechanisms that help users to define their own ad-hoc OCL queries in a more intuitive manner.
Acknowledgements

Work supported by the projects: TIN2008-00444, ESPIA (TIN2007-67078) from the Spanish Ministry of Education and Science (MEC), QUASIMODO (PAC080157-0668) from the Castilla-La Mancha Ministry of Education and Science (Spain), and DEMETER (GVPRE/2008/063) from the Valencia Government (Spain). Jesús Pardillo is funded by MEC under FPU grant AP2006-00332.
References
1. Cabibbo, L.: A framework for the investigation of aggregate functions in database queries. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 383–397. Springer, Heidelberg (1999)
2. Lenz, H.J., Thalheim, B.: OLAP databases and aggregation functions. In: SSDBM, pp. 91–100. IEEE Computer Society, Los Alamitos (2001)
3. Lenz, H.-J., Thalheim, B.: OLAP schemata for correct applications. In: Draheim, D., Weber, G. (eds.) TEAA 2005. LNCS, vol. 3888, pp. 99–113. Springer, Heidelberg (2006)
4. Ross, R.B., Subrahmanian, V.S., Grant, J.: Aggregate operators in probabilistic databases. J. ACM 52(1), 54–101 (2005)
5. TPC: Transaction Processing Performance Council, http://www.tpc.org
6. Rizzi, S., Abelló, A., Lechtenbörger, J., Trujillo, J.: Research in data warehouse modeling and design: dead or alive? In: DOLAP, pp. 3–10 (2006)
7. Olivé, A.: Conceptual schema-centric development: A grand challenge for information systems research. In: Pastor, O., Falcão e Cunha, J. (eds.) CAiSE 2005. LNCS, vol. 3520, pp. 1–15. Springer, Heidelberg (2005)
8. Abelló, A., Samos, J., Saltor, F.: YAM2: a multidimensional conceptual model extending UML. Inf. Syst. 31(6), 541–567 (2006)
9. Golfarelli, M., Maio, D., Rizzi, S.: The dimensional fact model: A conceptual model for data warehouses. Int. J. Cooperative Inf. Syst. 7(2-3), 215–247 (1998)
10. Hüsemann, B., Lechtenbörger, J., Vossen, G.: Conceptual data warehouse modeling. In: DMDW, 6 (2000)
11. Prat, N., Akoka, J., Comyn-Wattiau, I.: A UML-based data warehouse design method. Decision Support Systems 42(3), 1449–1473 (2006)
12. Shoshani, A.: OLAP and statistical databases: Similarities and differences. In: PODS, pp. 185–196. ACM Press, New York (1997)
13. Object Management Group: UML 2.0 OCL Specification (2003)
14. Luján-Mora, S., Trujillo, J., Song, I.Y.: A UML profile for multidimensional modeling in data warehouses. Data Knowl. Eng. 59(3), 725–769 (2006)
15. Gogolla, M., Büttner, F., Richters, M.: USE: A UML-based specification environment for validating UML and OCL. Sci. Comput. Program. 69(1-3), 27–34 (2007)
16. Pardillo, J., Mazón, J.N., Trujillo, J.: Extending OCL for OLAP querying on conceptual multidimensional models of data warehouses. Information Sciences 180(5), 584–601 (2010)
17. Mazón, J.N., Trujillo, J.: An MDA approach for the development of data warehouses. Decis. Support Syst. 45(1), 41–58 (2008)
18. Kimball, R., Ross, M.: The Data Warehouse Toolkit. Wiley & Sons, Chichester (2002)
19. Rafanelli, M., Bezenchek, A., Tininini, L.: The aggregate data problem: A system for their definition and management. SIGMOD Record 25(4), 8–13 (1996)
20. Embley, D., Barry, D., Woodfield, S.: Object-Oriented Systems Analysis. A Model-Driven Approach. Yourdon Press Computing Series (1992)
21. Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., Pirahesh, H.: Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov. 1(1), 29–53 (1997)
22. Software Technology Group - Technische Universität Dresden: Dresden OCL toolkit, http://dresden-ocl.sourceforge.net/
23. Cabot, J., Teniente, E.: Constraint support in MDA tools: A survey. In: Rensink, A., Warmer, J. (eds.) ECMDA-FA 2006. LNCS, vol. 4066, pp. 256–267. Springer, Heidelberg (2006)
24. Sapia, C., Blaschka, M., Höfling, G., Dinter, B.: Extending the E/R Model for the Multidimensional Paradigm. In: ER Workshops, pp. 105–116 (1998)
25. Cabot, J., Mazón, J.-N., Pardillo, J., Trujillo, J.: Towards the conceptual specification of statistical functions with OCL. In: CAiSE Forum, pp. 7–12 (2009)
The CARD System

Faiz Currim1, Nicholas Neidig2, Alankar Kampoowale3, and Girish Mhatre4

1 Department of Management Sciences, Tippie College of Business, University of Iowa, Iowa City IA, USA
2 Software Engineer, Kansas City MO, USA
3 State Hygienic Lab, University of Iowa, Iowa City IA, USA
4 Department of Internal Medicine, Carver College of Medicine, University of Iowa, Iowa City IA, USA
{faiz-currim,alankar-kampoowale,girish-mhatre}@uiowa.edu, nneidig@netzero.net
Abstract. We describe a CASE tool (the CARD system) that allows users to represent and translate ER schemas, along with more advanced cardinality constraints (such as participation, co-occurrence and projection [1]). The CARD system supports previous research that proposes representing constraints at the conceptual design phase [1], and builds upon work presenting a framework for establishing completeness of cardinality and the associated SQL translation [2]. From a teaching perspective, instructors can choose to focus student efforts on data modeling and design, and leave the time-consuming and error-prone aspect of SQL script generation to the CARD system. Graduate-level classes can take advantage of support for more advanced constraints.

Keywords: conceptual design, schema import, relational translation, CASE tool, cardinality, triggers.
1 Introduction

Cardinality constraints have been a useful and integral part of conceptual database diagrams since the original entity-relationship (ER) model proposed by Chen [3]. A variety of papers since then have examined cardinality constraints in more detail, and many frameworks and taxonomies have been proposed to comprehensively organize the types of cardinality constraints [4, 5]. Cardinality captures the semantics of real-world business rules, and is needed for subsequent translation into the logical design and implementation in a database. In previous research we suggested it was useful to explicitly model constraints at the conceptual stage to permit automation in translation into logical design [1]. We also proposed an approach for establishing completeness of cardinality constraint classifications [2]. This allowed us to come up with a well-defined SQL mapping for each constraint type. The CARD (Constraint Automated Representation for Database systems) project seeks to use knowledge of the SQL mapping to automatically generate database trigger code. Our software aims to go beyond freely available CASE tools by also generating triggers that manage the more complicated constraints. This
automated approach to code generation has a two-fold advantage of improving both database integrity and programmer productivity. In the next section we describe the architecture of the CARD system.
2 System Architecture

The architecture of the CARD system is summarized in Fig. 1. The user interacts with the system via the web front-end, which provides options to both import as well as manually create schemas. Changes to the schema may also be made through the web interface. For example, the user may wish to change an attribute name, or add constraints that are not typically shown on an ER diagram. The SQL generator module takes an existing schema and translates it into tables and associated triggers (for constraints). The purpose of the data access layer is to provide standardized access to the schema repository by the different modules. This also allows for future additions to client software while re-using functionality. An in-depth description of the complete CARD system would be beyond the scope of this demo proposal, and we focus on the schema import and SQL generation aspects.
Fig. 1. System Architecture
2.1 The Visio Import Layer The current prototype allows the import of ER schemas developed in MS-Visio. We provide a stencil of standard shapes (downloadable from our website) for entity classes, relationships and attributes based on an extended ER grammar [6]. In the future, we would like to support UML as well. Some of the basic stencil shapes are shown in Fig. 2 and a screen capture of the import process is in Fig. 3. Next, we briefly explain how the Visio parser functions. Our Visio import layer uses JDOM to parse the Visio XML drawings (.vdx files) in two stages. First, each master shape and its correspondence with ER modeling constructs is identified.
Fig. 2. Sample Visio stencil shapes (entity class, weak entity class, relationship, attribute, identifier, multi-valued attribute)
Fig. 3. Importing a Visio XML Drawing
Then, we map the items in the drawing to corresponding master types (e.g., entity classes or attributes). After parsing diagram objects (e.g., entity classes) into data structures, we analyze which shapes are connected to one another to determine relationships and attributes. Further processing transforms the parsed objects into XML elements that associate different roles played by the components of the specific database schema. For example, we track which class in an inclusion relationship serves as the superclass and which others serve as subclasses. We use a variety of heuristics to determine the root of a superclass-subclass hierarchy, as well as the root of a composite-attribute tree. To allow for complex diagrams and eliminate ambiguity, we introduce a few syntactic requirements over standard ER modeling grammars, such as having a directed arrow from a weak entity class to its identifying relationship. The Visio parser also does some structural validation of the diagram. For example, we test whether lines are connected at both ends to a valid shape (e.g., a relationship should not be connected to another relationship, but only to one or more entity classes). In order to facilitate easy SQL generation, we check names given to the objects on the diagram (and raise warnings if special characters are used). The output of our Visio parser is an XML document that contains information about the different schema constructs. This is passed on to the XML Importer.
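To make the two-stage parse concrete, the following minimal Java sketch (using JDOM 2) shows how a .vdx drawing could be read: first the master shapes are collected, then each drawing shape is mapped to its master type. It is an illustration only, not the actual CARD parser; the element and attribute names (VisioDocument namespace, Masters/Master, Pages/Page/Shapes/Shape, ID, NameU) are those of the Visio 2003 XML schema and should be verified against the files produced by our stencil.

import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.Namespace;
import org.jdom2.input.SAXBuilder;
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class VdxSketch {
    // Visio 2003 XML (.vdx) core namespace
    private static final Namespace VISIO =
            Namespace.getNamespace("http://schemas.microsoft.com/visio/2003/core");

    public static void main(String[] args) throws Exception {
        Document doc = new SAXBuilder().build(new File("schema.vdx"));
        Element root = doc.getRootElement();

        // Stage 1: master shape ID -> master name (e.g. "ENTITY CLASS", "Relationship")
        Map<String, String> masters = new HashMap<>();
        Element mastersEl = root.getChild("Masters", VISIO);
        if (mastersEl != null) {
            for (Element master : mastersEl.getChildren("Master", VISIO)) {
                masters.put(master.getAttributeValue("ID"), master.getAttributeValue("NameU"));
            }
        }

        // Stage 2: map every shape in the drawing to its master type
        for (Element page : root.getChild("Pages", VISIO).getChildren("Page", VISIO)) {
            Element shapesEl = page.getChild("Shapes", VISIO);
            if (shapesEl == null) continue;
            for (Element shape : shapesEl.getChildren("Shape", VISIO)) {
                String type = masters.get(shape.getAttributeValue("Master"));
                System.out.println("Shape " + shape.getAttributeValue("ID") + " -> " + type);
            }
        }
    }
}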
2.2 The XML Importer The XML import layer of the CARD project is responsible for parsing an XML document representing a user’s ER schema. While a user is allowed to directly edit and provide such a document for import to the CARD system, in most cases we assume the schema is generated by our Visio parser. We provide an XML Schema document which we use to check the validity of the incoming XML document. Next, we evaluate the document for ER structural validity (e.g., strong entity classes should have an identifier) and satisfaction of constraints limiting the types and numbers of individual constructs in a relationship and the cardinalities allowed. If a document fails to adhere to a constraint, the user is notified with a message. The process of updating the schema repository goes hand in hand with the parsing. If at any point the document is found to be invalid, the transaction updating the repository is aborted. The XML layer can also act as a canonical standard for the specification of data that an ER diagram is designed to store. This feature allows for future development of additional user-defined input file formats. The only requirement is to create a parser that converts from the desired representation format to the structure enforced by our XML Schema document. The Visio layer of the project is an example of one such method of conversion and further establishes the flexibility and versatility of the project.
Schema SQL Writing The system supports writing the SQL representation of an entity-relationship schema. This involves the implicit relational conversion, and the associated SQL generation. The writing of a schema begins with the schema’s entity classes. The strong entity classes are translated first, followed by related subclasses, and weak entity classes. Multi-valued attributes are also written right after their parent class. The translation includes definition of primary and foreign keys. We provide support for translation of a variety of relationships, including interaction, grouping, composition and instantiation. We fully support unary and binary as well as higher-order relationships. Further, we generate trigger code to manage participation constraints for the general n-ary relationship. We are in the process of developing code to handle co-occurrence and projection constraints. A minimum cardinality specification of > 0 implies the need for an ON DELETE trigger, while a maximum cardinality of < many requires an INSERT trigger. If the database allows updates to primary key values or constraint specification with predicates, then an UPDATE trigger must be used. A row trigger is written for the tables corresponding to a relationship. The core SQL mapping for each of these constraint types has been previously discussed [2]. Since a relationship may have multiple constraints on it (corresponding to different predicate conditions), our triggers call a constraint check procedure that is written for each relationship to verify that the affected rows do not violate the cardinality of any other constraints on the relationship.
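As an illustration of this mapping, the sketch below shows how a generator might decide which trigger events a single relationship constraint needs and emit a skeleton row trigger. All class, method, table and procedure names are hypothetical and do not come from the CARD code base; the emitted PL/SQL is reduced to a shell around the per-relationship constraint check procedure described above.

import java.util.ArrayList;
import java.util.List;

public class TriggerGenerationSketch {

    // Trigger events needed for one constraint; max == null stands for "many" (unbounded).
    static List<String> triggerEvents(int min, Integer max, boolean keyUpdatesAllowed) {
        List<String> events = new ArrayList<>();
        if (min > 0) events.add("DELETE");            // deleting a row may violate the minimum
        if (max != null) events.add("INSERT");        // inserting a row may violate the maximum
        if (keyUpdatesAllowed) events.add("UPDATE");  // re-linking rows may violate either bound
        return events;
    }

    // Skeleton PL/SQL row trigger that delegates to a per-relationship check procedure.
    static String emitTrigger(String relTable, String event) {
        return "CREATE OR REPLACE TRIGGER trg_" + relTable.toLowerCase() + "_" + event.toLowerCase() + "\n"
             + "AFTER " + event + " ON " + relTable + "\n"
             + "FOR EACH ROW\n"
             + "BEGIN\n"
             + "  check_" + relTable.toLowerCase() + "_cardinality();\n"
             + "END;";
    }

    public static void main(String[] args) {
        // Example: participation constraint (1, 3) on a relationship table WORKS_ON
        for (String event : triggerEvents(1, 3, false)) {
            System.out.println(emitTrigger("WORKS_ON", event));
            System.out.println();
        }
    }
}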
3 Summary The CARD prototype has been developed in Java, and is available over the web at http://www.iowadb.com/. For the demonstration, we would like to present the ER
schema development and import process, the options for constraint annotation, as well as the generation of table creation SQL code and associated triggers. While the SQL table creation code is designed to be ANSI compliant and work across platforms, the constraint triggers currently use PL/SQL and are therefore Oracle-specific (however, we feel that the core logic can be modified in a straightforward manner for other platforms). Since it is an online tool, we feel it will be suitable for an audience interested in conceptual data modeling research as well as instructors of database classes.
References 1. Currim, F., Ram, S.: Modeling Spatial and Temporal Set-based Constraints during Conceptual Database Design. Information Systems Research (forthcoming) 2. Currim, F., Ram, S.: Understanding the concept of Completeness in Frameworks for Modeling Cardinality Constraints. In: 16th Workshop on Information Technology and Systems, Milwaukee (2006) 3. Chen, P.P.: The Entity-Relationship Model - Toward a Unified View of Data. ACM Transactions on Database Systems 1, 9–36 (1976) 4. Liddle, S.W., Embley, D.W., Woodfield, S.N.: Cardinality Constraints in Semantic Data Models. Data and Knowledge Engineering 11, 235–270 (1993) 5. Ram, S., Khatri, V.: A Comprehensive Framework for Modeling Set-based Business Rules during Conceptual Database Design. Information Systems 30, 89–118 (2005) 6. Ram, S.: Intelligent Database Design using the Unifying Semantic Model. Information and Management 29, 191–206 (1995)
AuRUS: Automated Reasoning on UML/OCL Schemas Anna Queralt, Guillem Rull, Ernest Teniente, Carles Farré, and Toni Urpí Universitat Politècnica de Catalunya - BarcelonaTech {aqueralt,grull,teniente,farre,urpi}@essi.upc.edu
Abstract. To ensure the quality of an information system, the conceptual schema that represents its domain must be semantically correct. We present a prototype to automatically check whether a UML schema with OCL constraints is right in this sense. It is well known that the full expressiveness of OCL leads to undecidability of reasoning. To deal with this problem, our approach finds a compromise between expressiveness and decidability, thus being able to handle very expressive constraints guaranteeing termination in many cases.
1 Introduction We present a tool that allows one to assess the semantic quality of a conceptual schema, consisting of a UML class diagram complemented with a set of arbitrary OCL constraints. Semantic quality can be seen from two perspectives. First, the definition of the schema cannot include contradictions or redundancies. In other words, the schema must be right. Second, the schema must be the right one, i.e. it must correctly represent the requirements. Due to the high expressiveness of the combination of UML and OCL, it becomes very difficult to manually check the correctness of a conceptual schema, especially when the set of constraints is large, so it is desirable to provide the designer with automated support. Our approach consists in checking a set of properties on a schema, both to assess that it is right and that it is the right one. Most of the questions that allow checking these properties are automatically drawn from the schema. Additionally, we provide the designer with the ability to define his own questions to check the correspondence of the schema with the requirements in an interactive validation process. It is well-known that the problem of automatically reasoning with integrity constraints in their full generality is undecidable. This means that it is impossible to build a reasoning procedure that deals with the full expressiveness of OCL, and that always terminates and answers whether the schema satisfies a certain property or not. Thus, the problem has been approached in the following ways in the literature:
1. Allowing general constraints without guaranteeing termination [3, 5].
2. Allowing general constraints without guaranteeing completeness [1, 4, 7, 11].
3. Ensuring both termination and completeness by allowing only specific kinds of constraints, such as cardinalities or identifiers [2].
Approaches of the first kind are able to deal with arbitrary constraints but do not guarantee termination, that is, a result may not be obtained in some particular cases.
The second kind of approaches always terminate, but they may fail to find that a schema satisfies a certain property when it does. Finally, the third kind of approaches guarantee both completeness and termination by disallowing arbitrary constraints. AuRUS combines the benefits of the first and the third approaches. It admits highly expressive OCL constraints, for which decidability is not guaranteed a priori, and determines whether termination is ensured for each schema to be validated. The theoretical background of this tool was published in [8-10]. AuRUS is an extension of our relational database schema validation tool [6] to the context of conceptual schemas. It handles UML schemas with arbitrary OCL constraints, and provides the analysis of termination. Also, additional predefined properties are included, both to check that the schema is right and that it is the right one. Moreover, the explanations for unsatisfiable properties are given in natural language.
2 Description of the Validation Process In this section we describe how a schema can be validated using AuRUS, which corresponds to the demonstration we intend to perform. We will use the schema in Fig. 1, specified using the CASE tool Poseidon, with a set of OCL constraints that can be introduced as comments. The schema must be saved as XMI (cf. Section 3).
1. context Department inv UniqueDep: Department.allInstances()->isUnique(name)
2. context Employee inv UniqueEmp: Employee.allInstances()->isUnique(name)
3. context WorkingTeam inv UniqueTeam: WorkingTeam.allInstances()->isUnique(name)
4. context Department inv MinimumSalary: self.minSalary > 1000
5. context Department inv CorrectSalaries: self.minSalary < self.maxSalary
6. context Department inv ManagerIsWorker: self.worker->includes(self.manager)
7. context Department inv ManagerHasNoSuperior: self.manager.superior->isEmpty()
8. context Boss inv BossIsManager: self.managedDep->notEmpty()
9. context Boss inv BossHasNoSuperior: self.superior->isEmpty()
10. context Boss inv SuperiorOfAllWorkers: self.subordinate->includesAll(self.managedDep.worker)
11. context WorkingTeam inv InspectorNotMember: self.employee->excludes(self.inspector)
12. context Member inv NotSelfRecruited: self.recruiter <> self
13. context WorkingTeam inv OneRecruited: self.member->exists(m | m.recruiter.workingTeam = self)
Fig. 1. Class diagram introduced in Poseidon and OCL integrity constraints
Once the schema is loaded in AuRUS, the tool shows a message informing whether termination of the reasoning is guaranteed. The user can proceed in any case, knowing that he will get an answer in finite time if the result is positive. In our example we can check any (predefined or user-defined) property with the guarantee of termination, as can be seen in Fig. 2.
Fig. 2. Message informing that reasoning on our schema is always finite
Fig. 3 shows the interface of AuRUS. The schema is represented in the tree at the left, containing the class diagram and the constraints. When clicking on a class, its attributes appear below; for associations, the type and cardinality of their participants are shown, and for constraints, their corresponding OCL expressions appear.
Fig. 3. Is the schema right? properties, with results for liveliness
The first step is to check whether the schema is right (Is the schema right? tab). The user can choose the properties to be checked, expressed in the form of questions. We check all of them, and we see for example that the answer to the question Are all classes and associations lively? is negative. This takes 7.6 seconds, as can be seen at the bottom of the window. The Liveliness tab below shows that class Boss (in red) is not lively, and clicking on it we get an explanation consisting of the set of constraints (graphical, OCL or implicit in the class diagram) that prevent it from being instantiated. This means we must remove or modify any of these constraints to make Boss lively. The rest of the classes and associations are lively, and clicking on each of them we obtain a sample instantiation as a proof. Both the explanation for the unliveliness of Boss and a sample instantiation for the liveliness of Department are shown in Fig. 3. The constants used in the instantiations are real numbers, since this facilitates the implementation and does not affect the results. Instances of classes include the values of their attributes as parameters, as well as a first value representing the Object Identifier (OID). Instances of associations include the OIDs of the instances they link. For instance, the instantiation for Department in Fig. 3 shows that in order to have a valid instance of this class (in this case, a department with 0.0 as OID) we also need an employee with OID 0.0010, who must be linked to this department by means of the associations WorksIn and Manages, due to the cardinalities and to the OCL constraint 6. Also, the minimum salary of the department must be over 1000 (constraint 4), and lower than the maximum salary (constraint 5). This is shown in the values 1000.001 and 1000.002 given to its attributes. We can also see that there is some redundant constraint (details are given in the Non-redundancy tab). In particular, the OCL constraint 9 (BossHasNoSuperior) is redundant. Fig. 4 shows the explanation computed by AuRUS, consisting of the constraints that cause the redundancy: BossIsManager and ManagerHasNoSuperior, together with the fact that Boss is a subclass of Employee, which is the class in which the latter constraint is defined. This redundancy means we can remove BossHasNoSuperior from the schema, thus making it simpler and easier to maintain while preserving its semantics. The rest of the properties are satisfied (see Fig. 3).
Fig. 4. Explanation for the redundancy of the OCL constraint BossHasNoSuperior
When all the properties in Is the schema right? tab are satisfied, we can check the ones in Is it the right schema? to ensure that it represents the intended domain. As shown in Fig. 5(a) (Predefined validation tab), some predefined questions help the designer to check whether he has overlooked something. As a result we get that all classes have an identifier (the answer to Is some identifier missing? is No), but some
other constraints might be missing. For instance, the answer Maybe to the question Is some irreflexive constraint missing? warns us that some recursive association in the schema can link an object to itself. In the Irreflexivity tab, the highlighted line tells us that an employee can be related to himself through the association WorksFor. The designer must decide whether this is correct according to the domain and add an appropriate constraint if not. Finally, the designer may want to check additional ad-hoc properties, such as May a superior work in a different department than his subordinates?. This test can be introduced in the Interactive validation tab (Fig. 5(b)), where instantiations for classes, association classes or associations can be edited to formalize the desired questions. Questions consist of a set of instances that must (not) hold in the schema, and these instances may be specified using variables, as shown in the figure. In this case we want to check whether the schema admits an employee, formalized using the variable Superior, that has a Subordinate (instance of the association WorksFor), such that Superior works in a department Dept (instance of WorksIn) and Subordinate does not work in the department Dept (another instance of WorksIn, which this time must be negated). This partial instantiation of the schema, shown at the bottom of Fig. 5(b), is satisfiable, which means that in our schema a superior may work in a department that is different from that of his subordinates.
Fig. 5. Is it the right schema? properties: (a) Predefined validation, (b) Interactive validation
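Read as a formula (our notation here, not the internal representation used by AuRUS), the ad-hoc question above asks whether the schema, together with all its graphical and OCL constraints, admits an instantiation satisfying

∃ Superior, Subordinate, Dept :
  WorksFor(Superior, Subordinate) ∧ WorksIn(Superior, Dept) ∧ ¬WorksIn(Subordinate, Dept)

A positive answer, as obtained in this example, means that such an instantiation exists.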
In the same way, specific instances can be tested to see whether they are accepted by the schema, by providing constants instead of variables when defining the question.
3 AuRUS Overview AuRUS works as a standalone application. Its input is an XMI file with the UML/OCL schema. This file is loaded into a Java library that implements the UML 2.0 and OCL 2.0 metamodels [12]. Since the XMI files generated by different CASE tools are usually not compatible, this library implements its own XMI format. Models can be constructed using the primitives offered by the library, or can be drawn in Poseidon and then imported using a converter [12]. Poseidon does not support some UML constructs, such as n-ary association classes. If required, they can be added using the primitives in the library. Currently, only the converter from Poseidon is available, but we plan to provide converters for other popular tools. All the components of AuRUS have been implemented in Java, except for the reasoning engine, which is implemented in C#. It can be executed in any system featuring the .NET 2.0 framework and the Java Runtime Environment 6.
Acknowledgements Our thanks to Lluís Munguía, Xavier Oriol and Guillem Lubary for their work in the implementation of this tool, and to Albert Tort and Antonio Villegas for their help. We also thank the people in the FOLRE and GMC research groups. This work has been partly supported by the Ministerio de Ciencia y Tecnología under the projects TIN2008-03863 and TIN2008-00444, Grupo Consolidado, and the FEDER funds.
References 1. Anastasakis, K., Bordbar, B., Georg, G., Ray, I.: On Challenges of Model Transformation from UML to Alloy. Software and System Modeling 9(1), 69–86 (2010) 2. Berardi, D., Calvanese, D., De Giacomo, G.: Reasoning on UML Class Diagrams. Artificial Intelligence 168(1-2), 70–118 (2005) 3. Brucker, A.D., Wolff, B.: The HOL-OCL Book. Swiss Federal Institute of Technology (ETH),525 (2006) 4. Cabot, J., Clarisó, R., Riera, D.: Verification of UML/OCL Class Diagrams Using Constraint Programming. In: Proc. Workshop on Model Driven Engineering, Verification and Validation, MoDEVVa 2008 (2008) 5. Dupuy, S., Ledru, Y., Chabre-Peccoud, M.: An Overview of RoZ: A Tool for Integrating UML and Z Specifications. In: Wangler, B., Bergman, L.D. (eds.) CAiSE 2000. LNCS, vol. 1789, pp. 417–430. Springer, Heidelberg (2000) 6. Farré, C., Rull, G., Teniente, E., Urpí, T.: SVTe: A Tool to Validate Database Schemas Giving Explanations. In: Proc. International Workshop on Testing Database Systems DBTest, p. 9 (2008) 7. Gogolla, M., Büttner, F., Richters, M.: USE: A UML-based Specification Environment for Validating UML and OCL. Science of Computer Programming 69(1-3), 27–34 (2007) 8. Queralt, A., Teniente, E.: Reasoning on UML Class Diagrams with OCL Constraints. In: Embley, D.W., Olivé, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, pp. 497–512. Springer, Heidelberg (2006)
444
A. Queralt et al.
9. Queralt, A., Teniente, E.: Decidable Reasoning in UML Schemas with Constraints. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 281–295. Springer, Heidelberg (2008) 10. Rull, G., Farré, C., Teniente, E., Urpí, T.: Providing Explanations for Database Schema Validation. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds.) DEXA 2008. LNCS, vol. 5181, pp. 660–667. Springer, Heidelberg (2008) 11. Snook, C., Butler, M.: UML-B: Formal Modeling and Design Aided by UML. ACM Trans. on Soft. Engineering and Methodology 15(1), 92–122 (2006) 12. UPC, UOC. EinaGMC, http://guifre.lsi.upc.edu/eina_GMC
How the Structuring of Domain Knowledge Helps Casual Process Modelers Jakob Pinggera1, Stefan Zugal1, Barbara Weber1, Dirk Fahland2, Matthias Weidlich3, Jan Mendling2, and Hajo A. Reijers4
1 University of Innsbruck, Austria {jakob.pinggera,stefan.zugal,barbara.weber}@uibk.ac.at 2 Humboldt-Universität zu Berlin, Germany {fahland@informatik.hu-berlin.de,jan.mendling@wiwi.hu-berlin.de} 3 Hasso-Plattner-Institute, University of Potsdam, Germany matthias.weidlich@hpi.uni-potsdam.de 4 Eindhoven University of Technology, The Netherlands h.a.reijers@tue.nl Abstract. Modeling business processes has become a common activity in industry, but it is increasingly carried out by non-experts. This raises a challenge: How to ensure that the resulting process models are of sufficient quality? This paper contends that a prior structuring of domain knowledge, as found in informal specifications, will positively influence the act of process modeling in various measures of performance. This idea is tested and confirmed with a controlled experiment, which involved 83 master students in business administration and industrial engineering from Humboldt-Universität zu Berlin and Eindhoven University of Technology. In line with the reported findings, our recommendation is to explore ways to bring more structure into the specifications that are used as input for process modeling endeavors.
1 Introduction
Business process modeling is the task of creating an explicit, graphical model of a business process from internalized knowledge on that process. This type of conceptual modeling has recently received considerable attention in information systems engineering due to its increasing importance in practice [1]. Business process modeling typically involves two specific, associated roles. A domain expert concretizes domain knowledge into an informal description, which is abstracted into a formal model by a system analyst. This works well if the domain expert and the process modeler closely interact with each other, and if both the domain expert and the process modeler attain a high level of expertise. However, these two conditions are often not met. Increasingly, casual modelers who are neither domain nor process modeling experts are involved in process modeling initiatives. Many organizations do not reserve the time or resources for iterative and consensus-seeking approaches. To illustrate, we are in contact with a financial services provider that employs over 400 business professionals of which only two are skilled process modelers. Process modeling activities in this organization are
often carried out by IT specialists with a low process modeling expertise. As a consequence, process modeling is driven by informal requirement specifications as provided by domain experts, and models are generated in a small number of cycles with little opportunity for feedback and interaction. Requirements engineering points to the importance of structure in requirements specifications [2,3]. Along these lines, our central idea is that the quality of process models may be improved through providing inexperienced, casual process modelers with well-structured domain descriptions. To investigate this contention we designed an experiment in which we observed and measured how a process modeler creates a formal process model from an informal requirements specification. We varied the level of content organization of an informal requirements specification (serving as proxy for differently skilled domain experts) and traced its impact on process model quality. The subjects in our experiment were 83 students who received only limited prior process modeling training. The paper is structured as follows. Section 2 discusses the background of our research. Section 3 describes our experimental framework. Section 4 covers the execution and results of our experiment. Section 5 concludes the paper.
2 Background
For discussing factors that influence the creation of a formal process model, it is necessary to first reflect on conceptual modeling in general. Conceptual models are developed and used during the requirements analysis phase of information systems development [4]. At that stage, conceptual modeling is an exchange process between a domain expert on the one hand and a system analyst on the other hand [3,5]. Typically, a domain expert can be characterized as someone with (1) superior, detailed knowledge of the object under consideration but often (2) minor powers of abstraction beyond that knowledge. The strengths of the system analyst are exactly the opposite. In this sense, the domain expert is mainly concerned with concretization, which refers to the act of developing an informal description of the situation under consideration. The system analyst, in contrast, is concerned with abstraction, i.e., using the informal description to derive a formalized model of the object. The interaction between domain expert and analyst comprises elicitation, mapping, verification, and validation. In the elicitation step the domain expert produces an initial problem specification, also referred to as the dialogue document. As natural language is human’s essential vehicle to convey ideas, this is normally written in natural language [3]. The primary task of the system analyst is to map the sentences of this informal dialogue document onto concepts of a modeling technique. The resulting formal model can be verified using the syntax rules of the technique. The formal model, in turn, can be translated into a comprehensible format for validation purposes. DeMarco states that a dialogue document is not the problem in the analysis if it is a “suitably partitioned spec with narrative text used at the bottom level” [6]. This statement is in line with more general insights from cognitive psychology that the presentation of
a problem has a significant impact on the solution strategy [7]. This need for a good organization of the requirements is reflected in various works that suggest guidelines and automated techniques for increasing the quality of informal descriptions, particularly of use cases [8,9]. The quality of the dialogue document can be improved using a multitude of requirements elicitation techniques [10]. In the situation of casual modeling the steps of mapping, verification, and validation are conducted by a system analyst with limited abstraction capabilities. Currently, we lack a detailed understanding of what the process of mapping an informal dialogue document to a process model looks like and what exactly the abstraction capabilities entail that we expect from a good system analyst or process modeler. In this paper, we focus on the organization of domain knowledge as it has to be done during mapping to a formal model. To investigate its impact on the creation of process models, we provide different dialogue documents in an experiment that have different degrees of internal organization. For subjects like graduate students without established expertise in modeling, we should be able to observe the consequences of a lack of content organization. Insights from this investigation might improve guidelines on organizing a dialogue document and the effectiveness of approaches supporting the modeling process.
3 Research Setup
The main goal of our experiment is to investigate the impact of content organization of the dialogue document on the modeling outcome and the modeling process. To this end, we designed a task of creating a formal process model in BPMN syntax from an informal dialogue document under varying levels of content organization. To investigate the very process of process modeling, we recorded every modeling step in the experiment in a log. In this section the setup of our experiment is described in conformance with the guidelines of [11]. Subjects: In our experiment, subjects are 66 students of a graduate course on Business Process Management at Eindhoven University of Technology and 17 students of a similar course at the Humboldt-Universität zu Berlin. Participation in the study was voluntary. The participants conducted the modeling in the Cheetah BPMN Modeler [12], a graphical process editor specifically designed for conducting controlled experiments. Objects: The object to be modeled is an actual process run by the “Task Force Earthquakes” of the German Research Center for Geosciences (GFZ), which coordinates the allocation of an expert team after catastrophic earthquakes. The task force runs in-field missions for collecting and analyzing data, including seismic data of aftershocks, post-seismic deformation, hydrogeological data, damage distribution, and structural conditions of buildings at major disaster sites [13]. In particular, subjects were asked to model the “Transport of Equipment” process of the task force. The task force needs scientific equipment in the disaster area to complete its mission. We provided a description of how the task force transports its equipment from Germany to the disaster area.
Factor and Factor Levels: The considered factor in our experiment is the organization of the dialogue document. We provided our subjects with dialogue documents with varying degrees of content organization simulating the structuring capabilities of domain specialists. The documents differ in the order in which the process is described (Factor Levels: breadth-first, depth-first and random order description). For all three dialogue document variants, we created a natural language description of the process from a set of elementary text blocks, each block describing one activity of the process. Depending on the factor level, the text blocks were ordered differently. The breadth-first description begins with the start activity and then explains the entire process by taking all branches into account. The depth-first description, in turn, begins with the start activity and then describes the entire first branch of the process before moving on with other branches. Finally, the random description yields a dialogue document for which the order of activity text blocks does not correlate with the structure of the process model. Response Variables: As response variable we considered accuracy of the resulting model, estimated by comparing each model to a reference model of the process. Here, we relied on the graph-edit distance, which defines the minimal number of atomic graph operations needed to transform one graph into another. The graph-edit distance, in turn, can be leveraged to define a similarity metric [14]. For our setting, we weighted insertion and deletion operations of edges and nodes equally, whereas node substitutions are not taken into account as they have been established manually for corresponding pairs of activities. The corresponding hypothesis is: Null Hypothesis H0 : There is no significant difference in the accuracy of the created process models between the three groups.
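For reference, a common way of turning the graph-edit distance into a similarity score — shown here only to illustrate the idea, since the precise formulation we used is the one of [14] — is to normalize the number of edit operations by the combined size of the two models:

sim(M1, M2) = 1 − (#inserted/deleted nodes + #inserted/deleted edges) / (|N1| + |N2| + |E1| + |E2|)

A model identical to the reference thus obtains a similarity of 1, and every node or edge that has to be inserted or deleted lowers the score; with node substitutions weighted zero, as in our setting, only insertions and deletions contribute.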
4 Performing the Experiment
This section describes the preparation and execution of the experiment, as well as the analysis and discussion of the results. Preparation: As part of the set-up of the intended experiment, we provided the task to model the “Transport of Equipment” process from a natural language description. Three variants of the task were created: a depth-first description (Variant D), a breadth-first description (Variant B) and a random description (Variant R). To ensure that each description is understandable and can be modeled in the available amount of time, we conducted a pre-test with 14 graduate students at the University of Innsbruck. Based on their feedback, the modeling task descriptions were refined in several iterations. Execution: The experiment was conducted at two distinct, subsequent events. The first event took place in early November 2009 in Berlin, the second was performed a few days later in Eindhoven. The modeling session started with a demographic survey and was followed by a modeling tool tutorial in which the basic functionality of the BPMN Modeler was explained to our subjects. This was followed by the actual modeling task in which the students had to model
the “Transport of Equipment” process. Roughly a third of the students were randomly assigned to the D variant of the modeling task, another third to the B variant and the remaining third to the R variant. After completing the modeling task, the students received a questionnaire on cognitive load. Data Validation: Once the experiment was carried out, logged data was analyzed. We discarded the data of 8 students because the respective data was incomplete. Finally, data provided by 66 Eindhoven students and 17 Berlin students was used in our data analysis. Data Analysis: In total 83 students participated in our experiment. Out of the 83 students, 27 worked on the breadth-first description, 25 on the depth-first description and 31 on the random description. To assess how accurately the 83 models reflect the “Transport of Equipment” process, we compared each model to a reference model of the process using the graph-edit distance. A statistical analysis revealed a significant difference in accuracy between the three groups in terms of this similarity metric (p=0.0026), see Fig. 1. Pairwise Mann-Whitney tests showed a significant difference between breadth-first and random (p=0.0021 < 0.05/3) and between depth-first and random (p=0.0073 < 0.05/3). No difference can be observed between breadth-first and depth-first (p=0.4150). Discussion of Results. Our concern in this paper is the impact of the organization level of an informal specification on the outcome of a modeling process. This contention seems to be confirmed: An explicit ordering of the specification is positively related to the accuracy of the process model that is derived from it. The models created from a breadth-first and depth-first description are significantly more similar to the reference model than those created on the basis of the randomized description. The group dealing with the latter had to re-organize the dialogue document quite extensively. This suggests that casual modelers would perform better when presented with well-structured specifications. How can the insights from this study be exploited? First of all, it seems reasonable to be selective with respect to the domain experts that will be involved in drawing up the informal specifications. After all, some may be more apt than others to bring structure to such a document. Secondly, it may be feasible to instruct domain experts on how to bring structure to their specifications. In a research stream that is concerned with structuring use cases [9,15], various
Fig. 1. Accuracy of Models (similarity of the created models to the reference model for the breadth-first, depth-first, and random groups)
measures can be distinguished to ease making sense of them. For example, one proposal is to create pictures along with use cases that sketch the hierarchical relations between them, or simply to use numbering to both identify and distinguish between logical fragments of the use cases.
5 Summary and Outlook
This paper presented findings from an experiment investigating the impact of content organization of the informal dialogue document to both the modeling outcome and the modeling process. Apparently, a breadth-first organization was best suited to yield good results, indicating that industrial practice of process modeling can be improved when selecting domain specialists with good content organization skills. Our future work aims at further investigating the process of creating process models. Acknowledgements. We thank H. Woith and M. Sobesiak for providing us with the expert knowledge of the disaster management process used in our experiment.
References 1. Indulska, M., Recker, J., Rosemann, M., Green, P.: Business process modeling: Current issues and future challenges. In: van Eck, P., Gordijn, J., Wieringa, R. (eds.) CAiSE 2009. LNCS, vol. 5565, pp. 501–514. Springer, Heidelberg (2009) 2. Davis, A., et al.: Identifying and measuring quality in a software requirements specification. In: Proc. METRICS, pp. 141–152 (1993) 3. Frederiks, P.J.M., van der Weide, T.P.: Information Modeling: The Process and the Required Competencies of Its Participants. DKE 58, 4–20 (2006) 4. Wand, Y., Weber, R.: Research Commentary: Information Systems and Conceptual Modeling - A Research Agenda. ISR 13, 363–376 (2002) 5. Hoppenbrouwers, S., Proper, H., Weide, T.: A fundamental view on the process of conceptual modeling. In: Delcambre, L.M.L., Kop, C., Mayr, H.C., Mylopoulos, ´ (eds.) ER 2005. LNCS, vol. 3716, pp. 128–143. Springer, Heidelberg J., Pastor, O. (2005) 6. DeMarco, T.: Software Pioneers. Contributions to Software Engineering (2002) 7. Lindsay, P., Norman, D.: Human Information Processing: An introduction to psychology. Academic Press, London (1977) 8. Cockburn, A.: Writing Effective Use Cases. Addison-Wesley, Reading (2000) 9. Rolland, C., Achour, C.B.: Guiding the Construction of Textual Use Case Specifications. DKE 25, 125–160 (1998) 10. Davis, A.M., et al.: Effectiveness of requirements elicitation techniques: Empirical results derived from a systematic review. In: Proc. RE, pp. 176–185 (2006) 11. Wohlin, C., et al.: Experimentation in Software Engineering: an Introduction. Kluwer, Dordrecht (2000)
12. Pinggera, J., Zugal, S., Weber, B.: Investigating the Process of Process Modeling with Cheetah Experimental Platform. Accepted for ER-POIS 2010 (2010) 13. Fahland, D., Woith, H.: Towards Process Models for Disaster Response. In: Proc. PM4HDPS 2008, pp. 254–265 (2008) 14. Dijkman, R., Dumas, M., García-Bañuelos, L.: Graph Matching Algorithms for Business Process Model Similarity Search. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) Business Process Management. LNCS, vol. 5701, pp. 48–63. Springer, Heidelberg (2009) 15. Constantine, L., Lockwood, L.: Structure and style in use cases for user interface design. In: Object Modeling and User Interface Design, pp. 245–280 (2001)
SPEED: A Semantics-Based Pipeline for Economic Event Detection Frederik Hogenboom, Alexander Hogenboom, Flavius Frasincar, Uzay Kaymak, Otto van der Meer, Kim Schouten, and Damir Vandic Erasmus University Rotterdam PO Box 1738, NL-3000 DR Rotterdam, The Netherlands {fhogenboom,hogenboom,frasincar,kaymak}@ese.eur.nl, {276933rm,288054ks,305415dv}@student.eur.nl
Abstract. Nowadays, emerging news on economic events such as acquisitions has a substantial impact on the financial markets. Therefore, it is important to be able to automatically and accurately identify events in news items in a timely manner. For this, one has to be able to process a large amount of heterogeneous sources of unstructured data in order to extract knowledge useful for guiding decision making processes. We propose a Semantics-based Pipeline for Economic Event Detection (SPEED), aiming to extract financial events from emerging news and to annotate these with meta-data, while retaining a speed that is high enough to make real-time use possible. In our implementation of the SPEED pipeline, we reuse some of the components of an existing framework and develop new ones, e.g., a high-performance Ontology Gazetteer and a Word Sense Disambiguator. Initial results drive the expectation of a good performance on emerging news.
1 Introduction
In today’s information-driven society, machines that can process natural language can be of great importance. Decision makers are expected to be able to extract information from an ever increasing amount of data such as emerging news, and subsequently to be able to acquire knowledge by applying reasoning to the gathered information. In today’s global economy, it is of utmost importance to have a complete overview of the business environment to enable effective, well-informed decision making. Financial decision makers thus need to be aware of events on their financial market, which is often extremely sensitive to economic events like stock splits and dividend announcements. Proper and timely event identification can aid decision making processes, as these events provide means of structuring information using concepts, with which knowledge can be generated by applying inference. Hence, automating information extraction and knowledge acquisition processes can be a valuable contribution. This paper proposes a fully automated framework for processing financial news messages gathered from RSS feeds. These events are represented in a
machine-understandable way. Extracted events can be made accessible for other applications as well through the use of Semantic Web technologies. Furthermore, the framework is intended to handle news messages at a speed suitable for real-time use, as new events can occur at any time and require decision makers to respond in a timely and adequate manner. Our proposed framework (pipeline) identifies the concepts related to economic events, which are defined in a domain ontology and are associated with synsets from a semantic lexicon (e.g., WordNet [1]). For concept identification, lexico-semantic patterns based on concepts from the ontology are employed in order to match lexical representations of concepts retrieved from the text with event-related concepts that are available in the semantic lexicon, and thus aim to maximize recall. The identified lexical representations of relevant concepts are subject to a Word Sense Disambiguation (WSD) procedure for determining the corresponding sense, in order to maximize precision. In order for our pipeline to be real-time applicable, we also aim to minimize the latency, i.e., the time it takes for a new news message to be processed by the pipeline. The remainder of this paper is structured as follows. Firstly, Sect. 2 discusses related work. Subsequently, Sects. 3 and 4 elaborate on the proposed framework and its implementation, respectively. Finally, Sect. 5 wraps up this paper.
2 Related Work
This section discusses tools that can be used for Information Extraction (IE) purposes. Firstly, Sect. 2.1 discusses the ANNIE pipeline and subsequently, Sects. 2.2 and 2.3 elaborate on the CAFETIERE and KIM frameworks, respectively. Finally, Sect. 2.4 wraps up this section.
2.1 The ANNIE Pipeline
The General Architecture for Text Engineering (GATE) [2] is a freely available general purpose framework for IE tasks, which provides the possibility to construct processing pipelines from components that perform specific tasks, e.g., linguistic, syntactic, and semantic analysis tasks. By default, GATE loads the A Nearly-New Information Extraction (ANNIE) system, which consists of several key components, i.e., the English Tokenizer, Sentence Splitter, Part-Of-Speech (POS) Tagger, Gazetteer, Named Entity (NE) Transducer, and OrthoMatcher. Although the ANNIE pipeline has proven to be useful in various information extraction jobs, its functionality does not suffice when applied to discovering economic events in news messages. An important lacking component is one that can be employed for WSD, although some disambiguation can be done using JAPE rules in the NE Transducer. This is however a cumbersome and ineffective approach where rules have to be created manually for each term, which is prone to errors. Furthermore, ANNIE lacks the ability to individually look up concepts from a large ontology within a limited amount of time. Despite its drawbacks, GATE is highly flexible and customizable, and therefore ANNIE’s components are either usable, or extendible and replaceable in order to suit our needs.
2.2 The CAFETIERE Pipeline
An example of an adapted ANNIE pipeline is the Conceptual Annotations for Facts, Events, Terms, Individual Entities, and RElations (CAFETIERE) relation extraction pipeline [3], which consists of an ontology lookup process and a rule engine. Within CAFETIERE, the Common Annotation Scheme (CAS) DTD is applied, which allows for three layers of annotation, i.e., structural, lexical, and semantic annotation. CAFETIERE employs extraction rules defined at the lexico-semantic level which are similar to JAPE rules. Nevertheless, the syntax is at a higher level than is the case with JAPE, resulting in rules that are easier to express, but less flexible. As knowledge is stored in an ontology using the Narrative Knowledge Representation Language (NKRL), Semantic Web ontologies are not employed. NKRL has no formal semantics and there is no reasoning support, which is desired when identifying for instance economic events. Furthermore, gazetteering is a slow process when going through large ontologies. Finally, the pipeline also misses a WSD component.
2.3 The KIM Platform
The Knowledge and Information Management (KIM) platform [4] combines GATE components with semantic annotation techniques in order to provide an infrastructure for IE purposes. The framework focuses on automatic annotation of news articles, where entities, inter-entity relations, and attributes are discovered. For this, the authors employ a pre-populated OWL upper ontology. In the back-end, a semantically enabled GATE pipeline, which utilizes semantic gazetteers and pattern-matching grammars, is invoked for named entity recognition using the KIM ontology. Furthermore, GATE is used for managing the content and annotations within the back-end of KIM’s architecture. The middle layer of the KIM architecture provides services that can be used by the topmost layer, e.g., semantic repository navigation, semantic indexing and retrieval, etcetera. The front-end layer of KIM embodies front-end applications, such as the Annotation Server and the News Collector. The differences between KIM and our envisaged approach are that we aim for a financial event-focused information extraction pipeline, in contrast to KIM’s general-purpose framework. Hence, we employ a domain-specific ontology rather than an upper ontology. Furthermore, we focus on event extraction from corpora, in contrast to mere (semantic) annotation. Finally, the authors do not mention the use of WSD, whereas we consider WSD to be an essential component in an IE pipeline.
2.4 Conclusions
The IE frameworks discussed in this section have their merits, yet each framework fails to fully address the issues we aim to alleviate. The frameworks incorporate semantics only to a limited extent, e.g., they make use of gazetteers or
knowledge bases that are either not ontologies or ontologies that are not based on OWL. Being able to use a standard language such as OWL fosters application interoperability and the reuse of existing reasoning tools. Also, existing frameworks lack a feedback loop, i.e., there is no knowledge base updating. Furthermore, WSD appears not to be sufficiently tackled in most cases. Finally, most existing approaches focus on annotation, rather than event recognition. Therefore, we aim for a framework that combines the insights gained from the approaches discussed previously, targeted at financial event discovery in news articles.
3 Economic Event Detection Based on Semantics
The analysis presented in Sect. 2 demonstrates several approaches to automated information extraction from news messages, which are often applied for annotation purposes and are not semantics-driven. Because we hypothesize that domain-specific information captured in semantics facilitates detection of relevant concepts, we propose a Semantics-Based Pipeline for Economic Event Detection (SPEED). The framework is modeled as a pipeline and is driven by a financial ontology developed by domain experts, containing information on the NASDAQ-100 companies that is extracted from Yahoo! Finance. Many concepts in this ontology stem from a semantic lexicon (e.g., WordNet), but another significant part of the ontology consists of concepts representing named entities (i.e., proper names). Figure 1 depicts the architecture of the pipeline. In order to identify relevant concepts and their relations, the English Tokenizer is employed, which splits text into tokens (which can be for instance words or numbers) and subsequently applies linguistic rules in order to split or merge identified tokens. These tokens are linked to ontology concepts by means of the Ontology Gazetteer, in contrast to a regular gazetteer, which uses lists of words as input. Matching tokens in the text are annotated with a reference to their associated concepts defined in the ontology.
Fig. 1. SPEED design (information flow from News through the English Tokenizer, Ontology Gazetteer, Sentence Splitter, Part-Of-Speech Tagger, Morphological Analyzer, Word Group Look-Up, Word Sense Disambiguator, Event Phrase Gazetteer, Event Pattern Recognition, and Ontology Instantiator, with the Ontology and Semantic Lexicon used by several components)
Subsequently, the Sentence Splitter groups the tokens in the text into sentences, based on tokens indicating a separation between sentences. These sentences are used for discovering the grammatical structures in a corpus by determining the type of each word token by means of the Part-Of-Speech Tagger. As words can have many forms that have a similar meaning, the Morphological Analyzer subsequently reduces the tagged words to their lemma as well as a suffix and/or affix. A word can have multiple meanings and a meaning can be represented by multiple words. Hence, the framework needs to tackle WSD tasks, given POS tags, lemmas, etcetera. To this end, first of all, the Word Group Look-Up component combines words into maximal word groups, i.e., it aims for as many words per group as possible for representing some concept in a semantic lexicon (such as WordNet). It is important to keep in mind groupings of words, as combinations of words may have very specific meanings compared to the individual words. Subsequently, the Word Sense Disambiguator determines the word sense of each word group by exploring the mutual relations between senses of word groups using graphs. The senses are determined based on the number and type of detected semantic interconnections in a labeled directed graph representation of all senses of the considered word groups [5]. After disambiguating word group senses, the text can be interpreted by introducing semantics, which links word groups to an ontology, thus capturing their essence in a meaningful and machine-understandable way. Therefore, the Event Phrase Gazetteer scans the text for specific (financial) events, by utilizing a list of phrases or concepts that are likely to represent some part of a relevant event. Events thus identified are then supplied with available additional information by the Event Pattern Recognition component, which matches events to lexico-semantic patterns that are subsequently used for extracting additional information. Finally, the knowledge base is updated by inserting the identified events and their extracted associated information into the ontology by means of the Ontology Instantiator.
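The maximal word group idea can be illustrated with the greedy longest-match sketch below. It is only an assumption about one possible realization, not the actual SPEED component, which works against a full semantic lexicon such as WordNet rather than the toy set used here.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordGroupLookupSketch {

    // Greedily take, at each position, the longest token span that matches a lexicon entry.
    static List<String> maximalWordGroups(List<String> tokens, Set<String> lexicon) {
        List<String> groups = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int end = i + 1;                                   // fall back to the single token
            for (int j = tokens.size(); j > i + 1; j--) {
                if (lexicon.contains(String.join(" ", tokens.subList(i, j)))) {
                    end = j;                                   // longest matching group wins
                    break;
                }
            }
            groups.add(String.join(" ", tokens.subList(i, end)));
            i = end;
        }
        return groups;
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("stock split", "dividend"));
        System.out.println(maximalWordGroups(
                Arrays.asList("the", "stock", "split", "was", "announced"), lexicon));
        // prints: [the, stock split, was, announced]
    }
}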
4 SPEED Implementation
The analysis presented in Sect. 2 exhibits the potential of a general architecture for text engineering: GATE. The modularity of such an architecture can be of use in the implementation of our semantics-based pipeline for economic event detection, as proposed in Sect. 3. Therefore, we made a Java-based implementation of the proposed framework by using default GATE components, such as the English Tokenizer, Sentence Splitter, Part-Of-Speech Tagger, and the Morphological Analyzer, which generally suit our needs. Furthermore, we extended the functionality of other GATE components (e.g., ontology gazetteering), and also implemented additional components to tackle the disambiguation process. Initial results on a test corpus of 200 news messages fetched from the Yahoo! Business and Technology RSS feeds show fast gazetteering of about 1 second and a precision and recall for concept identification in news items of 86% and 81%, respectively, which is comparable with existing systems. Precision and recall of
fully decorated events result in lower values of approximately 62% and 53%, as they rely on multiple concepts that have to be identified correctly.
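To give an impression of how such a pipeline is put together on top of GATE, the following minimal Java sketch wires up the standard ANNIE resources named above. It is not the SPEED source code: it assumes that GATE and the required ANNIE/Tools plugins have already been initialized and registered, and the custom SPEED components (e.g., the Ontology Gazetteer, Word Sense Disambiguator and event recognition resources) would be added as further processing resources in the same way.

import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

public class SpeedPipelineSketch {
    public static void main(String[] args) throws Exception {
        Gate.init();  // assumes GATE home is configured and the ANNIE/Tools plugins are registered

        SerialAnalyserController pipeline = (SerialAnalyserController)
                Factory.createResource("gate.creole.SerialAnalyserController");

        // Standard resources reused by SPEED
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.tokeniser.DefaultTokeniser"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.splitter.SentenceSplitter"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.POSTagger"));
        pipeline.add((ProcessingResource) Factory.createResource("gate.creole.morph.Morph"));

        // A news item fetched from an RSS feed would be wrapped as a GATE document
        Corpus corpus = Factory.newCorpus("news");
        Document news = Factory.newDocument("Acme Corp. announced the acquisition of Foo Inc.");
        corpus.add(news);

        pipeline.setCorpus(corpus);
        pipeline.execute();

        System.out.println(news.getAnnotations());  // token, sentence, POS and lemma annotations
    }
}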
5 Conclusions and Future Work
In this paper, we have proposed a semantics-based framework for economic event detection: SPEED. The framework aims to extract financial events from news articles (announced through RSS feeds) and to annotate these with meta-data, while maintaining a speed that is high enough to enable real-time use. We discussed the main components of the framework, which introduce some novelties, as they are semantically enabled, i.e., they make use of semantic lexicons and ontologies. Furthermore, pipeline outputs also make use of semantics, which introduces a potential feedback loop, making event identification a more adaptive process. Finally, we briefly touched upon the implementation of the framework and initial test results on the basis of emerging news. The established fast processing time and high precision and recall provide a good basis for future work. The merit of our pipeline is in the use of semantics, enabling broader application interoperability. For future work, we do not only aim to perform thorough testing and evaluation, but also to implement the proposed feedback of newly obtained knowledge (derived from identified events) to the knowledge base. Also, it would be worthwhile to investigate further possibilities for implementation in algorithmic trading environments, as well as a principal way of linking sentiment to discovered events, in order to assign more meaning to these events. Acknowledgments. The authors are partially sponsored by the NWO EW Free Competition project FERNAT: Financial Events Recognition in News for Algorithmic Trading.
References 1. Fellbaum, C.: WordNet an Electronic Lexical Database. Computational Linguistics 25(2), 292–296 (1998) 2. Cunningham, H.: GATE, a General Architecture for Text Engineering. Computers and the Humanities 36(2), 223–254 (2002) 3. Black, W.J., Mc Naught, J., Vasilakopoulos, A., Zervanou, K., Theodoulidis, B., Rinaldi, F.: CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and RElations. Technical Report TR–U4.3.1, Department of Computation, UMIST, Manchester (2005) 4. Popov, B., Kiryakov, A., Ognyanoff, D., Manov, D., Kirilov, A.: KIM - A Semantic Platform For Information Extraction and Retrieval. Journal of Natural Language Engineering 10(3-4), 375–392 (2004) 5. Navigli, R., Velardi, P.: Structural Semantic Interconnections: A Knowledge-Based Approach to Word Sense Disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7), 1075–1086 (2005)
Prediction of Business Process Model Quality Based on Structural Metrics

Laura Sánchez-González¹, Félix García¹, Jan Mendling², Francisco Ruiz¹, and Mario Piattini¹

¹ Alarcos Research Group, TSI Department, University of Castilla La Mancha, Paseo de la Universidad nº 4, 13071 Ciudad Real, España
{Laura.Sanchez,Felix.Garcia,Francisco.RuizG,Mario.Piattini}@uclm.es
² Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin, Germany
jan.mendling@wiwi.hu-berlin.de
Abstract. The quality of business process models is an increasing concern as enterprise-wide modelling initiatives have to rely heavily on non-expert modellers. Quality in this context can be directly related to the actual usage of these process models, in particular to their understandability and modifiability. Since these attributes of a model can only be assessed a posteriori, it is of central importance for quality management to identify significant predictors for them. A variety of structural metrics have recently been proposed, which are tailored to approximate these usage characteristics. In this paper, we address a gap in terms of validation for metrics regarding understandability and modifiability. Our results demonstrate the predictive power of these metrics. These findings have strong implications for the design of modelling guidelines.

Keywords: Business process, measurement, correlation analysis, regression analysis, BPMN.
1 Introduction

Business process models are increasingly used as an aid in various management initiatives, most notably in the documentation of business operations. Such initiatives have grown to an enterprise-wide scale, resulting in several thousand models and involving a significant number of non-expert modellers [1]. This setting creates considerable challenges for the maintenance of these process models, particularly in terms of adequate quality assurance. In this context, quality can be understood as "the totality of features and characteristics of a conceptual model that bear on its ability to satisfy stated or implied needs" [2]. It is well known that poor quality of conceptual models can increase development efforts or result in a software system that does not satisfy user needs [3]. It is therefore vitally important to understand the factors of process model quality and to identify guidelines and mechanisms to guarantee a high level of quality from the outset. An important step towards improved quality assurance is a precise quantification of quality. Recent research into process model metrics pursues this line of argument
by measuring the characteristics of process models. The significance of these metrics relies on a thorough empirical validation of their connection with quality attributes [4]. The most prominent of these attributes are understandability and modifiability, which both belong to the more general concepts of usability and maintainability, respectively [5]. While some research provides evidence for the validity of certain metrics as predictors of understandability, there is, to date, no insight available into the connection between structural process model metrics and modifiability. This observation is in line with a recent systematic literature review that identifies a validation gap in this research area [6].

In accordance with the previously identified issues, the purpose of this paper is to contribute to the maturity of measuring business process models. The aim of the empirical research presented herein is to discover the connections between an extensive set of metrics and the ease with which business process models can be understood (understandability) and modified (modifiability). This was achieved by adapting the measures defined in [7] to BPMN business process models [8]. The empirical data of six experiments which had been defined for previous works were used. A correlation analysis and a regression estimation were applied in order to test the connection between the metrics and both the understandability and modifiability of the models.

The remainder of the paper is structured as follows. In Section 2 we describe the theoretical background of our research and the set of metrics considered. Section 3 describes the series of experiments that were used. Sections 4 and 5 present the results. Finally, Section 6 draws conclusions and presents topics for future research.
2 Structural Metrics for Process Models

In this paper we consider a set of metrics defined in [6] for a series of experiments on process model understanding and modifiability. The hypothetical correlation with understandability and modifiability is annotated in brackets as (+) for positive correlation or (-) for negative correlation. The metrics include the following (a small computational sketch is given after the list):

• Number of nodes (-): number of activities and routing elements in a model;
• Diameter (-): the length of the longest path from a start node to an end node;
• Density (-): ratio of the total number of arcs to the maximum number of arcs;
• Coefficient of Connectivity (-): ratio of the total number of arcs in a process model to its total number of nodes;
• Average Gateway Degree (-): the average number of both incoming and outgoing arcs of the gateway nodes in the process model;
• Maximum Gateway Degree (-): the maximum sum of incoming and outgoing arcs of these gateway nodes;
• Separability (+): the ratio of the number of cut-vertices to the total number of nodes in the process model;
• Sequentiality (+): degree to which the model is constructed out of pure sequences of tasks;
• Depth (-): maximum nesting of structured blocks in a process model;
• Gateway Mismatch (-): the sum of gateway pairs that do not match with each other, e.g. when an AND-split is followed by an OR-join;
• Gateway Heterogeneity (-): degree to which different types of gateways are used in a model;
• Cyclicity (-): relates the number of nodes in a cycle to the sum of all nodes;
• Concurrency (-): captures the maximum number of paths in a process model that may be concurrently activated due to AND-splits and OR-splits.
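To illustrate how such metrics can be obtained from a process graph, the following sketch computes a subset of them with networkx. It is not the measurement tooling used in the experiments; the graph representation, the node typing and the example model are our own assumptions.

# Illustrative sketch only: computes a few of the listed structural metrics on a
# process graph. Node typing ("task"/"gateway") and the example graph are invented.
import networkx as nx

def structural_metrics(g: nx.DiGraph) -> dict:
    n, m = g.number_of_nodes(), g.number_of_edges()
    gateways = [v for v, d in g.nodes(data=True) if d.get("type") == "gateway"]
    return {
        "number_of_nodes": n,
        # Diameter: longest start-to-end path (assumes an acyclic model here).
        "diameter": nx.dag_longest_path_length(g) if nx.is_directed_acyclic_graph(g) else None,
        # Density: arcs relative to the maximum possible number of arcs.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        "coefficient_of_connectivity": m / n if n else 0.0,
        "avg_gateway_degree": (sum(g.in_degree(v) + g.out_degree(v) for v in gateways) / len(gateways)) if gateways else 0.0,
        "max_gateway_degree": max((g.in_degree(v) + g.out_degree(v) for v in gateways), default=0),
        # Separability: cut-vertices of the underlying undirected graph over all nodes.
        "separability": sum(1 for _ in nx.articulation_points(g.to_undirected())) / n if n else 0.0,
    }

g = nx.DiGraph()
g.add_nodes_from(["start", "A", "B", "C", "end"], type="task")
g.add_nodes_from(["split", "join"], type="gateway")
g.add_edges_from([("start", "split"), ("split", "A"), ("split", "B"),
                  ("A", "join"), ("B", "join"), ("join", "C"), ("C", "end")])
print(structural_metrics(g))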
3 Research Design

The empirical analysis performed is composed of six experiments: three to evaluate understandability and three to evaluate modifiability. The experimental material for the first three experiments consisted of 15 BPMN models with different structural complexity. Each model included a questionnaire related to its understandability. The experiments on modifiability included 12 BPMN models related to a particular modification task. A more detailed description of the family of experiments can be found in [9]. It was possible to collect the following objective data for each model and each task: time of understandability or modifiability for each subject, number of correct answers in understandability or modifiability, and efficiency, defined as the number of correct answers divided by time.

Once the values had been obtained, the variability of the values was analyzed to ascertain whether the measures varied sufficiently to be considered in the study. Two measures were excluded, namely Cyclicity and Concurrency, because the results they offered had very little variability (80% of the models had the same value for both measures, and the mean value was near 0, as was their standard deviation). The remaining measures were included in the correlation analysis. The experimental data was accordingly used to test the following null hypotheses:

• For the experiments on understandability, H0,1: there is no correlation between structural metrics and understandability.
• For the experiments on modifiability, H0,2: there is no correlation between structural metrics and modifiability.
The following sub-sections show the results obtained for the correlation and regression analysis of the empirical data.
4 Correlation Analysis

Understandability: Understanding time is strongly correlated with number of nodes, diameter, density, average gateway degree, depth, gateway mismatch, and gateway heterogeneity in all three experiments. There is no significant correlation with the connectivity coefficient, and the separability ratio was only correlated in the first experiment. With regard to correct answers, the size measures number of nodes (-.704, p = .003) and diameter (-.699, .004), as well as gateway heterogeneity (.620, .014), show a significant and strong correlation. With regard to efficiency, we obtained evidence of the correlation of all the measures with the exception of separability.
The correlation analysis results indicate that there is a significant relationship between structural metrics and the time and efficiency of understandability. The results for correct answers are not as conclusive, since only 3 of the 11 analyzed measures show a correlation. We have therefore found evidence to reject the null hypothesis H0,1. The alternative hypothesis suggests that these BPMN elements affect the level of understandability of conceptual models in the following way. It is more difficult to understand models if:

• there are more nodes;
• the path from a start node to the end is longer;
• there are more nodes connected to decision nodes;
• there is higher gateway heterogeneity.
Modifiability: We observed a strong correlation between structural metrics and time and efficiency. For correct answers there is no significant connection in general; there are significant results for diameter, but these are not conclusive, since there is a positive relation in one case and a negative correlation in another. For efficiency we find significant correlations with average (.745, .005) and maximum gateway degree (.763, .004), depth (-.751, .005), gateway mismatch (-.812, .001) and gateway heterogeneity (.853, .000). We have therefore found some evidence to reject the null hypothesis H0,2. The usage of decision nodes in conceptual models apparently implies a significant reduction in efficiency in modifiability tasks. In short, it is more difficult to modify the model if:

• more nodes are connected to decision nodes;
• there is higher gateway heterogeneity.
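The kind of correlation analysis reported in this section can be reproduced with standard statistics libraries. The sketch below is illustrative only: the data are invented, and Spearman's rank correlation is assumed, since the exact coefficient used is not restated here.

# Hypothetical data: one value per model for a structural metric and the
# measured understanding time. All values are invented for illustration.
from scipy.stats import spearmanr

number_of_nodes =      [12, 18, 25, 31, 40, 22, 35, 15, 28, 45, 20, 33, 38, 26, 17]
understanding_time_s = [55, 70, 95, 110, 150, 80, 120, 60, 100, 170, 75, 115, 140, 90, 65]

rho, p_value = spearmanr(number_of_nodes, understanding_time_s)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")
# A significant positive rho (larger models take longer to understand) would
# support the hypothesised negative effect of model size on understandability.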
5 Regression Analysis

The previous correlation analysis suggests that it is necessary to investigate the quantitative impact of structural metrics on the respective dependent variables time, accuracy and efficiency of both understandability and modifiability. This goal was achieved through the statistical estimation of a linear regression. The regression equations were obtained by performing a regression analysis with 80% of the experimental data. The remaining 20% were used for the validation of the regression models. The first step is to select the prediction models with p-values below 0.05. Then, it is necessary to validate the selected models by verifying the distribution and independence of residuals through Kolmogorov-Smirnov and Durbin-Watson tests. Both test values are considered to be satisfactory. The accuracy of the models was studied by using the Mean Magnitude of Relative Error (MMRE) [10] and the prediction levels Pred(0.25) and Pred(0.30) on the remaining 20% of the data, which were not used in the estimation of the regression equation. These levels indicate the percentage of model estimations that do not differ from the observed data by more than 25% and 30%. A model can therefore be considered to be accurate when it satisfies any of the following cases: a) MMRE ≤ 0.25, b) Pred(0.25) ≥ 0.75, or c) Pred(0.30) ≥ 0.70. Table 1 depicts the results.
Table 1. Prediction models of understandability

Exp | Variable | Prediction model | MMRE | Pred(0.25) | Pred(0.30)

Understandability
E3 | Time | 47.04 + 2.46 nº nodes | .32 | .51 | .58
E2 | Correct answers | 3.17 - 0.005 nº nodes - 0.38 coeff. of connectivity + 0.17 depth - 0.015 gateway mismatch | .18 | .79 | .79
E3 | Efficiency | 0.042 - 0.0005 nº nodes + 0.026 sequentiality | 0.84 | .22 | .25

Modifiability
E4 | Time | 50.08 + 3.77 gateway mismatch + 422.95 density | .37 | .31 | .38
E4 | Correct answers | 1.85 - 3.569 density | .23 | .82 | .83
E4 | Efficiency | 0.006 + 0.008 sequentiality | .62 | .32 | .42
Understandability: The best model for predicting the understandability time is obtained with E3, which has the lowest MMRE value of all the models. The best model with which to predict correct understandability answers originates from E2, and this also satisfies all the assumptions. For efficiency, no model was found that satisfied all the assumptions. The model with the lowest value of MMRE is obtained in E3. In general, the results further support the rejection of the null hypothesis H0,1.

Modifiability: We did not obtain any models which satisfy all of the assumptions for the prediction of modifiability time, but we have highlighted the prediction model obtained in E4 since it has the best values. However, the model to predict the number of correct answers may be considered to be a precise model, as it satisfies all the assumptions. The best results for predicting the efficiency of modifiability are also provided by E4, with the lowest value of MMRE. In general, we find some further support for rejecting the null hypothesis H0,2. The best indicators for modifiability are gateway mismatch, density and the sequentiality ratio. Two of these metrics are related to decision nodes. Decision nodes apparently have a negative effect on time and the number of correct answers in modifiability tasks.
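The accuracy criteria used in this section can be computed in a few lines. The following sketch is illustrative only: the observed and predicted values are invented placeholders and do not correspond to the experimental data.

# MMRE and Pred(q) for a hold-out set, as defined above.
def mmre(observed, predicted):
    return sum(abs(o - p) / o for o, p in zip(observed, predicted)) / len(observed)

def pred(observed, predicted, q=0.25):
    within = sum(1 for o, p in zip(observed, predicted) if abs(o - p) / o <= q)
    return within / len(observed)

observed  = [60.0, 85.0, 110.0, 95.0, 140.0]    # invented measurements
predicted = [72.0, 80.0, 101.0, 118.0, 150.0]   # invented regression estimates

print("MMRE      :", round(mmre(observed, predicted), 2))
print("Pred(0.25):", round(pred(observed, predicted, 0.25), 2))
print("Pred(0.30):", round(pred(observed, predicted, 0.30), 2))
# A model counts as accurate if MMRE <= 0.25, Pred(0.25) >= 0.75, or Pred(0.30) >= 0.70.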
6 Conclusions and Future Work

In this paper we have investigated structural metrics and their connection with the quality of business process models, namely understandability and modifiability. The statistical analyses suggest rejecting the null hypotheses, since the structural metrics appear to be closely connected with understandability and modifiability. For understandability these include Number of Nodes, Gateway Mismatch, Depth, Coefficient of Connectivity and Sequentiality. For modifiability, Gateway Mismatch, Density and Sequentiality showed the best results. The regression analysis also provides us with some hints with regard to the interplay of different metrics. Some metrics were therefore not investigated in greater depth owing to their correlations with other metrics.
Our findings demonstrate the potential of these metrics to serve as validated predictors of process model quality. A limitation of the experimental data concerns the nature of the subjects, which implies that the results are particularly relevant to non-expert modellers. This research contributes to the area of process model measurement and its still limited degree of empirical validation. This work has implications both for research and practice. The strength of the correlation of structural metrics with different quality aspects (up to 0.85 for gateway heterogeneity with modifiability) clearly shows the potential of these metrics to accurately capture aspects that are closely connected with actual usage. From a practical perspective, these structural metrics can provide valuable guidance for the design of process models, in particular for selecting semantically equivalent alternatives that differ structurally. In future research we aim to contribute to the further validation and actual applicability of process model metrics.

Acknowledgments. This work was partially funded by projects INGENIO (PAC 080154-9262), ALTAMIRA (PII2I09-0106-2463), ESFINGE (TIN2006-15175-C05-05) and PEGASO/MAGO (TIN2009-13718-C02-01).
References

1. Rosemann, M.: Potential pitfalls of process modeling: part A. Business Process Management Journal 12(2), 249–254 (2006)
2. ISO/IEC: ISO Standard 9000-2000: Quality Management Systems: Fundamentals and Vocabulary (2000)
3. Moody, D.: Theoretical and practical issues in evaluating the quality of conceptual models: current state and future directions. Data and Knowledge Engineering 55, 243–276 (2005)
4. Zelkowitz, M., Wallace, D.: Experimental models for validating technology. IEEE Computer, Computing Practices (1998)
5. ISO/IEC: 9126-1, Software engineering – Product quality – Part 1: Quality Model (2001)
6. Sánchez, L., García, F., Ruiz, F., Piattini, M.: Measurement in Business Processes: a Systematic Review. Business Process Management Journal 16(1), 114–134 (2010)
7. Mendling, J.: Metrics for Process Models: Empirical Foundations of Verification, Error Prediction, and Guidelines for Correctness. Springer, Heidelberg (2008)
8. OMG: Business Process Modeling Notation (BPMN), Final Adopted Specification (2006), http://www.omg.org/bpm
9. Experiments URL (2009), http://alarcos.inf-cr.uclm.es/bpmnexperiments/
10. Foss, T., Stensrud, E., Kitchenham, B., Myrtveit, I.: A Simulation Study of the Model Evaluation Criterion MMRE. IEEE Transactions on Software Engineering 29, 985–995 (2003)
Modelling Functional Requirements in Spatial Design

Mehul Bhatt, Joana Hois, Oliver Kutz, and Frank Dylla

SFB/TR 8 Spatial Cognition, University of Bremen, Germany
Abstract. We demonstrate the manner in which high-level design requirements, e.g., as they correspond to the commonsensical conceptualisation of expert designers, may be formally specified within practical information systems, wherein heterogeneous perspectives and conceptual commitments are needed. Focussing on semantics, modularity and consistency, we argue that our formalisation serves as a synergistic interface that mediates between the two disconnected domains of human abstracted qualitative/conceptual knowledge and its quantitative/precision-oriented counterpart within systems for spatial design (assistance). Our demonstration utilises simple, yet real world examples.
1 Conceptual Modelling for Spatial Design
This paper investigates the role of ontological formalisation as a basis for modelling high-level conceptual requirement constraints within spatial design. We demonstrate the manner in which high-level functional requirements, e.g., as they correspond to the commonsensical conceptualisation of expert designers, may be formally specified within practical information systems. Here, heterogeneous perspectives and conceptual commitments are needed for capturing the complex semantics of spatial designs and artefacts contained therein. A key aspect of our modelling approach is the use of formal qualitative spatial calculi and conceptual design requirements as a link between the structural form of a design and the differing functional capabilities that it affords or leads to. In this paper, we focus on the representational modalities that pertain to ontological modelling of structural forms from different perspectives: human / designer conceptualisations and qualitative spatial abstractions suited to spatial reasoning, and geometric primitives as they are applicable to practical information systems for computer-aided design (CAD) in general, and computer-aided architecture design (CAAD) in particular. Our modelling is focussed on semantics, modularity and functional requirement consistency, as elaborated on in the following:

Semantics. The expert's design conceptualisation is semantic and qualitative in nature—it involves abstract categories such as Rooms, Doors, Motion Sensors and the spatial (topological, directional, etc.) relationships among them, e.g., 'Room A and Room B have a Door in Between, which is monitored by Camera C'. Professional design tools lack the ability to exploit such design expertise that
a designer is equipped with, but unable to communicate to the design tool explicitly in a manner consistent with its inherent human-centred conceptualisation, i.e., semantically and qualitatively.

Modular and Multi-dimensional Representation. An abstraction such as a Room or Sensor may be identified semantically by its placement within an ontological hierarchy and its relationships with other conceptual categories. This is what a designer must deal with during the initial design conceptualisation phase. However, when these notions are transferred to a CAD design tool, the same concepts acquire a new perspective, i.e., now the designer must deal with points, line-segments, polygons and other geometric primitives. Within contemporary design tools, there is no way for a knowledge-based system to make inferences about the conceptual design and its geometric interpretation within a CAD model in a unified manner.

Functional Requirements. A crucial aspect that is missing in contemporary design tools is the support to explicitly characterise the functional requirements of a design. For instance, it is not possible to model spatial artefacts such as the range space of a sensory device (e.g., camera, motion sensor), which is not strictly a spatial entity in the form of having a material existence, but needs to be treated as such nevertheless. For instance, consider the following constraint: 'the motion-sensor should be placed such that the door connecting room A and room B is always within the sensor's range space'. The capability to model such a constraint is absent from even the most state-of-the-art design tools.

Organisation of paper. We present an overview of our conceptual modelling approach and the manner in which our formalisation serves as a synergistic interface that mediates between the two disconnected domains of human abstracted qualitative/conceptual knowledge and its quantitative/precision-oriented counterpart within practical information systems. Section 2 presents the concept of ontological modularity and its use for modelling multi-perspective design requirements using a spatial ontology. Section 3 details some key aspects of our spatial ontology and requirements modelled therein. Finally, Section 4 concludes.
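To make the range-space constraint above concrete, the following sketch checks geometrically whether a door lies within a motion sensor's range space. It is purely illustrative: the use of shapely, the coordinates and the circular range model are our own assumptions and not part of the framework described here.

# Hypothetical geometric check: is the door fully within the sensor's range space?
from shapely.geometry import Point, Polygon

# Door between room A and room B, as a small rectangular footprint (invented).
door = Polygon([(4.0, 2.0), (4.2, 2.0), (4.2, 3.0), (4.0, 3.0)])

# Range space of a motion sensor: a circular region around its mounting point.
sensor_position = Point(5.0, 2.5)
range_space = sensor_position.buffer(2.0)   # assumed 2 m sensing radius

if range_space.contains(door):
    print("Requirement satisfied: the door is within the sensor's range space.")
else:
    print("Requirement violated: reposition the sensor or adjust its range.")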
2 Multi-perspective Representation and Modularity
Modularity has become one of the key issues in ontology engineering, covering a wide spectrum of aspects (see [11]). The main research question is how to define the notion of a module and how to re-use such modules.

2.1 Ontological Modules
The architectural design process defines constraints of architectural entities that are primarily given by spatial types of information. Space is particularly defined from a conceptual, qualitative, and quantitative perspective. The three ontological modules are briefly discussed in the following:
Fig. 1. Multi-Dimensional Representation: the conceptual module M1 (DOLCE, Physical Object), the qualitative module M2 (RCC-8 Relations, Building Architecture) and the quantitative module M3 (IFC data model, Building Construction), coupled through an integrated representation and axioms based on E-connections, on top of which task-specific requirements (building automation, access restriction, ...) are formulated.
M1 – Conceptual Module. This ontological module reflects the most general, or abstract, terminological information concerning architectural entities: they are conceptualised according to their essential properties, i.e., without taking into account the possible contexts into which they might be put. The ontology Physical Object categorises these entities with respect to their attributes, dependencies, functional characteristics, etc. It is based on DOLCE [9].

M2 – Qualitative Module. This module reflects qualitative spatial information of architectural entities. It specifies the architectural entities based on their region-related spatial characteristics (we concentrate on region-based spatial relations, as they are most suitable for our architectural design examples; other spatial relations, e.g., for distances, shapes, or orientations, may be applied as well). In particular, the ontology uses relations as provided by the RCC-8 fragment of the Region Connection Calculus (RCC) [10]. Here, we reuse an RCC ontology that has been introduced in [6], which defines the taxonomy for RCC-8 relations.

M3 – Quantitative Module. This ontological module reflects metrical and geometric information of architectural entities, i.e., their polygon-based characteristics in the floor plan. It is closely related to an industrial standard for data representation and interchange in the architectural domain, namely the Industry Foundation Classes (IFC) [5]. This quantitative module specifies those entities of the architectural domain that are necessary to describe structural aspects of environments. In particular, information that is available in construction plans of buildings is described here.

2.2 E-Connecting Multiple Perspectives
The main aspects of modularity are: syntactic and logical heterogeneity, notions of module, distributed semantics and modular reasoning. Here, we restrict ourselves to describing our use of E-connections for multi-perspective modelling of spatial or architectural design. In order to model spatial design scenarios, we need to be able to cover rather disparate aspects of 'objects' on various conceptual (and spatial) levels. E-connections allow a formal representation of different views on the same domain together with a loose coupling of such views by means of axiomatic constraints employing so-called 'link-relations' to formally realise the coupling.
Fig. 2. Concrete Interpretations in R2 (floor plan with rooms, walls, doors, windows, columns and sensors, annotated with operational spaces (ops), functional spaces (fs) and a sensor's range space)

Fig. 3. Spatial Artefact Groundings
Specifically, in E-connections, an 'abstract' object o of some description logic (DL) can, e.g., be related via a relation E to its spatial extension in a logic such as RCC-8 (i.e. a (regular-closed) set of points in a topological space), by a relation T to its life-span in a temporal logic (i.e. an interval of time points), or by a relation S to another conceptual view (i.e. the concept of all rooms object o may be found in). Essentially, the language of an E-connection is the (disjoint) union of the original languages enriched with operators capable of talking about the link relations (see [8] for technical details). The connection of the three modules (M1–M3) is formalised by axiomatising the used link relations. The Integrated Representation defines couplings between classes from different modules. An overall integration of these thematic modules is achieved by E-connecting the aligned vocabulary along newly introduced link relations and appropriate linking axioms. Based on this Integrated Representation, the module for task-specific requirements specifies additional definitions and constraints to the architectural information available in the modules (M1–M3). It formulates requirements that describe certain functions that a specific design, e.g. a concrete floor plan, has to satisfy. They can codify building regulations that a work-in-progress design generally must meet, as explained next.
3 Functional Requirement Constraints in Architecture

Semantic descriptions of designs and their requirements acquire real significance when the spatial and functional constraints are among strictly spatial entities as well as abstract spatial artefacts. This is because although spatial artefacts may not be physically extended within a design, they need to be treated in a real physical sense nevertheless. In general, architectural working designs only contain physical entities. Therefore, it becomes impossible for a designer to model constraints involving spatial artefacts at the design level. In [1], we identified three important types of spatial artefacts (this list is not assumed to be exhaustive):

A1. the operational space denotes the region of space that an object requires to perform its intrinsic function that characterises its utility or purpose
A2. the functional space of an object denotes the region of space within which an agent must be located to manipulate or physically interact with a given object
A3. the range space denotes the region of space that lies within the scope of a sensory device such as a motion or temperature sensor

Fig. 4. Killer Doors

Fig. 5. Building code for doors and upward stairs—Landesbauordnung Bremen §35(10)
Fig. 2 provides a detailed view of the different kinds of spaces we introduced. From a geometrical viewpoint, all artefacts refer to a conceptualised and derived physical spatial extension in Rn. The derivation of an interpretation may depend on an object's inherent spatial characteristics (e.g., size and shape), as well as additional parameters referring to mobility, transparency, etc. We utilise the spatial artefacts introduced in (A1–A3) towards formulating functional requirement constraints for a work-in-progress spatial design. Constraints such as (C1–C2) may need to be satisfied by a design:

C1. Steps of a staircase may not be connected directly to a door that opens in the direction of the steps. There has to be a landing between the staircase steps and the door. The length of this landing has to have at least the size of the door width. ("Bremen building code"/Landesbauordnung Bremen §35 (10))
C2. People should not be harmed by doors opening up. In general, the operation of a door should be non-interfering with the function / operation of surrounding objects.
Constraints such as (C1–C2) involve semantic characterisations and spatial relationships among strictly spatial entities as well as other spatial artefacts. In Fig. 5 we depict a consistent and an inconsistent design regarding this requirement. This official regulation can be modelled in the integrated representation by using the link relations introduced in Section 2. The regulation is specified by the ontological constraint that no operational space of a door is allowed to overlap with the steps of a staircase:

Class:      m2:DoorOperationalSpace
SubClassOf: m2:OperationalSpace,
            inv(ir:compose) exactly 1 m3:Door,
            not (rcc:overlaps some (inv(ir:compose) some m3:StaircaseSteps))
In this example, the different modules are closely connected with each other. In detail, categories in the qualitative module M2, namely DoorOperationalSpace,
which is a subclass of OperationalSpace, are related to entities in the quantitative module M3, namely Door and StaircaseSteps, by the link relations given in the integrated representation module, namely compose.
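A purely geometric counterpart of this constraint can be sketched as follows. The snippet is illustrative only: the polygons are invented, a door's operational space is crudely approximated by a square in front of the hinge, and a plain intersection test stands in for the RCC-8 overlaps relation; it is independent of the OWL and E-connection machinery.

# Geometric sketch of constraint C1: the operational space of a door must not
# overlap staircase steps. All geometry below is invented example data.
from shapely.geometry import Polygon

def door_operational_space(hinge_x, hinge_y, width):
    # Approximate the sweep of the door leaf by a square in front of the hinge.
    return Polygon([(hinge_x, hinge_y), (hinge_x + width, hinge_y),
                    (hinge_x + width, hinge_y + width), (hinge_x, hinge_y + width)])

doors = {"Door1": door_operational_space(0.0, 0.0, 1.0),
         "Door2": door_operational_space(5.0, 0.0, 1.0)}
staircase_steps = Polygon([(0.5, 0.5), (2.0, 0.5), (2.0, 2.0), (0.5, 2.0)])

for name, op_space in doors.items():
    if op_space.intersects(staircase_steps):
        print(f"{name}: inconsistent (operational space overlaps the staircase steps)")
    else:
        print(f"{name}: consistent")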
4 Conclusion
The work described in this paper is part of an initiative that aims at developing the representation and reasoning methodology [2] and practically usable tools [4] for intelligent assistance in spatial design tasks. We have provided an overview of the overall approach to encoding design semantics within an architectural assistance system. High-level conceptual modelling of requirements, and the need to incorporate modular specifications therein, were the main topics covered in the paper. Owing to space restrictions, we could only provide a glimpse of the representation; details of the formal framework and ongoing work may be found in [3, 7].

Acknowledgements. We acknowledge the financial support of the DFG through the Collaborative Research Center SFB/TR 8 Spatial Cognition. Mehul Bhatt also acknowledges funding by the Alexander von Humboldt Stiftung, Germany. Participating projects in this paper include: [DesignSpace], I1-[OntoSpace] and R3-[Q-Shape]. We thank Graphisoft – http://www.graphisoft.com/ – for providing licenses for the design tool ArchiCAD v13 2010.
References

[1] Bhatt, M., Dylla, F., Hois, J.: Spatio-terminological inference for the design of ambient environments. In: Hornsby, K.S., Claramunt, C., Denis, M., Ligozat, G. (eds.) COSIT 2009. LNCS, vol. 5756, pp. 371–391. Springer, Heidelberg (2009)
[2] Bhatt, M., Freksa, C.: Spatial computing for design: An artificial intelligence perspective. In: NSF International Workshop on Studying Visual and Spatial Reasoning for Design Creativity, SDC 2010 (to appear, 2010), http://www.cosy.informatik.uni-bremen.de/staff/bhatt/seer/Bhatt-Freksa-SDC-10.pdf
[3] Bhatt, M., Hois, J., Kutz, O.: Modelling Form and Function in Architectural Design. Submitted to a journal (2010), http://www.cosy.informatik.uni-bremen.de/staff/bhatt/seer/form-function.pdf
[4] Bhatt, M., Ichim, A., Flanagan, G.: DSim: A Tool for Assisted Spatial Design. In: Proceedings of the 4th International Conference on Design Computing and Cognition, DCC 2010 (2010)
[5] Froese, T., Fischer, M., Grobler, F., Ritzenthaler, J., Yu, K., Sutherland, S., Staub, S., Akinci, B., Akbas, R., Koo, B., Barron, A., Kunz, J.: Industry Foundation Classes for Project Management—A Trial Implementation. ITCon 4, 17–36 (1999), http://www.ifcwiki.org/
[6] Grütter, R., Scharrenbach, T., Bauer-Messmer, B.: Improving an RCC-Derived Geospatial Approximation by OWL Axioms. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 293–306. Springer, Heidelberg (2008)
[7] Hois, J., Bhatt, M., Kutz, O.: Modular Ontologies for Architectural Design. In: Proc. of the 4th Workshop on Formal Ontologies Meet Industry, FOMI 2009, Vicenza, Italy. Frontiers in Artificial Intelligence and Applications, vol. 198, pp. 66–77. IOS Press, Amsterdam (2009)
[8] Kutz, O., Lutz, C., Wolter, F., Zakharyaschev, M.: E-Connections of Abstract Description Systems. Artificial Intelligence 156(1), 1–73 (2004)
[9] Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A.: WonderWeb Deliverable D18: Ontology Library. Technical report, ISTC-CNR (2003)
[10] Randell, D.A., Cui, Z., Cohn, A.G.: A spatial logic based on regions and connection. In: Proc. of KR 1992, pp. 165–176 (1992)
[11] Stuckenschmidt, H., Parent, C., Spaccapietra, S. (eds.): Modular Ontologies. LNCS, vol. 5445. Springer, Heidelberg (2009)
Business Processes Contextualisation via Context Analysis

Jose Luis de la Vara¹, Raian Ali², Fabiano Dalpiaz², Juan Sánchez¹, and Paolo Giorgini²

¹ Centro de Investigación en Métodos de Producción de Software, Universidad Politécnica de Valencia, Spain
{jdelavara,jsanchez}@pros.upv.es
² Department of Information Engineering and Computer Science, University of Trento, Italy
{raian.ali,fabiano.dalpiaz,paolo.giorgini}@disi.unitn.it
Abstract. Context-awareness has emerged as a new perspective for business process modelling. Even though some works have studied it, many challenges have not been addressed yet. There is a clear need for approaches that (i) facilitate the identification of the context properties that influence a business process and (ii) provide guidance for correct modelling of contextualised business processes. This paper addresses this need by defining an approach for business process contextualisation via context analysis, a technique that supports reasoning about context and discovery of its relevant properties. The approach facilitates adequate specification of context variants and of business process execution for them. As a result, we obtain business processes that fit their context and are correct.

Keywords: business process modelling, context-awareness, business process contextualisation, context analysis, correctness of business process models.
1 Introduction

Traditional approaches for business process modelling have not paid much attention to the dynamism of the environment of a business process. However, business processes are executed in an environment in which changes are usual, and modelling perspectives that aim to represent and understand them are necessary. Context-awareness has recently appeared as a new perspective for business process modelling to meet this need [3]. It is expected to improve business process modelling by explicitly addressing fitness between business processes and their context. The context of a business process is the set of environmental properties that affect business process execution. Therefore, these properties should be taken into account when designing a business process. If context is analysed when modelling a business process, then identification of all its variants (relevant states of the world in which the business process is executed) and definition of how the business process should be executed in them are facilitated.
Some works have contributed to the advance of context-aware business process modelling by addressing issues such as context-aware workflows [4], general principles (e.g. [3]) and modelling of context effect (e.g. [2]). However, research on this topic is still at an initial stage and many challenges have not been addressed yet. This paper aims to advance research on context-aware business process modelling by dealing with two of these challenges: 1) provision of techniques for determining the relevant context properties that influence a business process, and 2) provision of mechanisms and guidance for correct business process contextualisation.

The objectives of the paper are to determine how business process context can be analysed, how it can influence business processes, how to create contextualised business process models, and how to guarantee their correctness. These objectives are achieved by defining an approach for business process contextualisation via context analysis [1], which is a technique that aims to support reasoning about context and discovery of contextual information to observe. Context analysis is adapted in the paper for analysis of business process context. The approach provides mechanisms and guidance that can help process designers to reason about business process context and to model business processes that fit their context and are correct. Context properties and variants are analysed in order to determine how they influence a business process, to guarantee that a business process is properly executed in all its context variants, and to correctly model contextualised business processes. The next sections present the approach and our conclusions, respectively.
2 Approach Description

The approach consists of four stages (Fig. 1): modelling of initial business process, analysis of business process context, analysis of context variants and modelling of contextualised business process. First, an initial version of the business process that needs to fit its context is modelled. Next, the remaining stages are carried out as long as relevant context variations (changes) are found that are not yet represented in the business process model. Relevant context variations influence the business process and imply that business process execution has to change. If a context variation is found, then business process context is analysed to find the context properties that allow process participants to know if a context variant holds. A context analysis model is created, and context variants of the business process are then analysed. Finally, a contextualised business process model is created on the basis of the final context variants and their effect on the business process.
Fig. 1. Business process contextualisation
Fig. 2. Initial business process model
As a running example, product promotion in a department store is used (Fig. 2). The business process has been modelled with BPMN, and it does not reflect context variations such as the fact that customers do not like being addressed if they are in a hurry. The paper focuses on contextualisation of the task "Find potential buyer".

2.1 Analysis of Business Process Context

Business process context is analysed in the second stage of the approach. This stage aims to understand context, to reason about it and to discover the context properties that influence a business process. For these purposes, context analysis (which has been presented in the requirements engineering field) has been adapted for analysis of business process context. Further details about context analysis can be found in [1].

Context is specified as a formula of world predicates, which can be combined conjunctively and disjunctively. World predicates can be facts (they can be verified by a process participant) or statements (they cannot be). The truth value of a statement can be assumed if there is enough evidence to support it. Such evidence comes from another formula of world predicates that holds. A context is judgeable if there exists a formula of facts that supports it, and thus implies it. Identifying a judgeable context can be considered the main purpose of this stage. The facts of the formula correspond to the context properties that characterise the context and its variants, and their truth values influence business process execution. A context analysis model (Fig. 3) is created to facilitate reasoning about business process context and discovery of the facts of the formula that implies it.
Fig. 3. Context analysis model
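The distinction between facts and statements can be made concrete with a small sketch. It is purely illustrative: the predicate names and the supporting evidence are invented, and the support formula is treated as a simple conjunction. It reduces a statement to the formula of facts that supports it, which is what makes a context judgeable.

# Illustrative sketch of facts vs. statements in context analysis.
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:               # verifiable by a process participant
    name: str

@dataclass(frozen=True)
class Statement:          # not directly verifiable; needs supporting evidence
    name: str
    support: tuple = ()   # predicates that give evidence (treated as a conjunction)

def fact_formula(predicate):
    """Recursively replace statements by their supporting predicates.
    Returns None if some statement has no support (context not judgeable)."""
    if isinstance(predicate, Fact):
        return [predicate]
    if not predicate.support:
        return None
    facts = []
    for p in predicate.support:
        sub = fact_formula(p)
        if sub is None:
            return None
        facts.extend(sub)
    return facts

wants_attention = Statement("customer is open to being addressed",
                            support=(Fact("customer is not walking fast"),
                                     Fact("customer looks at the promoted product")))
print([f.name for f in fact_formula(wants_attention)])   # judgeable: a formula of facts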
2.2 Analysis of Context Variants

The main purposes of this stage are to adequately define the (final) context variants of a business process and to ensure that they allow correct business process contextualisation. A context variant corresponds to a set of facts whose conjunction implies a context. Fig. 4 shows the eleven initial context variants for C1, which is analysed in Fig. 1.

Initial context variants: {F1}; {F2}; {F3, F4, F5, F9}; {F3, F4, F5, F10}; {F3, F4, F5, F11}; {F3, F5, F6, F9}; {F3, F5, F6, F10}; {F3, F5, F6, F11}; {F3, F5, F7, F8, F9}; {F3, F5, F7, F8, F10}; {F3, F5, F7, F8, F11}

Final context variants: CV1: {F1}; CV2: {F2}; CV3: {F3 → (F4, F5, F9)}; CV4: {F3 → (F4, F5, F10)}; CV7: {F4, F11 → F5}; CV5: {F6 → F3 → (F5, F9)}; CV6: {F6 → F3 → (F5, F10)}; CV8: {F6 → F11 → F5}; CV9: {F3 → (F7, F8, F9)}; CV10: {F7, F11 → F8}

Fig. 4. Context variants
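Since a final context variant is a conjunction of facts, checking which variants hold for an observed state of the world is a set-inclusion test. The sketch below uses the final variants of Fig. 4; the truth assignment of the facts is invented, and the prescribed order of fact verification is ignored for simplicity.

# Which final context variants (Fig. 4) hold for a given truth assignment?
final_variants = {
    "CV1": {"F1"}, "CV2": {"F2"},
    "CV3": {"F3", "F4", "F5", "F9"},  "CV4": {"F3", "F4", "F5", "F10"},
    "CV5": {"F6", "F3", "F5", "F9"},  "CV6": {"F6", "F3", "F5", "F10"},
    "CV7": {"F4", "F11", "F5"},       "CV8": {"F6", "F11", "F5"},
    "CV9": {"F3", "F7", "F8", "F9"},  "CV10": {"F7", "F11", "F8"},
}
true_facts = {"F3", "F4", "F5", "F9"}          # hypothetical observation

holding = [cv for cv, facts in final_variants.items() if facts <= true_facts]
print(holding)                                  # -> ['CV3']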
Correctness of business processes is usually related to their soundness [5]. For business process executions that are defined from context variants, two situations can impede soundness of a contextualised business process. The first one is that a context variant contains conflicting facts. The second situation is to follow a sequence of fact verifications that will not allow a business process instance to be finished. These situations are avoided by analysing the context variants. For this purpose, a table is created to specify the relationships between facts. The table also aims to obtain context variants whose sets of facts are the minimum ones. An example is shown in Table 1, which specifies the relationships between the facts of the initial context variants of Fig. 4. The relationships are specified as follows. Given a pair of facts Fr (fact of a row) and Fc (fact of a column), their relationship can be:

• 'X' (no context variant contains Fr and Fc together);
• 'Pr' (Fr verification will precede Fc verification); 'Pc' (opposite to 'Pr');
• 'Kr' (Fr truth value will be known before Fc verification); 'Kc' (opposite to 'Kr');
• 'Ur' (Fr is always true when Fc is true, thus Fr verification will be unnecessary when Fc is true); 'Uc' (opposite to 'Ur');
• 'C' (Fr and Fc are conflicting);
• '-' (no relationship exists).

Finally, context variants are refined by specifying the sequence of fact verification ('→') and removing conflicting variants and unnecessary facts (Fig. 4).

2.3 Modelling of Contextualised Business Process

A contextualised business process is modelled on the basis of its final context variants. The first step is determination of the tasks that will be part of the business process. They can correspond to: 1) tasks of the initial business process model that are not influenced by context; 2) tasks that are defined from refinement of the tasks of the initial business process model (e.g. "Address customer" refines "Find potential buyer"), and 3) tasks that make facts true (e.g. "Approach customer" makes F3 true).
Table 1. Relationships between facts (triangular matrix of the pairwise relationships among facts F1–F11, using the codes 'X', 'Pr', 'Pc', 'Kr', 'Kc', 'Ur', 'Uc', 'C' and '-' defined above)
Table 2. Relationships between tasks and facts

                      | F1 | F2 | F3 | F4  | F5  | F6  | F7  | F8  | F9  | F10 | F11
T1: Approach customer | U  | U  | M  | -   | -   | -   | -   | -   | -   | -   | U
T2: Address customer  | U  | U  | Sc | Sc1 | Sc1 | Sc1 | Sc1 | Sc1 | Sc1 | Sc1 | U
If a task of the latter type is executed when a given fact is false, then the fact becomes true. These facts are called manageable. Once tasks are determined, a table is created to specify their relationships with the facts of the final context variants. An example is shown in Table 2. The relationships are specified as follows. Given a fact F, a set of facts φ and a task T, their relationship can be: 'M' (T allows F to be manageable); 'U' (T execution will be unnecessary if F is true); 'Sc' (T execution will succeed F verification); 'ScX' (where 'X' is a number; T execution will succeed verification of the facts of φ); '-' (no relationship exists).

CE1: F1
CE2: F2
CE3: (F3 | T1) → (F4, F5, F9) → T2
CE4: (F3 | T1) → (F4, F5, F10) → T2
CE5: F6 → (F3 | T1) → (F5, F9) → T2
CE6: F6 → (F3 | T1) → (F5, F10) → T2
CE7: F4, F11 → F5
CE8: F6 → F11 → F5
CE9: (F3 | T1) → (F7, F8, F9) → T2
CE10: F7, F11 → F8

Fig. 5. Contextualised executions
The next step for modelling of a contextualised business process is specification of its contextualised executions (Fig. 5). A contextualised execution is a set of fact verifications and task executions that specifies a correct execution of a business process or of a fragment of a business process for a context variant. Contextualised executions are specified by extending the final context variants of a business process with the execution sequence of its tasks (‘Æ’). The manageable facts and their associated tasks are put in brackets and the symbol ‘|’ is put between them: either the fact is true or the task has to be executed. Finally, a contextualised business process model is created on the basis of the constraints (fact verification and task execution sequences) that the contextualised executions impose. BPMN has been extended by labelling its sequence flows for specification of formulas that have to hold so that a sequence flow is executed. Fact and formula verification is represented by means of gateways. Fig. 6 shows the effect of contextualisation of the task “Find potential buyer” for the running example.
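The derivation of a contextualised execution from a final context variant can be sketched as follows. The snippet is a simplification that mirrors CE3 of Fig. 5 only: it hard-codes the manageable fact F3 with its task T1 and the succeeding task T2, and it is not the authors' modelling procedure.

# Simplified derivation of a contextualised execution (mirrors CE3 in Fig. 5).
manageable = {"F3": "T1"}         # T1 ("Approach customer") makes F3 true
succeeding_task = "T2"            # T2 ("Address customer") follows the fact checks

def contextualised_execution(ordered_facts):
    steps = []
    for fact in ordered_facts:
        if fact in manageable:
            steps.append(f"({fact} | {manageable[fact]})")   # fact true or task executed
        else:
            steps.append(fact)
    steps.append(succeeding_task)
    return " -> ".join(steps)

# CV3 with its facts verified in the order F3, F4, F5, F9 (grouping ignored here):
print(contextualised_execution(["F3", "F4", "F5", "F9"]))
# -> (F3 | T1) -> F4 -> F5 -> F9 -> T2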
Fig. 6. Contextualised business process model
3 Conclusions and Future Work

This paper has addressed several challenges of context-aware business process modelling in order to allow research on it to further advance. As a result, an approach for business process contextualisation has been presented. The approach adapts context analysis for analysis of business process context, and provides mechanisms and guidance for analysis of business process context and its variants and for modelling of contextualised business processes. It facilitates discovery and adequate specification of relevant context properties in the form of facts, as well as of the relationships between facts and between facts and tasks of a contextualised business process. These relationships affect business process execution. Furthermore, the mechanisms and guidance can guarantee that a contextualised business process fits its context and is sound. As future work, we have to address approach automation and formal evaluation.

Acknowledgements. This work has been developed with the support of the Spanish Government under the projects SESAMO TIN2007-62894 and HI2008-0190 and the program FPU AP2006-02324, partially funded by the EU Commission through the projects COMPAS, NESSOS and ANIKETOS, and co-financed by FEDER. The authors would also like to thank Amit K. Chopra for his useful comments.
References

1. Ali, R., Dalpiaz, F., Giorgini, P.: A Goal-based Framework for Contextual Requirements Modeling and Analysis. Requirements Engineering Journal (to appear, 2010)
2. Hallerbach, A., Bauer, T., Reichert, M.: Capturing Variability in Business Process Models: The Provop Approach. Journal of Software Maintenance and Evolution (to appear, 2010)
3. Rosemann, M., Recker, J., Flender, C.: Contextualisation of business processes. International Journal of Business Process Integration and Management 3(1), 47–60 (2008)
4. Smanchat, S., Ling, S., Indrawan, M.: A Survey on Context-Aware Workflow Adaptations. In: MoMM 2008, pp. 414–417 (2008)
5. Weske, M.: Business Process Management. Springer, Heidelberg (2007)
A Generic Perspective Model for the Generation of Business Process Views

Horst Pichler and Johann Eder

Universitaet Klagenfurt
{horst.pichler,johann.eder}@uni-klu.ac.at
Abstract. Overwhelmed by the model size and the diversity of presented information in huge business process models, the stakeholders in the process lifecycle, like analysts, process designers, or software engineers, find it hard to focus on certain details of the process. We present a model along with an architecture that allows capturing arbitrary process perspectives, which can then be used for the generation of process views that contain only relevant details.
1 Introduction
Big business (workflow) process models may contain hundreds of connected activities, augmented with information from diverse perspectives, which are – corresponding to the application domain – required for the presentation of the process's universe of discourse. Accordingly it is hard for various stakeholders in the process lifecycle (e.g., analysts, process managers, software engineers) to get a focus on the areas of interest. This can be accomplished with process views, which are extracts of processes that contain only relevant (selected) activities or aggregations of them. Most process view-related research publications solely focus on control flow issues. They assume that a set of already selected view-relevant control flow elements is given and aim at the generation of process views corresponding to diverse correctness criteria. Their findings are important for the generation of process views, but they do not show how view-relevant control flow elements are selected corresponding to specified characteristics [4,3,2,5]. These characteristics are usually defined as parts of process perspectives (also called aspects), where the most frequently mentioned are: behavior (control flow), function, information (data), organization, and operation. However, this list is neither complete nor fixed and may therefore be arbitrarily modified and extended depending on the application domain or workflow system [6]. Correspondingly most standardization efforts are likely to fail. Furthermore, especially in ERP-systems, complex perspectives cannot always be captured directly in the process model, but through referral to an external resource repository. We aim at the generation of process views for analytical purposes, based on queries which formulate combinations of constraints on diverse perspectives. In the following we present an architecture to import process models and related information from arbitrary sources (workflow systems, process modelling tools,
ERP-systems, etc.) into a predefined generic perspective model, which is then used to formulate user-queries for the generation of process views. Very similar to workflow data warehousing approaches [1] the structure and components of perspectives must suit the queries required to answer relevant questions. Furthermore it must be possible to extract information from various external sources, prepare (e.g., with aggregation operations) and transform it to the target perspective structures, to be loaded into the target perspective model, which can then be queried for further view generation.
2 Architecture Overview
The architecture of our system, as visualized in Figure 1, consists of several components.
Fig. 1. System Architecture
How to use this architecture is indicated by the encircled numbers: (1) an expert specifies the perspective structure of a given model type (like XPDL), which is stored in the perspective database. (2) Then he implements an import interface for every perspective of this model type and an XPDL-transformer for the control flow perspective, and imports all processes (or process model instances, respectively) into the instance database. (3) Now a user can access the system by formulating queries with the query interface, which (4) guides the user with model-specific context-aware information from the perspective database for a selected model type and process model instance. (5) When this specification step is finished, the query engine generates and executes the queries, which results in a list of relevant components. Then a view can be generated, followed by an export as XPDL-document. The white boxes are future components for a complete architecture: a definition tool that helps the expert during the definition phase by scanning external data structures and the import (e.g., an ETL-tool similar to those used
for data warehouse systems). Another component could be a process viewer that is integrated into the architecture, which allows users to obtain additional information on the generated views by accessing the perspective models.
3 Generic Perspective Model
Our generic perspective model allows experts to specify the components and structures of arbitrary process model perspectives, which are then filled with data from process model instances of arbitrary source systems. The UML class diagram of our generic perspective database model, as visualized in Figure 2, basically consists of two levels: (1) the perspective model that defines the components, structure, and relations of each perspective of a specific model type, and (2) the instance model, to store specific instances of process model types (e.g., the process 'claim handling'). In order to be generic the model must not contain any item that adheres to a specific concept, like 'user', 'department', or 'document'. Although we wanted our model to be as generic as possible, we opted against a totally generic approach for the control flow perspective for several reasons and chose the standardized process definition language XPDL in order to connect the components of other perspectives to it.

3.1 Perspective Model
The upper part of Figure 2 shows the model for arbitrary perspectives. The perspective database stores information about perspective components and their structures of any process model type (e.g., WorkParty, YAWL, SAP Business Workflow, BPEL, or any proprietary notation). Such a Model consists of several Perspectives (behavior, organization, etc.), where each perspective is composed of Components (e.g., activity, role, duration). The XPDLComponent is a specialization of a component to represent an XPDL control flow element type, which is required for the connection between the externalized XPDL control flow specification and other perspective components. Multiple inheritance between different components is supported by the association is-a, including attribute inheritance. Every component may have an arbitrary number of Attributes (e.g., a label, a description field, a condition). Due to space limitations we omitted the following attribute-related concepts in the UML class-diagram: attribute types (e.g., integer, decimal, string, boolean, date, time, enum, etc.) and type-dependent value ranges and constraints (e.g., between 1 and 10, enumeration {red, green, blue}).

A relationship between components is realized by the class Relation (e.g., has, takes, connects, etc.), where multiple components can be source or target of such a relation. We differentiate between three different types of relations: (1) An association is a simple relationship between components (e.g., activity 'has' duration). The other types are network-types used to specify directed graphs and tree structures, which means that they also implicitly describe transitive relationships that may be traversed. Again multiple inheritance between different relations is supported by the association is-a.
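A simplified rendering of the perspective-model classes described above can serve as an illustration. The sketch below is hypothetical and omits most details of the actual database model (ids, versions, attribute types, value ranges, the XPDLComponent specialization, etc.).

# Simplified, hypothetical rendering of the perspective-model classes.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Attribute:
    name: str
    description: str = ""

@dataclass
class Component:
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    supertypes: List["Component"] = field(default_factory=list)   # is-a

@dataclass
class Relation:
    name: str
    rel_type: str                      # "association" | "network" | "tree"
    sources: List[Component] = field(default_factory=list)
    targets: List[Component] = field(default_factory=list)

@dataclass
class Perspective:
    name: str
    components: List[Component] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

@dataclass
class Model:
    name: str
    model_type: str                    # e.g. "XPDL"
    perspectives: List[Perspective] = field(default_factory=list)

# Example: an organizational perspective with a 'hierarchy' network relation.
unit = Component("Unit", [Attribute("name")])
organization = Perspective("Organization", [unit],
                           [Relation("hierarchy", "network", [unit], [unit])])
sample_model = Model("SampleProcessModelType", "XPDL", [organization])
print(sample_model.name, [p.name for p in sample_model.perspectives])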
480
H. Pichler and J. Eder
Control Flow Model
Perspective Model
1 belongs to
Model +id +name +type +version -description
belongs to
*
sub
Component +id +name +description
*
1
* super
Relation +id +name +description * +type
* source
*
* target
1
1
1 super *
*
XPDLComponent
consistsOf 1
is-a
belongs to
*
*
Perspective +id +name 1 +description
Control Flow
is-a *
sub
* 1
has
instanceOf
Attribute +id * +name +description 1
InstanceOf
instanceOf
* XDPLComponent Instance -idXPDL
* ComponentInstance -id 1
instanceOf
instanceOf
*
*
RelationInstance -id -isTransitive
target
*
* isPartOf
represented in
* has *
XPDL Document (Control Flow)
AttributeInstance -id -value
*
*
*
source
instanceOf
*
*
ModelInstance 1 -id n a me isPartOf -version -date 1 -description -author
isPartOf
1
control flow representation
Control Flow Instance
Instance Model
Fig. 2. Generic Perspective Database Model
This allows, for instance, the specification of the relations 'connects-regular' and 'connects-exception', which are both sub-classes of a relation 'connects'. Figure 3 visualizes how we mapped the structures of the perspectives of a sample process to our perspective model. In this figure, components – to be stored as instances of the class Component in the generic model – are represented by rectangles, their attributes – to be stored in the class Attribute – are listed below the component name, and the relations between components – to be stored in the class Relation – are represented by directed edges. With the exception of the relation 'hierarchy', all relations are of type 'association'; the relation type 'network' of 'hierarchy' indicates that organizational units may form a hierarchy. Inheritance structures between components are visualized as directed edges (between super and sub) with white arrowheads.
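For readers who prefer code to class diagrams, the following Python sketch renders the perspective-model level of Figure 2 as plain dataclasses; the class and field names follow the figure, but the concrete representation (plain lists, string-typed relation kinds) is an assumption, not the authors' schema.

```python
# Sketch of the perspective-model level of Figure 2 (assumed representation).
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    description: str = ""


@dataclass
class Component:
    name: str
    attributes: list = field(default_factory=list)
    supers: list = field(default_factory=list)        # is-a inheritance


@dataclass
class Relation:
    name: str
    rel_type: str                    # 'association', 'network', or a tree type
    sources: list = field(default_factory=list)
    targets: list = field(default_factory=list)
    supers: list = field(default_factory=list)         # is-a between relations


@dataclass
class Perspective:
    name: str
    components: list = field(default_factory=list)
    relations: list = field(default_factory=list)


@dataclass
class Model:
    name: str                                          # e.g. the model type 'XPDL'
    perspectives: list = field(default_factory=list)


# Example: a fragment of the organizational perspective of the sample process.
participant = Component("Participant", [Attribute("name")])
user = Component("User", supers=[participant])
unit = Component("Unit", supers=[participant])
role = Component("Role", supers=[participant])
hierarchy = Relation("hierarchy", "network", sources=[unit], targets=[unit])
affiliation = Relation("affiliation", "association", sources=[user], targets=[unit])
organization = Perspective("Organization",
                           [participant, user, unit, role],
                           [hierarchy, affiliation])
model = Model("XPDL", [organization])
```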
3.2 Instance Model
The lower part of Figure 2 shows the instance model, which stores the perspective information for specific process model instances. It contains all component instances and their attribute instances (e.g., an activity instance with the name 'a', a unit instance 'DepartmentA', a user instance 'Pichler').
Fig. 3. Mapping the Perspective Model of the Sample Process
Additionally, it contains the relation instances, which connect the component instances (e.g., DepartmentA 'hierarchy' branchSales, Pichler 'has' Clerk). Specifically for the control flow, each XPDLComponentInstance has an idXPDL, which is the original id (or key attribute) of the referenced element in the input file. This information is required when generating views, so that the control flow components in the view can be related to the components in the original process. The left-hand side of Figure 4 visualizes a small part of the organizational instance structure of our process as it is imported from the source system: user Pichler has the role Clerk and is affiliated to a unit DepartmentA, which is a sub-unit of BranchSales. According to the assigned relation, activity 'a' may be executed by anybody within DepartmentA.
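A corresponding instance-level sketch, mirroring the left-hand side of Figure 4, could look as follows; only the example values (Pichler, Clerk, DepartmentA, branchSales, activity 'a') come from the paper, while the dataclass layout and the sample idXPDL value are assumptions.

```python
# Assumed instance-level representation; mirrors the left-hand side of Fig. 4.
from dataclasses import dataclass, field


@dataclass
class ComponentInstance:
    component: str                     # name of the Component it instantiates
    attributes: dict = field(default_factory=dict)
    id_xpdl: str = ""                  # only used for XPDL component instances


@dataclass
class RelationInstance:
    relation: str
    source: ComponentInstance
    target: ComponentInstance
    is_transitive: bool = False


activity_a = ComponentInstance("Activity", {"name": "a"},
                               id_xpdl="a1")   # 'a1' is a hypothetical XPDL id
department_a = ComponentInstance("Unit", {"name": "DepartmentA"})
branch_sales = ComponentInstance("Unit", {"name": "branchSales"})
pichler = ComponentInstance("User", {"name": "Pichler"})
clerk = ComponentInstance("Role", {"name": "Clerk"})

relations = [
    RelationInstance("assigned", activity_a, department_a),
    RelationInstance("affiliation", pichler, department_a),
    RelationInstance("has", pichler, clerk),
    RelationInstance("hierarchy", department_a, branch_sales),
]
```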
Fig. 4. Small Part of the Instance Model's Content for the Sample Process
The right-hand side of Figure 4 shows the closures, which exist according to the inheritance between components and the relations of type 'network'. According to the perspective model, the components User, Unit, and Role inherit from the component Participant, which means that for each of their component instances there also exists a corresponding component instance of type Participant, along with duplicates of the relations to other components. Similarly, as component instances of type Unit are connected by a 'network' relation, all relations connected to a specific unit instance must also connect to its predecessors in a bottom-up fashion; e.g., as user 'Pichler' is affiliated to 'DepartmentA', he is also affiliated to 'BranchSales'.
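One possible way to compute these closures is sketched below; the code is a standalone toy implementation over name triples and is an assumption about the mechanics, not the authors' algorithm (it only propagates relations along the target side of the hierarchy, as in the example).

```python
# Standalone closure sketch (assumed implementation, not the authors' code).
# Relation instances are kept as (relation, source, target) triples of names.
HIERARCHY = "hierarchy"                       # the 'network'-type relation

relations = [
    ("assigned",    "a",           "DepartmentA"),
    ("affiliation", "Pichler",     "DepartmentA"),
    ("has",         "Pichler",     "Clerk"),
    (HIERARCHY,     "DepartmentA", "branchSales"),
]
instances = {"a": "Activity", "DepartmentA": "Unit", "branchSales": "Unit",
             "Pichler": "User", "Clerk": "Role"}
SUPERS = {"User": "Participant", "Unit": "Participant", "Role": "Participant"}


def network_closure(rels):
    """Propagate every non-hierarchy relation to the ancestors of its target:
    Pichler affiliated to DepartmentA is then also affiliated to branchSales."""
    parent = {src: tgt for name, src, tgt in rels if name == HIERARCHY}
    closed = list(rels)
    for name, src, tgt in rels:
        if name == HIERARCHY:
            continue
        while tgt in parent:                  # walk up the unit hierarchy
            tgt = parent[tgt]
            closed.append((name, src, tgt))   # marked as transitive in the model
    return closed


def inheritance_closure(inst):
    """Add a Participant counterpart for every User, Unit, and Role instance."""
    extra = {name + " (as Participant)": SUPERS[kind]
             for name, kind in inst.items() if kind in SUPERS}
    return {**inst, **extra}


print(network_closure(relations))
print(inheritance_closure(instances))
```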
4 Conclusions and Outlook
In this paper we presented a system that captures process models with arbitrary perspective structures in order to generate process views based on user queries. We showed by example how to model these perspectives and how to represent them in our generic perspective model so that process model instances can be imported from external sources. Currently we are working on a context-sensitive, dynamic query interface that helps users define queries for view generation, along with an adaptation of an existing reduction-based view generation technique [5], to be complemented by a BPMN-based visualization component for generated views.
References
1. Bonifati, A., et al.: Warehousing workflow data: Challenges and opportunities. In: Proc. of the 27th International Conference on Very Large Data Bases (VLDB 2001). Morgan Kaufmann, San Francisco (2001)
2. Bobrik, R., Reichert, M.U., Bauer, T.: View-Based Process Visualization. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 88–95. Springer, Heidelberg (2007)
3. Chebbi, I., Dustdar, S., Tata, S.: The view-based approach to dynamic inter-organizational workflow cooperation. Data & Knowledge Engineering 56(2) (2006)
4. Liu, D., Shen, M.: Workflow modeling for virtual processes: an order-preserving process-view approach. Information Systems 28(6) (2003)
5. Tahamtan, N.A.: Modeling and Verification of Web Service Composition Based Interorganizational Workflows. Ph.D. Thesis, University of Vienna (2009)
6. zur Muehlen, M.: Workflow-based Process Controlling. Foundation, Design, and Application of Workflow-driven Process Information Systems. Logos (2004) ISBN 978-3-8325-0388-8
Extending Organizational Modeling with Business Services Concepts: An Overview of the Proposed Architecture*
Hugo Estrada1, Alicia Martínez1, Oscar Pastor2, John Mylopoulos3, and Paolo Giorgini3
1 CENIDET, Cuernavaca, Mor., México {hestrada,amartinez}@cenidet.edu.mx
2 Technical University of Valencia, Spain opastor@dsic.upv.es
3 University of Trento, Italy {jm,paolo.giorgini}@dit.unitn.it
Abstract. Nowadays, there is wide consensus on the importance of organizational modelling in the definition of software systems that correctly address business needs. Accordingly, there exist many modelling techniques that capture business semantics from different perspectives: transactional, goal-oriented, aspect-oriented, value-oriented, etc. However, none of these proposals accounts for the service nature of most business organizations, nor for the growing importance of service orientation in computing. In this paper, an overview of a new business service-oriented modeling approach, which extends the i* framework, is presented as a solution to this problem. The proposed modeling approach enables analysts to represent an organizational model as a composition of business services, which are the basic building blocks that encapsulate a set of business process models. In these models the actors participate in actor dependency networks through interfaces defined in the business service specification.
Keywords: Organizational modeling, Business Services.
1 Introduction
Nowadays, there exists a great variety of business modelling techniques in academia and industry alike. Many of these include modelling primitives and abstraction mechanisms intended to capture the semantics of business organizations from a specific viewpoint: process-oriented, goal-oriented, aspect-oriented, value-oriented, etc. However, none of them supports primitives that capture the service orientation of most business organizations, nor do they account for the growing importance of service orientation within IT. In this context, additional modelling and programming efforts are needed to adapt the organizational concepts (from a specific view of the enterprise) to service-oriented software systems.
* This research has been partially supported by DGEST Project #24.25.09-P/2009.
The objective of this work is to reduce the mismatch between organizational descriptions and service-oriented specifications by using the concept of service to model an enterprise. Therefore, the main contribution of this paper is to present an overview of a new modeling and methodological approach that addresses the enterprise modeling activity using business services as building blocks for encapsulating organizational behaviors. In order to integrate the service-oriented approach into a well-known and well-founded business modeling technique, extensions to the i* framework [1] were proposed as initial work of this research [2]. The paper is structured as follows: Section 2 presents related work on the use of services at the organizational level. In Section 3 the proposed business service architecture is given, and finally, Section 4 presents the conclusions of this work.
2 Related Works in Services at the Organizational Level
The use of services at the organizational level is one of the emerging research fields in service-oriented modeling. The focus of this phase is the definition of the services that are offered by an enterprise. In the following, we present the relevant works in this area. One of the few existing proposals is the On Demand Business Service Architecture [3]. In this proposal, the authors explore the impact of service orientation at the business level. The services represent functionalities offered by the enterprise to its customers, and the proposal considers the definition of complex services composed of low-level services. One of the contributions of the work of Cherbakov et al. is that services are represented from the customer's point of view. One of its main weaknesses is the lack of mechanisms to model the complex internal behavior needed to satisfy the business services. The services are represented as "black boxes" where the internal details of the implementation of each service are not represented; therefore, in this approach there is no mechanism to represent the relationship between services and the goals that justify their creation. Another example of the use of services at the business level is the proposal of Software-aided Service Bundling [4][5][6]. The main contribution of this research work is the definition of an ontology – a formalized conceptual model – of services to develop software for service bundling. A service bundle consists of elementary services, and service providers can offer service bundles via the Internet. The ontology describes services from a business value perspective: services are described by the exchange of economic values between suppliers and customers rather than by physical properties. This modeling technique shares the same problem as the on demand business service proposal: the services are defined as black boxes, where the main focus is on the definition of the set of inputs and outputs of the service. One of the main consequences of not having mechanisms to describe the internal behavior of the services is that it is impossible to relate the services offered to the strategic objectives of the enterprise. Therefore, it could be difficult to define the alternative services that better satisfy the goals of the enterprise. No matter at what level the services are analyzed, in all cases there is a strong dependency between the concept of service and the concept of business functionality. However, this key aspect of service modeling has been historically neglected in the literature. At present, there is only a partial solution to the problem of representing services at the organizational level, in the same way as the services are perceived by
the final customers. This paper presents an overview of a solution to this problem. In the proposed approach, goals are the mechanism that allows the analyst to match the business functionalities with the users' needs.
3 The Business Service Architecture
The research work presented in this paper is based on the hypothesis that it is possible to focus the organizational modeling activity on the values (services) offered by the enterprise to its customers. In this research work, we call them business services. Following this hypothesis, a method has been developed that provides mechanisms to guide the organizational modeling process from the business service viewpoint. In this context, business services can be used as the basic granules of information that allow us to encapsulate a set of composite process models. The use of services as building blocks enables the analyst to represent new business functionalities by composing models of existing services. It is important to point out that the research presented in this paper is an overview of a larger research project in which the following components are proposed: a) a modeling language that extends the i* modeling framework to support services, and which addresses issues detected in empirical evaluations of i* in practice [7]; b) a three-tier architecture that captures relevant aspects of services: composition, variability, goals, actors, plans, and behaviors; c) an elicitation technique to find current implementations of the services offered and requested by the analyzed enterprise, where goals play a very relevant role in the discovery process; d) a specific business modeling method to design or redesign an enterprise in accordance with the concept of business service; and e) a formal definition (axioms) of the modeling primitives and diagrams of the service-oriented architecture. Throughout the project, formal (axioms) and informal (diagrammatic) definitions are provided for the business service components.
3.1 Our Conceptualization of Business Services
We have defined a business service as a functionality that an organizational entity (an enterprise, functional area, department, or organizational actor) offers to other entities in order to fulfill its goals [2]. To provide the functionality, the organizational unit publishes a fragment of the business process as an interface with the users of the service. The business service concept refers to the basic building blocks that act as the containers in which the internal behaviors and social relationships of a business process are encapsulated.
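As a rough illustration of this definition (not the authors' metamodel), a business service can be thought of as a record tying a provider and a customer, their goals, and the published process fragment together; the sketch below uses names drawn from the travel-agency running example, while the field layout is an assumption.

```python
# Hedged sketch of the business-service concept as defined above.
from dataclasses import dataclass, field


@dataclass
class Goal:
    name: str


@dataclass
class Actor:
    name: str
    internal_goals: list = field(default_factory=list)


@dataclass
class BusinessService:
    name: str
    provider: Actor
    customer: Actor
    fulfills: Goal                      # provider goal the service helps fulfill
    # Fragment of the business process published as the interface with the
    # users of the service; the internal behavior stays encapsulated.
    interface: list = field(default_factory=list)


car_reservation = BusinessService(
    name="Car Reservation",
    provider=Actor("Travel agency",
                   [Goal("maximize investment in car rentals")]),
    customer=Actor("Customer", [Goal("rent a car")]),
    fulfills=Goal("manage car reservations"),
    interface=["request walk-in rental", "formalize the rent"],
)
print(car_reservation.name, "->", car_reservation.provider.name)
```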
Fig. 1. The Business Service Notation
The business services are represented using an extension of the notation of the i* framework. The concept of dependency provided by the i* framework has been modified to appropriately represent the social agreement between customers and providers (Fig. 1).
3.2 The Three-Tier Architectural Models
The business service architecture is composed of three complementary models that offer a view of what an enterprise offers to its environment and what the enterprise obtains in return.
Global Model: In the proposed method, the organizational modeling process starts with the definition of a high-level view of the services offered and used by the enterprise. The global model permits the representation of the business services and the actors that play the roles of requester and provider. Extensions to the i* conceptual primitives are used in this model. Fig. 2 shows a fragment of the detailed view of the business service global model for the running example.
Fig. 2. Fragment of the detailed view of the global model for the running example
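A compact, purely illustrative way to encode such a global-model fragment is a mapping from each business service to its provider, requester, and the customer goal it serves; the service and goal names are taken from Figure 2, while the encoding itself is assumed.

```python
# Illustrative encoding of a global-model fragment (services of Figure 2).
GLOBAL_MODEL = {
    "Flight Reservation":         ("Travel agency", "Customer", "reserve a flight"),
    "Car Reservation":            ("Travel agency", "Customer", "rent a car"),
    "Hotel Reservation":          ("Travel agency", "Customer", "reserve a hotel"),
    "Integrated Travel Planning": ("Travel agency", "Customer", "buy a travel package"),
}

for service, (provider, requester, goal) in GLOBAL_MODEL.items():
    print(f"{requester} depends on {provider} for '{service}' in order to {goal}")
```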
Process Model: Once business services have been elicited, they must be decomposed into a set of concrete processes that perform them. To do this, we use a process model that represents the functional abstractions of the business process for a specific service. At least one process model needs to be developed for each business service defined in the global model. Extensions to the i* conceptual primitives are used in this model. Fig. 3 shows an example of the simplified view of the process model for the walk-in car rental case study. The process model provides the mechanisms required to describe the flow of multiple processes, where the direction of the arrows in the figure indicates a dependency for process execution; for example, analyzing the car availability is a precondition for requesting a walk-in rental. The box connectors with the letter "T" indicate a transactional process.
Interaction Model: Finally, the semantics of the interactions and transactions of each business process is represented in an isolated diagram using the i* conceptual constructs.
Fig. 3. Example of the process model for the walk-in reservation business service
Fig. 4. The interaction needed for requesting a walk-in rental
This model provides a description of a set of structured and associated activities that produce a specific result or product for a business service. It is represented using the redefined i* modeling primitives. Fig. 4 presents an example of the interaction model for the running example.
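The process-model view of a single service can be approximated in code as processes with transactional flags and precedence edges; the process names and the analyze-before-request precedence follow the text and Figure 3, while the remaining edges, the flags, and the encoding are assumptions made for illustration.

```python
# Assumed encoding of a process model: transactional flags + precedence edges.
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    transactional: bool = False          # boxes marked with "T" in Figure 3
    preconditions: list = field(default_factory=list)


analyze = Process("Analyze car availability")
request = Process("Request walk-in rental", preconditions=[analyze])
formalize = Process("Formalize the rent", transactional=True,
                    preconditions=[request])
finish = Process("Finish walk-in rental", preconditions=[formalize])


def execution_order(goal):
    """Depth-first walk over preconditions: one admissible execution order."""
    ordered, seen = [], set()

    def visit(p):
        if p.name in seen:
            return
        for pre in p.preconditions:
            visit(pre)
        seen.add(p.name)
        ordered.append(p.name)

    visit(goal)
    return ordered


print(execution_order(finish))
# ['Analyze car availability', 'Request walk-in rental',
#  'Formalize the rent', 'Finish walk-in rental']
```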
4 Conclusions and Future Work
As a solution to the lack of appropriate mechanisms to reduce the current mismatch between business models and service-oriented designs and implementations, we have proposed a service-oriented organizational model. In this model, services represent the functionalities that the enterprise offers to potential customers. Business services are the building blocks of a three-tier business architecture: business services, business processes, and business interactions. The organizational modeling process starts with the definition of a high-level view of the services offered and used by the
enterprise. Later, each business service is refined into more concrete process models, according to the proposed business service method. Finally, business interactions are represented using the revised version of the modeling concepts of the i* framework. The proposed service-oriented architecture introduces new i*-based modeling diagrams and the analysis needed to represent services at the organizational level. Our current research work is focused on semi-automatically generating WSDL Web service descriptions from business services.
References
1. Yu, E.: Modelling Strategic Relationships for Process Reengineering. Ph.D. Thesis, Department of Computer Science, University of Toronto (1995)
2. Estrada, H.: A service-oriented approach for the i* framework. Ph.D. Thesis, Valencia University of Technology, Valencia, Spain (2008)
3. Cherbakov, L., Galambos, G., Harishankar, R., Kalyana, S., Rackham, G.: Impact of service orientation at the business level. IBM Systems Journal 44(4), 653–668 (2005)
4. Baida, Z.: Software-aided Service Bundling - Intelligent Methods & Tools for Graphical Service Modeling. Ph.D. Thesis, Vrije Universiteit Amsterdam, The Netherlands (2006)
5. Gordijn, J., Akkermans, H.: E3-value: Design and evaluation of e-business models. IEEE Intelligent Systems 16(4), 11–17 (2001)
6. Gordijn, J., Akkermans, H.: Value based requirements engineering: Exploring innovative e-commerce ideas. Requirements Engineering Journal 8(2), 114–134 (2003)
7. Estrada, H., Martinez, A., Pastor, O., Mylopoulos, J.: An empirical evaluation of the i* framework in a model-based software generation environment. In: Dubois, E., Pohl, K. (eds.) CAiSE 2006. LNCS, vol. 4001, pp. 513–527. Springer, Heidelberg (2006)