Related Books of Interest
DB2 9 for Linux, UNIX, and Windows DBA Guide, Reference, and Exam Prep, Sixth Edition
by George Baklarz and Paul C. Zikopoulos
ISBN: 0-13-185514-X
The sixth edition of this classic offers complete coverage of DB2® 9 administration and development for Linux®, UNIX®, and Windows® platforms, as well as authoritative preparation for the latest IBM® exam. Written for both DBAs and developers, it covers all aspects of deploying and managing DB2 9, including DB2 database design and development; day-to-day administration and backup; deployment of networked, Internet-centered, and SOA-based applications; migration; and much more, along with tips for optimizing performance, availability, and value. Download Complete DB2 V9 Trial Version: Visit ibm.com/db2/9/download.html to download a complete trial version of DB2, which enables you to try out dozens of the most powerful features of DB2 for yourself – everything from pureXML™ support to automated administration and optimization. Listen to the author’s podcast at: ibmpressbooks.com/podcasts
Understanding DB2: Learning Visually with Examples, Second Edition
by Raul F. Chong, Xiaomei Wang, Michael Dang, and Dwaine R. Snow
ISBN: 0-13-158018-3
IBM DB2 9 and DB2 9.5 provide breakthrough capabilities for providing Information on Demand, implementing Web services and Service Oriented Architecture, and streamlining information management. Understanding DB2: Learning Visually with Examples, Second Edition, is the easiest way to master the latest versions of DB2 and apply their full power to your business challenges. Written by four IBM DB2 experts, this book introduces key concepts with dozens of examples drawn from the authors’ experience working with DB2 in enterprise environments. Thoroughly updated for DB2 9.5, it covers new innovations ranging from manageability to performance and XML support to API integration. Each concept is presented with easy-to-understand screenshots, diagrams, charts, and tables. This book is for everyone who works with DB2: database administrators, system administrators, developers, and consultants. With hundreds of well-designed review questions and answers, it will also help professionals prepare for exams 730, 731, or 736. Listen to the author’s podcast at: ibmpressbooks.com/podcasts
Sign up for the monthly IBM Press newsletter at ibmpressbooks/newsletters
Understanding DB2 9 Security By Rebecca Bond, Kevin Yeung-Kuen See, Carmen Ka Man Wong, and Yuk-Kuen Henry Chan ISBN: 0-13-134590-7
Understanding DB2 9 Security is a comprehensive guide to securing DB2 and leveraging the powerful new security features of DB2 9. Direct from a DB2 Security deployment expert and the IBM DB2 development team, this book gives DBAs and their managers a wealth of security information that is available nowhere else. It presents real-world implementation scenarios, step-by-step examples, and expert guidance on both the technical and human sides of DB2 security. This book’s material is organized to support you through every step of securing DB2 in Windows, Linux, or UNIX environments. You’ll start by exploring the regulatory and business issues driving your security efforts, and then master the technological and managerial knowledge crucial to effective implementation. Next, the authors offer practical guidance on post-implementation auditing, and show how to systematically maintain security on an ongoing basis.
Mining the Talk Unlocking the Business Value in Unstructured Information by Scott Spangler, and Jeffrey Kreulen ISBN: 0-13-233953-6
In Mining the Talk, two leading-edge IBM researchers introduce a revolutionary new approach to unlocking the business value hidden in virtually any form of unstructured data – from word processing documents to websites, emails to instant messages. The authors review the business drivers that have made unstructured data so important and explain why conventional methods for working with it are inadequate. Then, writing for business professionals – not just data mining specialists – they walk step-by-step through exploring your unstructured data, understanding it, and analyzing it effectively.
key areas: learning from your customer interactions; hearing the voices of customers when they’re not talking to you; discovering the “collective consciousness” of your own organization; enhancing innovation; and spotting emerging trends. Whatever your organization, Mining the Talk offers you breakthrough opportunities to become more responsive, agile, and competitive. Listen to the author’s podcast at: ibmpressbooks.com/podcasts
Visit ibmpressbooks.com for all product information
Enterprise Master Data Management by Allen Dreibelbis, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run, and Dan Wolfson ISBN: 0-13-236625-8
Enterprise Master Data Management provides an authoritative, vendor-independent MDM technical reference for practitioners: architects, technical analysts, consultants, solution designers, and senior IT decision makers. Written by the IBM® data management innovators who are pioneering MDM, this book systematically introduces MDM’s key concepts and technical themes, explains its business case, and illuminates how it interrelates with and enables SOA.
Drawing on their experience with cutting-edge projects, the authors introduce MDM patterns, blueprints, solutions, and best practices published nowhere else: everything you need to establish a consistent, manageable set of master data, and use it for competitive advantage.
An Introduction to IMS Meltz, Long, Harrington, Hain, Nicholls ISBN: 0-13-185671-5
A Practical Guide to Trusted Computing Challener, Yoder, Catherman, Safford, Van Doorn ISBN: 0-13-239842-7
Mainframe Basics for Security Professionals Pomerantz, Weele, Nelson, Hahn ISBN: 0-13-173856-9
Service-Oriented Architecture (SOA) Compass Bieberstein, Bose, Fiammante, Jones, Shah ISBN: 0-13-187002-5
WebSphere Business Integration Primer Iyengar, Jessani, Chilanti ISBN: 0-13-224831-X
Outside-in Software Development Kessler, Sweitzer ISBN: 0-13-157551-1
DB2® pureXML® Cookbook
Master the Power of the IBM® Hybrid Data Server
Matthias Nicola
Pav Kumar-Chatterjee
IBM Press Pearson plc Upper Saddle River, NJ • Boston • Indianapolis • San Francisco New York • Toronto • Montreal • London • Munich • Paris • Madrid Cape Town • Sydney • Tokyo • Singapore • Mexico City Ibmpressbooks.com
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. Before you use any IBM or non-IBM or open-source product mentioned in this book, make sure that you accept and adhere to the licenses and terms and conditions for any such product. © Copyright 2010 by International Business Machines Corporation. All rights reserved. Note to U.S. Government Users: Documentation related to restricted right. Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corporation. IBM Press Program Managers: Steven M. Stansel, Ellice Uffer Cover design: IBM Corporation Associate Publisher: Greg Wiegand Marketing Manager: Kourtnaye Sturgeon Publicist: Heather Fox Acquisitions Editor: Bernard Goodwin Managing Editor: Kristy Hart Designer: Alan Clements Project Editor: Andy Beaster Copy Editor: Paula Lowell Senior Indexer: Cheryl Lenser Compositor: Gloria Schurick Proofreader: Leslie Joseph Manufacturing Buyer: Dan Uhrig Published by Pearson plc Publishing as IBM Press IBM Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact: U.S. Corporate and Government Sales 1-800-382-3419
[email protected]. For sales outside the U.S., please contact: International Sales
[email protected]. The following terms are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: IBM, the IBM logo, IBM Press, DB2, pureXML, z/OS, ibm.com, WebSphere, System z, developerWorks, InfoSphere, DRDA, Rational, AIX, OmniFind, i5/OS, Lotus, and DataPower. Microsoft, Windows, Microsoft Word, Microsoft Visual Studio, Visual Basic, and Visual C# are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc., in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
Library of Congress Cataloging-in-Publication Data Nicola, Matthias. DB2 PureXML cookbook : master the power of IBM’s hybrid data server / Matthias Nicola and Pav Kumar-Chatterjee. p. cm. Includes indexes. ISBN-13: 978-0-13-815047-1 (hardback : alk. paper) ISBN-10: 0-13-815047-8 (hardback : alk. paper) 1. IBM Database 2. 2. XML (Document markup language) 3. Database management. I. Kumar-Chatterjee, Pav. II. Title. QA76.9.D3N525 2009 006.7’4—dc22 2009020222 All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to: Pearson Education, Inc. Rights and Contracts Department 501 Boylston Street, Suite 900 Boston, MA 02116 Fax (617) 671 3447 ISBN-13: 978-0-13-815047-1 ISBN-10: 0-13-815047-8 Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan. First printing August 2009
I would like to dedicate this book to Scott and Carrie in the hope that it will inspire them to work hard at school and to my mother who did not see the final version, but who gave me unconditional support as only a mother can.
—Pav Kumar-Chatterjee
Contents
Chapter 1 Introduction
1.1 Anatomy of an XML Document
1.2 Differences Between XML and Relational Data
1.3 Overview of DB2 pureXML
1.4 Benefits of DB2 pureXML over Alternative Storage Options for XML Data
1.5 XML Solutions to Relational Data Model Problems
1.5.1 When the Schema Is Volatile
1.5.2 When Data Is Inherently Hierarchical in Nature
1.5.3 When Data Represents Business Objects
1.5.4 When Objects Have Sparse Attributes
1.5.5 When Data Needs to be Exchanged
1.6 Summary
Chapter 2 Designing XML Data and Applications
2.1 Choosing Between XML Elements and XML Attributes
2.2 XML Tags versus Values
2.3 Choosing the Right Document Granularity
2.4 Using a Hybrid XML/Relational Approach
2.5 Summary
Chapter 3 Designing and Managing XML Storage Objects
3.1 Understanding XML Document Trees
3.2 Understanding pureXML Storage
3.3 XML Storage in DB2 for Linux, UNIX, and Windows
3.3.1 Storage Objects for XML Data
3.3.2 Defining Columns, Tables, and Table Spaces for XML Data
3.3.3 Dropping XML Columns
3.3.4 Improved XML Storage Format in DB2 9.7
3.4 Using XML Base Table Row Storage (Inlining)
3.4.1 Monitoring and Configuring XML Inlining
3.4.2 Potential Benefits and Drawbacks of XML Inlining
3.5 Compressing XML Data
3.6 Examining XML Storage Space Consumption
3.7 Reorganizing XML Data and Indexes
3.8 Understanding XML Space Management: A Comprehensive Example
3.9 XML in Range Partitioned Tables and MDC Tables
3.9.1 XML and Range Partitioning
3.9.2 XML and Multidimensional Clustering
3.10 XML in a Partitioned Database (DPF)
3.11 XML Storage in DB2 for z/OS
3.11.1 Storage Objects for XML Data
3.11.2 Characteristics of XML Table Spaces
3.11.3 Tables with Multiple XML Columns
3.11.4 Naming and Storage Conventions
3.12 Utilities for XML Objects in DB2 for z/OS
3.12.1 REPORT TABLESPACESET for XML
3.12.2 Reorganizing XML Data in DB2 for z/OS
3.12.3 CHECK DATA for XML
3.13 XML Parsing and Memory Consumption in DB2 for z/OS
3.13.1 Controlling the Memory Consumption of XML Operations
3.13.2 Redirecting XML Parsing to zIIP and zAAP
3.14 Summary
Chapter 4 Inserting and Retrieving XML Data
4.1 Inserting XML Documents
4.1.1 Simple Insert Statements
4.1.2 Reading XML Documents from Files or URLs
4.2 Deleting XML Documents
4.3 Retrieving XML Documents
4.4 Handling Documents with XML Declarations
4.5 Copying Full XML Documents
4.6 Dealing with XML Special Characters
4.7 Understanding XML Whitespace and Document Storage
4.7.1 Preserving XML Whitespace
4.7.2 Changing the Whitespace Default from “Strip” to “Preserve”
4.7.3 Storing XML Documents for Compliance
4.8 Summary
Chapter 5 Moving XML Data
5.1 Exporting XML Data in DB2 for Linux, UNIX, and Windows
5.1.1 Exporting XML Documents to a Single File
5.1.2 Exporting XML Documents as Individual Files
5.1.3 Exporting XML Documents as Individual Files with Non-Default Names
5.1.4 Exporting XML Documents to One or Multiple Dedicated Directories
5.1.5 Exporting Fragments of XML Documents
5.1.6 Exporting XML Data with XML Schema Information
5.2 Importing XML Data in DB2 for Linux, UNIX, and Windows
5.2.1 IMPORT Command and Input Files
5.2.2 Import/Insert Performance Tips
5.3 Loading XML Data in DB2 for Linux, UNIX, and Windows
5.4 Unloading XML Data in DB2 for z/OS
5.5 Loading XML Data in DB2 for z/OS
5.6 Validating XML Documents during Load and Insert Operations
5.7 Splitting Large XML Documents into Smaller Documents
5.8 Replicating and Publishing XML Data
5.9 Federating XML Data
5.10 Managing XML Data with HADR
5.11 Handling XML Data in db2look and db2move
5.12 Summary
Chapter 6 Querying XML Data: Introduction and XPath
6.1 An Overview of Querying XML Data
6.2 Understanding the XQuery and XPath Data Model
6.2.1 Sequences
6.2.2 Sequence in, Sequence out
6.3 Sample Data for XPath, SQL/XML, and XQuery
6.4 Introduction to XPath
6.4.1 Analogy Between XPath and Navigating a File System
6.4.2 Simple XPath Queries
6.5 How to Execute XPath in DB2
6.6 Wildcards and Double Slashes
6.7 XPath Predicates
6.8 Existential Semantics
6.9 Logical Expressions with and, or, not()
6.10 The Current Context and the Parent Step
6.11 Positional Predicates
6.12 Union and Construction of Sequences
6.13 XPath Functions
6.14 General and Value Comparisons
6.15 XPath Axes and Unabbreviated Syntax
6.16 Summary
Chapter 7 Querying XML Data with SQL/XML
7.1 Overview of SQL/XML
7.2 Retrieving XML Documents or Document Fragments with XMLQUERY
7.2.1 Referencing XML Columns in SQL/XML Functions
7.2.2 Retrieving Element Values Without XML Tags
7.2.3 Retrieving Repeating Elements with XMLQUERY
7.3 Retrieving XML Values in Relational Format with XMLTABLE
7.3.1 Generating Rows and Columns from XML Data
7.3.2 Dealing with Missing Elements
7.3.3 Avoiding Type Errors
7.3.4 Retrieving Repeating Elements with XMLTABLE
7.3.5 Numbering XMLTABLE Rows Based on Repeating Elements
7.3.6 Retrieving Multiple Repeating Elements at Different Levels
7.4 Using XPath Predicates in SQL/XML with XMLEXISTS
7.5 Common Mistakes with SQL/XML Predicates
7.6 Using Parameter Markers or Host Variables
7.7 XML Queries with Dynamically Computed XPath Expressions
7.8 Ordering a Query Result Set Based on XML Values
7.9 Converting XML Values to Binary SQL Types
7.10 Summary
Chapter 8 Querying XML Data with XQuery
8.1 XQuery Overview
8.2 Processing XML Data with FLWOR Expressions
8.2.1 Anatomy of a FLWOR Expression
8.2.2 Understanding the for and let Clauses
8.2.3 Understanding the where and order by Clauses
8.2.4 FLWOR Expressions with Multiple for and let Clauses
8.3 Comparing FLWOR Expressions, XPath Expressions, and SQL/XML
8.3.1 Traversing XML Documents
8.3.2 Using XML Predicates
8.3.3 Result Set Cardinalities in XQuery and SQL/XML
8.3.4 Using FLWOR Expressions in SQL/XML
8.4 Constructing XML Data
8.4.1 Constructing Elements with Computed Values
8.4.2 Constructing XML Data with Predicates and Conditions
8.4.3 Constructing Documents with Multiple Levels of Nesting
8.4.4 Constructing Documents with XML Aggregation in SQL/XML Queries
8.5 Data Types, Cast Expressions, and Type Errors
8.6 Arithmetic Expressions
8.7 XQuery Functions
8.7.1 String Functions
8.7.2 Number and Aggregation Functions
8.7.3 Sequence Functions
8.7.4 Namespace and Node Functions
8.7.5 Date and Time Functions
8.7.6 Boolean Functions
8.8 Embedding SQL in XQuery
8.9 Using SQL Functions and User-Defined Functions in XQuery
8.10 Summary
Chapter 9 Querying XML Data: Advanced Queries & Troubleshooting
9.1 Aggregation and Grouping of XML Data
9.1.1 Aggregation and Grouping Queries with XMLTABLE
9.1.2 Aggregation of Values within and across XML Documents
9.1.3 Grouping Queries in SQL/XML versus XQuery
9.2 Join Queries with XML Data
9.2.1 XQuery Joins between XML Columns
9.2.2 SQL/XML Joins between XML Columns
9.2.3 Joins between XML and Relational Columns
9.2.4 Outer Joins between XML Columns
9.3 Case-Insensitive XML Queries
9.4 How to Avoid “Bad” Queries
9.4.1 Construction of Excessively Large Documents
9.4.2 “Between” Predicates on XML Data
9.4.3 Large Global Sequences
9.4.4 Multilevel Nesting SQL and XQuery
9.5 Common Errors and How to Avoid Them
9.5.1 SQL16001N
9.5.2 SQL16002N
9.5.3 SQL16003N
9.5.4 SQL16005N
9.5.5 SQL16015N
9.5.6 SQL16011N
9.5.7 SQL16061N
9.5.8 SQL16075N
9.6 Summary
Chapter 10 Producing XML from Relational Data
10.1 SQL/XML Publishing Functions
10.1.1 Constructing XML Elements from Relational Data
10.1.2 NULL Values, Missing Elements, and Empty Elements
10.1.3 Constructing XML Attributes from Relational Data
10.1.4 Constructing XML Documents from Multiple Relational Rows
10.1.5 Constructing XML Documents from Multiple Relational Tables
10.1.6 Comparing XMLAGG, XMLCONCAT, and XMLFOREST
10.1.7 Conditional Element Construction
10.1.8 Leading Zeros in Constructed Elements and Attributes
10.1.9 Default Tagging of Relational Data with XMLROW and XMLGROUP
10.1.10 GUI-Based Definition of SQL/XML Publishing Queries
10.1.11 Constructing Comments, Processing Instructions, and Text Nodes
10.1.12 Legacy Functions
10.2 Using XQuery Constructors with Relational Input
10.3 XML Declarations for Constructed XML Data
10.4 Inserting Constructed XML Data into XML Columns
10.5 Summary
Chapter 11 Converting XML to Relational Data
11.1 Advantages and Disadvantages of Shredding
11.2 Shredding with the XMLTABLE Function
11.2.1 Hybrid XML Storage
11.2.2 Relational Views over XML Data
11.3 Shredding with Annotated XML Schemas
11.3.1 Annotating an XML Schema
11.3.2 Defining Schema Annotations Visually in IBM Data Studio
11.3.3 Registering an Annotated Schema
11.3.4 Decomposing One XML Document at a Time
11.3.5 Decomposing XML Documents in Bulk
11.4 Summary
Chapter 12 Updating and Transforming XML Documents
12.1 Replacing a Full XML Document
12.2 Modifying Documents with XQuery Updates
12.3 Updating the Value of an XML Node in a Document
12.3.1 Replacing an Element Value
12.3.2 Replacing an Attribute Value
12.3.3 Replacing a Value Using a Parameter Marker
12.3.4 Replacing Multiple Values in a Document
12.3.5 Replacing an Existing Value with a Computed Value
12.4 Replacing XML Nodes in a Document
12.5 Deleting XML Nodes from a Document
12.6 Renaming Elements or Attributes in a Document
12.7 Inserting XML Nodes into a Document
12.7.1 Defining the Position of Inserted Elements
12.7.2 Defining the Position of Inserted Attributes
12.7.3 Insert Examples
12.8 Handling Repeating and Missing Nodes
12.9 Modifying Multiple XML Nodes in the Same Document
12.9.1 Snapshot Semantics and Conflict Situations
12.9.2 Converting Elements to Attributes and Vice Versa
12.10 Modifying XML Documents in Queries
12.11 Modifying XML Documents in Insert Operations
12.12 Modifying XML Documents in Update Cursors
12.13 XML Updates in DB2 for z/OS
12.14 Transforming XML Documents with XSLT
12.14.1 The XSLTRANSFORM Function
12.14.2 XML to HTML Transformation
12.15 Summary
Chapter 13 Defining and Using XML Indexes
13.1 Defining XML Indexes
13.1.1 Unique XML Indexes
13.1.2 Lean XML Indexes
13.1.3 Using the DB2 Control Center to Create XML Indexes
13.2 XML Index Data Types
13.2.1 VARCHAR(n)
13.2.2 VARCHAR HASHED
13.2.3 DOUBLE and DECFLOAT
13.2.4 DATE and TIMESTAMP
13.2.5 Choosing a Suitable Index Data Type
13.2.6 Rejecting Invalid Values
13.3 Using XML Indexes to Evaluate Query Predicates
13.3.1 Understanding Index Eligibility
13.3.2 Data Types in XML Indexes and Query Predicates
13.3.3 Text Nodes in XML Indexes and Query Predicates
13.3.4 Wildcards in XML Indexes and Query Predicates
13.3.5 Using Indexes for Structural Predicates
13.4 XML Indexes and Join Predicates
13.5 XML Indexes on Non-Leaf Elements
13.6 Special Cases Where XML Indexes Cannot be Used
13.6.1 Special Cases with XMLQUERY
13.6.2 Parent Steps
13.6.3 The let and return Clauses
13.7 XML Index Internals
13.7.1 XML Index Keys
13.7.2 Logical and Physical XML Indexes
13.8 XML Index Statistics
13.9 Summary
Chapter 14 XML Performance and Monitoring
14.1 Explaining XML Queries in DB2 for Linux, UNIX, and Windows
14.1.1 The Explain Tables in DB2 for Linux, UNIX, and Windows
14.1.2 Using db2exfmt to Obtain Access Plans
14.1.3 Using Visual Explain to Display Access Plans
14.1.4 Access Plan Operators
14.1.5 Understanding and Analyzing XML Query Execution Plans
14.2 Explaining XML Queries in DB2 for z/OS
14.2.1 The Explain Tables in DB2 for z/OS
14.2.2 Obtaining Access Plan Information in SPUFI
14.2.3 Using Visual Explain to Display Access Plans
14.2.4 Access Plan Operators
14.2.5 Understanding and Analyzing XML Query Execution Plans
14.3 Statistics Collection for XML Data
14.3.1 Statistics Collection for XML Data in DB2 for z/OS
14.3.2 Statistics Collection for XML Data in DB2 for Linux, UNIX, and Windows
14.3.3 Examining XML Statistics with db2cat
14.4 Monitoring XML Activity
14.4.1 Using the Snapshot Monitor in DB2 for Linux, UNIX, and Windows
14.4.2 Monitoring Database Utilities
14.5 Best Practices for XML Performance
14.5.1 XML Document Design
14.5.2 XML Storage
14.5.3 XML Queries
14.5.4 XML Indexes
14.5.5 XML Updates
14.5.6 XML Schemas
14.5.7 XML Applications
14.6 Summary
Chapter 15 Managing XML Data with Namespaces
15.1 Introduction to XML Namespaces
15.1.1 Namespace Declarations in XML Documents
15.1.2 Default Namespaces
15.2 Exploring Namespaces in XML Documents
15.3 Querying XML Data with Namespaces
15.3.1 Declaring Namespaces in XML Queries
15.3.2 Using Namespace Declarations in SQL/XML Queries
15.3.3 Using Namespaces in the XMLTABLE Function
15.3.4 Dealing with Multiple Namespaces per Document
15.4 Creating Indexes for XML Data with Namespaces
15.5 Constructing XML Data with Namespaces
15.5.1 SQL/XML Publishing Functions and Namespaces
15.5.2 XQuery Constructors and Namespaces
15.6 Updating XML Data with Namespaces
15.6.1 Updating Values in Documents with Namespaces
15.6.2 Renaming Nodes in Documents with Namespace Prefixes
15.6.3 Renaming Nodes in Documents with Default Namespaces
15.6.4 Inserting and Replacing Nodes in Documents with Namespaces
15.7 Summary
Chapter 16 Managing XML Schemas
16.1 Introduction to XML Schemas and Their Usage
16.1.1 Valid Versus Well-Formed XML Documents
16.1.2 To Validate or Not to Validate, That Is the Question!
16.1.3 Custom Versus Industry Standard XML Schemas
16.2 Anatomy of an XML Schema
16.3 An XML Schema with Include and Import
16.4 Registering XML Schemas
16.4.1 Registering XML Schemas in the DB2 Command Line Processor
16.4.2 Registering XML Schemas from Applications via Stored Procedures
16.4.3 Registering XML Schemas from Java Applications via JDBC
16.4.4 Two XML Schemas Sharing a Common Schema Document
16.4.5 Error Situations and How to Resolve Them
16.5 Removing XML Schemas from the Schema Repository
16.6 XML Schema Evolution
16.6.1 Schema Evolution Without Document Validation
16.6.2 Generic Schema Evolution with Document Validation
16.6.3 Compatible Schema Evolution with the UPDATE XMLSCHEMA Command
16.7 Granting and Revoking XML Schema Usage Privileges
16.8 Document Type Definitions (DTDs) and External Entities
16.9 Browsing the XML Schema Repository (XSR)
16.9.1 Tables and Views of the XML Schema Repository
16.9.2 Queries against the XML Schema Repository
16.10 XML Schema Considerations in DB2 for z/OS
16.11 Summary
Chapter 17 Validating XML Documents against XML Schemas
17.1 Document Validation Upon Insert
17.2 Document Validation Upon Update
17.3 Validation without Rejecting Invalid Documents
17.4 Enforcing Validation with Check Constraints
17.5 Automatic Validation with Triggers
17.6 Diagnosing Validation and Parsing Errors
17.7 Validation during Load and Import Operations
17.7.1 Validation against a Single XML Schema
17.7.2 Validation against Multiple XML Schemas
17.7.3 Using a Default XML Schema
17.7.4 Overriding XML Schema References
17.7.5 Validation Based on schemaLocation Attributes
17.8 Checking Whether an Existing Document Has Been Validated
17.9 Validating Existing Documents in a Table
17.10 Finding the XML Schema for a Validated Document
17.11 How to Undo Document Validation
17.12 Considerations for Validation in DB2 for z/OS
17.12.1 Document Validation Upon Insert
17.12.2 Document Validation Upon Update
17.12.3 Validating Existing Documents in a Table
17.12.4 Summary of Platform Similarities and Differences
17.13 Summary
Chapter 18 Using XML in Stored Procedures, UDFs, and Triggers
18.1 Manipulating XML in SQL Stored Procedures
18.1.1 Basic XML Manipulation in Stored Procedures
18.1.2 A Stored Procedure to Store XML in a Hybrid Manner
18.1.3 Loops and Cursors
18.1.4 A Stored Procedure to Update a Selected XML Element or Attribute
18.1.5 Three Tips for Testing Stored Procedures
18.2 Manipulating XML in User-Defined Functions
18.2.1 A UDF to Extract an Element or Attribute Value
18.2.2 A UDF to Extract the Values of a Repeating Element
18.2.3 A UDF to Shred XML Data to a Relational Table
18.2.4 A UDF to Modify an XML Document
18.3 Manipulating XML Data with Triggers
18.3.1 Insert Triggers on Tables with XML Columns
18.3.2 Delete Triggers on Tables with XML Columns
18.3.3 Update Triggers on XML Columns
18.4 Summary
Chapter 19 Performing Full-Text Search
19.1 Overview of Text Search in DB2
19.2 Sample Table and Data
19.3 Enabling a Database for the DB2 Net Search Extender
19.4 Managing Full-Text Indexes with the DB2 Net Search Extender
19.4.1 Creating Basic Text Indexes
19.4.2 Creating Text Indexes with Specific Storage Paths
19.4.3 Creating Text Indexes with a Periodic Update Schedule
19.4.4 Creating Text Indexes for Specific Parts of Each Document
19.4.5 Creating Text Indexes with Advanced Options
19.4.6 Updating and Reorganizing Text Indexes
19.4.7 Altering Text Indexes
19.5 Performing XML Full-Text Search with the DB2 Net Search Extender
19.5.1 Full-Text Search in SQL and XQuery
19.5.2 Full-Text Search with Boolean Operators
19.5.3 Full-Text Search with Custom Document Models
19.5.4 Advanced Search with Proximity, Fuzzy, and Stemming Options
19.5.5 Finding the Correct Match within an XML Document
19.5.6 Search Conditions on Sibling Branches of an XML Document
19.5.7 Text Search in the Presence of Namespaces
19.6 DB2 Text Search
19.6.1 Enabling a Database for DB2 Text Search
19.6.2 Creating and Maintaining Full-Text Indexes for DB2 Text Search
19.6.3 Writing DB2 Text Search Queries for XML Data
19.6.4 Full-Text Search with XPath Expressions
19.6.5 Full-Text Search with Wildcards
19.7 Summary of Text Search Administration Commands
19.8 XML Full-Text Search in DB2 for z/OS
19.9 Summary
Chapter 20 Understanding XML Data Encoding
20.1 Understanding Internal and External XML Encoding
20.1.1 Internally Encoded XML Data
20.1.2 Externally Encoded XML Data
20.2 Avoiding Code Page Conversions
20.3 Using Non-Unicode Databases for XML
20.4 Examples of Code Page Issues
20.4.1 Example 1: Chinese Characters in a Non-Unicode Code Page ISO-8859-1
20.4.2 Example 2: Fetching Data from a Non-Unicode Code Database into a Character Type Application Variable
20.4.3 Example 3: Encoding Issues with XMLTABLE and XMLCAST
20.4.4 Example 4: Japanese Literal Values in a Non-Unicode Database
20.4.5 Example 5: Data Expansion and Shrinkage Due to Code Page Conversion
20.5 Avoiding Data Loss and Encoding Errors in Non-Unicode Databases
20.6 Summary
Chapter 21 Developing XML Applications with DB2
21.1 The Value of DB2 pureXML for Application Development
21.1.1 Avoid XML Parsing in the Application Layer
21.1.2 Storing Business Objects in an Intuitive Format
21.1.3 Rapid Prototyping
21.1.4 Responding Quickly to Changing Business Needs
21.2 Using Parameter Markers or Host Variables
21.3 Java Applications
21.3.1 XML Support in JDBC 3.0
21.3.2 XML Support in JDBC 4.0
21.3.3 Comprehensive Example of Manipulating XML Data with JDBC 4.0
21.3.4 Creating XML Documents from Application Data
21.3.5 Binding XML Data to Java Objects
21.3.6 IBM pureQuery
21.4 .NET Applications
21.4.1 Querying XML Data in .NET Applications
21.4.2 Manipulating XML Data in .NET Applications
21.4.3 Inserting XML Data from .NET Applications
21.4.4 XML Schema and DTD Handling in .NET Applications
21.5 CLI Applications
21.6 Embedded SQL Applications
21.6.1 COBOL Applications with Embedded SQL
21.6.2 PL/1 Applications with Embedded SQL
21.6.3 C Applications with Embedded SQL
21.7 PHP Applications
21.8 Perl Applications 21.9 XML Application Development Tools 21.9.1 IBM Data Studio Developer 21.9.2 IBM Database Add-ins for Visual Studio 21.9.3 Altova XML Tools 21.9.4 21.9.5 Stylus Studio 21.10 Summary
Chapter 22 Exploring XML Information in the DB2 Catalog 22.1 XML-Related Catalog Information in DB2 for Linux, UNIX, and Windows 22.1.1 Catalog Information for XML Columns 22.1.2 The XML Strings and Paths Tables 22.1.3 The Internal XML Regions and Path Indexes 22.1.4 Catalog Information for User-Defined XML Indexes 22.1.5 Catalog Information for XML Schemas 22.2 XML-Related Catalog Information in DB2 for z/OS 22.2.1 Catalog Information for XML Storage Objects 22.2.2 Catalog Information for XML Indexes 22.2.3 Catalog Information for XML Schemas 22.3 Summary
Chapter 23 Test Your Knowledge—The DB2 pureXML Quiz 23.1 Designing XML Data and Applications 23.2 Designing and Managing Storage Objects for XML 23.3 Inserting and Retrieving XML Data 23.4 Moving XML Data 23.5 Querying XML 23.6 Producing XML from Relational Data 23.7 Converting XML to Relational Data 23.8 Updating and Transforming XML Documents 23.9 Defining and Using XML Indexes 23.10 XML Performance and Monitoring 23.11 Managing XML Data with Namespaces 23.12 XML Schemas and Validation 23.13 Performing Full-Text Search 23.14 XML Application Development 23.15 Answers
Appendix A Getting Started with DB2 pureXML A.1 Exploring the Structure of XML Documents A.1.1 Exploring XML Documents in the DB2 Control Center A.1.2 Exploring XML Documents in the CLP A.1.3 Exploring XML Documents in SPUFI A.2 Tips for Running XML Operations in the CLP
Appendix B The XML Sample Database B.1 XML Sample Database on DB2 for Linux, UNIX, and Windows B.2 XML Sample Tables on DB2 for z/OS B.3 Table customer—Column info B.4 Table product—Column description B.5 Table purchaseorder—Column porder
Appendix C Further Reading C.1 General Resources for All Chapters C.2 Chapter-Specific Resources C.3 Resources on the Integration of DB2 pureXML with Other Products
Index
Foreword
In the years since E.F. Codd’s groundbreaking work in the 1970s, relational database systems have become ubiquitous in the business world. Today, most of the world’s business data is stored in the rows and columns of relational databases. The relational model is ideally suited to applications in which data has a relatively simple and uniform structure, and in which database structure evolves much more slowly than data values.
With the advent of the Web, however, big changes began to occur in the database world, driven by globalization and by dramatic reductions in the cost of storing, transmitting, and processing data. Today, businesses are globally interconnected and exchange large volumes of data with customers, suppliers, and governments. Much of this data consists of things that do not fit neatly into rows and columns, such as medical records, legal documents, incident reports, tax returns, and purchase orders. The new kinds of data tend to be more heterogeneous than traditional business data, having more variation and a more rapidly evolving structure.
In response to the changing requirements of business data, a new generation of standards has appeared. XML has emerged as an international standard for the exchange of self-describing data, unifying structured, unstructured, and semi-structured information formats. XML Schema has been adopted as the metadata syntax for describing the structure of XML documents. Industry-specific XML schemas have been developed for medical, insurance, retail, publishing, banking, and other industries. XPath and XQuery have been adopted as standard languages for retrieving and manipulating data in XML format, and new facilities have been added to the SQL standard for interfacing between relational and XML data.
In DB2, the new generation of XML-related standards is reflected in pureXML, a broad new set of XML functionality implemented in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. pureXML bridges the gap between the XML and relational worlds and makes DB2 a true hybrid database management system. DB2 pureXML stores and indexes XML data alongside relational data in a highly efficient new storage format, and supports XML query languages such as XPath and XQuery alongside the traditional SQL. pureXML is perhaps the largest new package of functionality in the history of DB2, impacting nearly every aspect of the system.
The implementation of pureXML required deep changes in the database kernel, optimization methods, database administrator tools, system utilities, and application programming interfaces. New facilities were added for registering XML schemas and using them to validate stored documents. New kinds of statistics on XML documents had to be gathered and exploited. Facilities for replicated, federated, and partitioned databases had to be updated to accommodate the new XML storage format. pureXML provides DB2 users with a new level of capability, but using this capability to full advantage requires users to have a new level of sophistication. A new user of pureXML is
confronted with many complex choices. What kinds of data should be represented in XML rather than in normalized tables? How can data be converted between XML and relational formats? How can a hybrid database be designed to take advantage of both data formats? What are the most appropriate uses for SQL, XQuery, and XPath? What kinds of indexes should be maintained on XML data? What is the XML equivalent of a NULL value? These and many other questions are considered in detail in the DB2 pureXML Cookbook. Matthias Nicola has been deeply involved in the design and implementation of DB2 pureXML since its inception. As a Senior Engineer at IBM’s Silicon Valley Laboratory, his work has focused on measuring and optimizing the performance of new storage and indexing techniques for XML. After the release of pureXML, he worked with many IBM customers and business partners to create, deploy, and optimize XML applications for government, banking, telecommunications, retail, and other industries. Pav Kumar-Chatterjee is a technical specialist with many years of experience in consulting with IBM customers throughout the UK and Europe on developing and deploying DB2 and XML solutions. Through their work with customers, Matthias and Pav have learned how to explain concepts clearly and how to identify and avoid common pitfalls in the application development process. They have also developed a set of “best practices” that they have shared at numerous conferences, classes, workshops, and customer engagements. Between them, Matthias and Pav have accumulated all the knowledge and experience you need to successfully create and deploy solutions using DB2 pureXML. Their expertise is encapsulated in this book in the form of hundreds of practical examples, tested and clearly explained. The book also includes a comprehensive set of questions to test your understanding. 
DB2 pureXML Cookbook includes both an introduction to basic XML concepts and a comprehensive description of the XML-related features of DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Chapters are organized around tasks that reflect the lifecycle of XML projects, including designing databases, loading and validating data, writing queries and updates, developing applications, optimizing performance, and diagnosing problems. Each topic provides a clear progression from introductory material to more advanced concepts. The writing style is informal and easy to understand for both beginners and experts. If you are an application developer, database administrator, or system architect, this is the book you need to gain a comprehensive understanding of DB2 pureXML.
Don Chamberlin IBM Fellow, Emeritus Almaden Research Center April 10, 2009
Preface
In recent years XML has continued to emerge as the de-facto standard for data exchange, because it is flexible, extensible, self-describing, and suitable for any combination of structured and unstructured data. With the increasing use of XML as a pervasive data format, there is a growing need to store, index, query, update, and validate XML documents in database systems. In response to this demand, IBM has developed sophisticated XML data management capabilities that are deeply integrated in the DB2 database system. This novel technology is called DB2 pureXML and is available in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. With pureXML, DB2 has evolved into a hybrid database system that allows you to manage both XML and relational data in a tightly integrated manner.
The DB2 pureXML Cookbook provides the single most comprehensive coverage of DB2’s pureXML functionality in DB2 for Linux, UNIX, and Windows as well as DB2 for z/OS. This book is a “cookbook” because it is more than just a description of functions and features (“ingredients”). This book provides “recipes” that show you how to combine the pureXML ingredients to efficiently perform typical user tasks for managing XML data. This book explains DB2 pureXML in more than 700 practical examples, including 250+ XQuery and SQL/XML queries, taking you from simple introductions all the way to advanced scenarios, tuning, and troubleshooting. Since the first release of DB2 pureXML in 2006 we have worked with numerous companies to help them design, implement, optimize, and deploy XML applications with DB2. In this book we have distilled our experience from these pureXML projects so that you can benefit from proven implementation techniques, best practices, tips and tricks, and performance guidelines that are not described elsewhere.
WHO SHOULD READ THIS BOOK?
This book is written for database administrators, application developers, IT architects, and everyone who wants to get a deep technical understanding of DB2’s pureXML technology and how to use it most effectively. As a DBA you will learn, for example, how to design and manage XML storage objects, how to index XML data, where to find XML-related information in the DB2 catalog, and how to manage XML with DB2 utilities. Application developers learn, among other things, how to write XML queries and XML updates with XPath, SQL/XML, and XQuery, and how to code XML applications with Java, .NET, C, COBOL, PL/1, PHP, or Perl. This book is suitable for both beginners and experts. Each topic starts with simple examples, which provide an easy introduction, and works towards advanced concepts and solutions to complex problems. Extensive XML knowledge is not required to read this book because it includes the necessary introductions to XML, XPath, XQuery, XML Schema, and namespaces. These
concepts are explained through numerous examples that are easy to follow. We assume that you have some experience with relational databases and SQL, but we show all the relevant DB2 commands that are required to work through the examples in this book. Appendix C, Further Reading, also contains links to additional educational material about both DB2 and XML.
COVERAGE OF DB2 FOR Z/OS AND DB2 FOR LINUX, UNIX, AND WINDOWS IN THIS BOOK The book describes DB2 pureXML on all supported platforms and versions, which at the time of writing are DB2 9 for z/OS as well as DB2 9.1, 9.5, and 9.7 for Linux, UNIX, and Windows. Many pureXML features and functions are identical across DB2 for Linux, UNIX, and Windows and DB2 for z/OS. Where platform-specific differences exist we point them out along the way. However, this book does not intend to be a reference that lists all functions and features according to platform and version of DB2. Instead, this book is a “cookbook” that focuses on concepts, examples, and best practices. The capabilities in DB2 for z/OS and DB2 for Linux, UNIX, and Windows continue to grow and converge over time. For the latest information on which feature is available in which version, please consult the respective DB2 information center. DB2 for z/OS also continues to deliver pureXML enhancements via APARs. Please look at APAR II14426, which is an informational APAR that summarizes and links all other XML-related APARs for DB2 on z/OS. In our work with users who adopt DB2 pureXML we have made the following observation: Some of the users who begin to use DB2 pureXML on Linux, UNIX, and Windows have little or no prior experience with DB2. In contrast, most users who are interested in DB2 pureXML on z/OS are already familiar with DB2 for z/OS in general. This difference is reflected in this book; that is, we describe some DB2 concepts, such as monitoring or the use of DB2 utilities, in more detail for DB2 for Linux, UNIX, and Windows than for DB2 for z/OS.
DO IT YOURSELF! The best way to learn a new technology is hands-on. We strongly recommend that you download DB2 Express-C, which is free, and try the concepts that you learn in this book in DB2’s sample database. Appendixes A and B contain the necessary information to get you started.
DON’T HESITATE TO ASK QUESTIONS!
If any pureXML question is not covered in this book, the fastest way to get an answer is to post a question in the DB2 pureXML forum at http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1423. Whether you seek clarification about specific features or functions, or if you need help with a tricky query, this forum is the right place to ask for help. You are also welcome to contact the authors directly. If you want to discuss an XML project or if you have comments or feedback on the material in this book, we will be happy to hear from you. Please contact Matthias at [email protected] and Pav at [email protected].
HOW THIS BOOK IS STRUCTURED The DB2 pureXML Cookbook takes you through the different tasks and topics that you typically encounter during the life cycle of an XML project. The structure of this book with its 23 chapters is the following:
Planning Chapter 1, Introduction, provides an overview of XML and its differences to relational data, and discusses scenarios where XML has advantages over the relational model. This chapter also includes a summary of the pureXML technology. Chapter 2, Designing XML Data and Applications, covers fundamental XML design questions such as choosing between XML elements and attributes, selecting an appropriate XML document granularity, and deciding on a “good” mix of XML and relational data for your application.
Designing and Populating an XML Database Chapter 3, Designing and Managing XML Storage Objects, first explains the tree representation of XML documents and how they are physically stored in DB2. Then it describes how to create and manage tables and table spaces for XML, including compression, reorganization, and partitioning. Chapter 4, Inserting and Retrieving XML Data, looks at “full document” operations such as insert, delete, and retrieval of XML documents. This chapter also explains how to handle XML declarations, white space, and reserved characters in XML documents. Chapter 5, Moving XML Data, looks at importing, exporting, loading, replicating, and federating XML data in DB2. A technique to split large XML documents into smaller ones is also demonstrated.
Querying XML Data Chapter 6, Querying XML Data: Introduction and XPath, is the first of four chapters on querying XML data. This chapter provides an overview of the different options for querying XML, introduces the XPath and XQuery data model, and describes the XPath language in detail. These concepts are fundamental for the subsequent chapters.
Chapter 7, Querying XML Data with SQL/XML, explains how XPath can be included in SQL statements with the SQL/XML functions XMLQUERY and XMLTABLE and the XMLEXISTS predicate. The use of SQL/XML is illustrated through a rich collection of examples and a discussion of common mistakes and how to avoid them. Chapter 8, Querying XML Data with XQuery, introduces the XQuery language, which is a superset of XPath. Among other things, this chapter describes XQuery FLWOR expressions, combinations of SQL and XQuery, and a comparison of XPath, XQuery, and SQL/XML. Chapter 9, Querying XML Data: Advanced XML Queries and Troubleshooting, takes querying XML data to the expert level. It demonstrates how to perform grouping, aggregation, and joins over XML data or a mix of XML and relational data. The troubleshooting section discusses “bad” XML queries, common errors, and how to avoid both.
Converting, Updating, and Transforming Chapter 10, Producing XML from Relational Data, begins the discussion of converting, updating, and transforming data. This chapter explains how to read relational data from existing database tables and construct XML documents from it. Chapter 11, Converting XML to Relational Data, describes the opposite of Chapter 10, that is, the process of decomposing or shredding XML documents into relational tables. Two shredding methods are discussed, one using the XMLTABLE function and the other using annotated XML Schemas. Chapter 12, Updating and Transforming XML Documents, covers three techniques for updating XML documents: Full document replacement, XSLT transformations, and the XQuery Update Facility that allows you to modify, insert, delete, or rename individual elements and attributes within an XML document.
Performance and Monitoring Chapter 13, Defining and Using XML Indexes, is one of two chapters dedicated to performance. It describes how to create XML indexes to improve query performance and explains under which conditions query predicates can or cannot use XML indexes. Chapter 14, Performance and Monitoring, looks at analyzing the performance of XML operations with particular emphasis on understanding XML query access plans. A summary of best practices for XML performance in DB2 is also provided.
Ensuring Data Quality Chapter 15, Managing XML Data with Namespaces, introduces XML namespaces and explains how they avoid naming conflicts and ambiguity, thus contributing to data quality. This chapter illustrates how to index, query, update, and construct XML documents that contain namespaces. Chapter 16, Managing XML Schemas, first describes how XML Schemas can constrain XML documents in terms of their structure, element and attribute names, data types, and other characteristics. Then this chapter walks you through the concepts of registering, managing, and evolving XML Schemas in DB2. Chapter 17, Validating XML Documents against XML Schemas, concentrates on the validation of XML documents to ensure XML data quality in DB2. You can validate XML documents in INSERT and UPDATE statements, queries, and import and load operations.
Application Development Chapter 18, Using XML in Stored Procedures, UDFs, and Triggers, demonstrates how you can implement application-specific processing logic with XML manipulation in SQL stored procedures, user-defined functions, and triggers. Chapter 19, Performing Full-Text Search, describes how the DB2 Net Search Extender and DB2 Text Search support efficient full-text search in collections of XML documents. Chapter 20, Understanding XML Data Encoding, explains internal and external XML encoding, how DB2 determines and handles XML encoding, and how you can avoid code page conversion. Chapter 21, Developing XML Applications with DB2, contains techniques and best practices for application programs that exchange XML data with the DB2 server. Code samples are provided for Java, .NET, C, COBOL, PL/1, PHP, and Perl programmers.
Reference Material Chapter 22, Exploring XML Information in the DB2 Catalog, is a guide to how XML storage objects, XML indexes, and XML Schemas are listed in the database catalog. Chapter 23, Test Your Knowledge—The DB2 pureXML Quiz, offers 82 questions to revisit specific topic areas. The Appendixes list supporting information and further reading for each chapter.
Acknowledgments Writing this book would not have been possible without the support from many people. For their support and technical reviews we would like to thank Andrew Eisenberg, Andy Lai, Bert van der Linden, Bob Harbus, Christian Daser, Cindy Saracco, Craig Mullins, Daniela Wersin, David Salinero, Don Chamberlin, Guogen Zhang, Henrik Loeser, Holger Seubert, Ian Cook, Jan-Eike Michels, Jason Cu, John Pickford, Lan Huang, Manfred Paessler, Mark Mezofenyi, Martin Sommerlandt, Paul Fletcher, Phil Nelson, Qi Jin, Shantanu Munkur, Stefan Momma, Susan Gausden, Susan Malaika, Susan Visser, Susanne Englert, Thomas Fanghaenel, Tiffany Money, Tim Kiefer, and Yuchu Tong. Thanks also to the many talented people in the DB2 pureXML development team who have implemented this exciting technology that we have the privilege of writing about.
About the Authors Matthias Nicola is a Senior Software Engineer for DB2 pureXML at IBM’s Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, XML storage, indexing, and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Matthias has published more than a dozen articles on various XML topics (see www.matthiasnicola.de) and is a frequent speaker at DB2 conferences. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany.
Pav Kumar-Chatterjee has worked with DB2 since 1991 on DB2 for z/OS and since 2000 on DB2 for Linux, UNIX, and Windows. He is currently employed by IBM as a technical sales specialist for Information Management in the United Kingdom. He has helped customers implement the XML Extender product with DB2 V8 and has presented on DB2 and XML in the United Kingdom and around Europe.
Chapter 1 Introduction
XML, the eXtensible Markup Language, is the standard format for exchanging information between different systems, applications, and organizations. XML is also the underlying data format for many web applications, Service-Oriented Architectures (SOA), and message-based transaction processing systems. Enterprise application integration (EAI), enterprise information integration (EII), web services, the enterprise message bus (ESB), and standardization efforts in many vertical industries all rely on XML as the underlying technology for data exchange.
Organizations as well as entire industries have standardized XML Schemas to promote and simplify data exchange and are evolving those schemas to meet changing business needs. Many industry-specific initiatives as well as regulatory requirements are driving the adoption of XML. As more business transactions are conducted through web-based interfaces and electronic forms, government agencies and commercial enterprises face increasing requirements for preserving and post-processing the original transaction records. XML provides a straightforward means of capturing and maintaining the data associated with such electronic transactions. XML uses tags to define elements and attributes that hold business data. The element and attribute tags describe the intended meaning of the data items, and the nesting of the tags describes hierarchical relationships between the data items. Hence, XML is a self-describing data format. Data and metadata are tightly integrated in a vendor- and platform-independent format. These properties make XML well-suited for data exchange. Additionally, new tags can be invented and easily added. This extensibility allows XML to accommodate ever-evolving business needs. XML is a flexible data model that is suited for any combination of structured, unstructured, and semi-structured data. Also, XML documents can be modified and transformed, even into other
formats such as HTML. Furthermore, the consistency of XML documents can easily be verified with an XML Schema. All this has become possible through widely available standards and tools such as XML parsers, XSLT, XPath, XQuery, and XML Schema. They greatly relieve applications from the burden of dealing with proprietary data formats. In an era where message formats, business forms, processes, and services change frequently, XML often reduces the cost and time it takes to react to such changes and to maintain databases and application logic correspondingly. Beyond XML for data exchange, enterprises are keeping large amounts of business-critical data permanently in XML format. This practice has various reasons. Some businesses must retain XML documents in their original format for auditing and regulatory compliance. Common examples include legal and financial documents as well as electronic forms. Another reason for using XML as a permanent storage format is that XML can be a more suitable data model than a relational schema. If business objects are inherently complex, hierarchical, semi-structured, or highly variable in nature, the flexibility of XML offers advantages over a rigorously defined relational database schema. Accustomed to the benefits of mature relational databases, many users expect the same capabilities for XML data, such as the ability to persist, query, index, update, and validate XML data with full ACID (Atomicity, Consistency, Isolation, Durability) compliance, recoverability, high availability, and high performance. DB2 pureXML is the answer. 
The subsequent discussion in this chapter is structured along the following topics:
• Brief introduction to XML as a data format (section 1.1)
• Differences between XML and relational data (section 1.2)
• Overview of DB2 pureXML and its capabilities for managing XML data (section 1.3)
• Advantages of DB2 pureXML over alternative storage options for XML (section 1.4)
• Sample scenarios where XML can offer advantages over relational data (section 1.5)
1.1 ANATOMY OF AN XML DOCUMENT
In this section we illustrate the most important parts of an XML document. A complete and exhaustive discussion of the XML standard is outside the scope of this book. Pointers to textbooks and tutorials about XML are provided in Appendix C, Further Reading. Let’s look at the XML document in Figure 1.1 as an example. The first line of the document contains the optional XML declaration. It indicates that this document follows the XML 1.0 standard, which is most commonly used. Besides XML 1.0, the only other version of XML is currently XML 1.1, which is very rarely used. We only consider XML 1.0 in this book. The XML declaration of the sample document in Figure 1.1 also carries an optional encoding declaration. Encoding concepts are discussed in Chapter 20, Understanding XML Data Encoding.
An XML document consists of elements and their attributes. Each element consists of a start tag and an end tag. These tags are enclosed in angle brackets. For example, the third line of the document shows a start tag <name> and an end tag </name>. Together they define a single XML element, the name element. The characters between the start and the end tag, Larry Menard, represent the value or the content of this element. Every start tag of an element must have a corresponding end tag. Elements can contain other elements, which means that tags can be nested. For example, the element addr contains the elements street, city, prov-state, and pcode-zip. Nesting builds hierarchical structures and expresses relationships between the elements. Elements can occur multiple times, in which case they are called repeating elements. For example, the phone element is a repeating element. It occurs multiple times because a single customer can have multiple phone numbers. Nested and repeating elements express one-to-many relationships between data items.
[Figure: annotated XML document for the customer Larry Menard, with the address 223 NatureValley Road, Toronto, Ontario, M4C 5K8 and the phone numbers 905-555-9146 and 416-555-6121. Callouts: XML and encoding declaration; start tag of the root element; namespace declaration; attribute; attribute name; attribute value; element; element value (text node); comment; end tag of the root element.]
Figure 1.1 Anatomy of an XML document
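The element structure described above can also be explored programmatically. The following is a minimal sketch in Python, using only the standard library module xml.etree.ElementTree. The document is a reconstruction based on the element names and values mentioned in the text, not a copy of the figure: the Cid value, the values of the phone type attribute, and the comment text are assumptions made for illustration, and the namespace declaration is left out to keep the lookups simple.

```python
import xml.etree.ElementTree as ET

# Reconstructed sample document. Element names and text values follow the
# chapter; Cid="1004", the phone "type" values, and the comment are
# illustrative assumptions. The namespace declaration is omitted on purpose.
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<customerinfo Cid="1004">
  <name>Larry Menard</name>
  <addr country="Canada">
    <street>223 NatureValley Road</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M4C 5K8</pcode-zip>
  </addr>
  <phone type="work">905-555-9146</phone>
  <phone type="home">416-555-6121</phone>
  <!-- illustrative comment -->
</customerinfo>"""

# Parse from bytes so that the encoding declaration is honored by the parser.
root = ET.fromstring(DOC.encode("utf-8"))

# The text between a start tag and its end tag is the element value.
name = root.find("name").text

# Nested elements: the children of addr, in document order.
nested = [child.tag for child in root.find("addr")]

# Repeating elements: every occurrence of phone under the root element.
phones = [phone.text for phone in root.findall("phone")]

print(name)
print(nested)
print(phones)
```

The list of phone values reflects the one-to-many relationship between a customer and its phone numbers, and the children of addr reflect the nesting of tags.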
Elements can also contain one or multiple attributes within their start tag. Attributes are used to attach additional information to elements. They consist of an attribute name, the equal sign (=), and a value in quotes. For example, the element addr has an attribute country whose value is
Canada. Similarly, each occurrence of the element phone has an attribute type. Attribute values
must be in quotes regardless of whether the value is considered a numeric or a string value. For an XML document to be well-formed, it must have a single root element. The root element is the outermost element and contains all the other elements of the document. The root element in Figure 1.1 is customerinfo. It contains two attributes in its start tag, xmlns and Cid. The attribute Cid is used here to represent the customer identification number. The attribute xmlns is a reserved attribute and declares a namespace. Namespaces are optional and we defer their discussion to Chapter 15, Managing XML Data with Namespaces. XML element and attribute names are case sensitive. Tags whose names differ only in the use of uppercase and lowercase letters are completely distinct from each other. XML element and attribute names can contain letters, numbers, and certain other characters such as the underscore. However, tag names must not start with a number or punctuation character, must not start with the characters xml (or XML, xML, and so on), and must not contain spaces. The order in which elements appear in a document is significant. The order in which attributes appear within the start tag of an element is not significant. In other words, elements are ordered, attributes are not ordered. When to use elements and when to use attributes to represent certain data items is a data modeling question and is addressed in Section 2.1, Choosing Between XML Elements and XML Attributes. Further discussion of XML documents and their hierarchical representation is provided in Section 3.1, Understanding XML Document Trees.
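Several of these rules can be checked with any XML parser. The following is a small sketch in Python with the standard library parser, not a DB2 feature; the helper is_well_formed and the sample strings, including the Cid value, are hypothetical and exist only for this illustration. Note that this parser does not enforce the reservation of names that start with xml, so that rule is not exercised here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the standard library parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A single root element that contains all other elements is accepted.
single_root = is_well_formed(
    "<customerinfo><name>Larry Menard</name></customerinfo>")

# Two top-level elements mean there is no single root element.
two_roots = is_well_formed(
    "<name>Larry Menard</name><phone>905-555-9146</phone>")

# Names are case sensitive, so </Name> does not close <name>.
case_mismatch = is_well_formed("<name>Larry Menard</Name>")

# Attribute values must be quoted even if they look numeric.
unquoted_attribute = is_well_formed("<customerinfo Cid=1004/>")

# A tag name must not start with a number.
name_starts_with_digit = is_well_formed("<1name/>")

# Attribute order is not significant; element order is.
a = ET.fromstring('<phone type="work" ext="12"/>')
b = ET.fromstring('<phone ext="12" type="work"/>')
same_attributes = a.attrib == b.attrib

c = ET.fromstring("<addr><street/><city/></addr>")
d = ET.fromstring("<addr><city/><street/></addr>")
same_element_order = [e.tag for e in c] == [e.tag for e in d]

print(single_root, two_roots, case_mismatch, unquoted_attribute,
      name_starts_with_digit, same_attributes, same_element_order)
```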
1.2 DIFFERENCES BETWEEN XML AND RELATIONAL DATA
For a comparison of XML and relational data, let’s consider the simple XML document and the relational table in Figure 1.2. The relational table has six columns with fixed names and data types. This table is a very strict and inflexible structure because every row in the table has to have exactly the same format with the same number of columns and the same data types. It is not possible for one row in the table to have more or fewer columns than the next. It is also not possible for a column to have no data type or more than one data type. Each column has to have exactly one fixed data type. Moreover, the structure and data types of the table are defined before any data is inserted. Whenever data is inserted into or retrieved from this table, the format of the rows is known without looking at the actual data. The strict schema provides a lot of information about the data and its format, which allows for very efficient access. The XML document on the left side of Figure 1.2 represents data similar to the row in the table on the right. With DB2 pureXML you can store, index, query, and update this XML document even if there is no XML Schema that defines its structure or the data types of its elements. You may have an XML Schema for this XML document, but you don’t have to. The document itself contains some meta information that describes the data items, but no further schema information is necessary to store and query this document.
Robert Shoemaker 845 Kean Street Aurora 905-555-7258
CREATE TABLE address(cid INTEGER, name VARCHAR(30), street VARCHAR(40),
                     city VARCHAR(30), email VARCHAR(50), phone VARCHAR(20))

CID   NAME              STREET           CITY    EMAIL  PHONE
1003  Robert Shoemaker  845 Kean Street  Aurora  NULL   905-555-7258
Figure 1.2
XML document (left) and relational table (right)
Assume you receive information about another customer whose street name is 42 characters long. Inserting this information into the relational table fails with an error that needs to be handled. This error can be desirable because it enforces a certain constraint, but it can also be undesirable because it prevents the new information from being stored and processed immediately. Because XML allows more schema flexibility, a document with a 42-character street name can be inserted without an error. The absence of an error can be desirable because it allows the data to be stored immediately, but it can also be undesirable because the excessive length of the street value goes undetected and can cause problems in later processing steps. Clearly, the flexibility of XML needs to be used with care and only to the degree that is appropriate for a given application. Optionally, you can choose to use an XML Schema that constrains the XML document as strictly as the relational table in Figure 1.2. You could also choose to use a less stringent XML Schema. For example, you could use an XML Schema that requires the Cid value to be an integer and the name to not exceed 30 characters, leaving the data types of all other data items unconstrained. You can choose the degree of schema flexibility that is right for your application. Note that the relational table in Figure 1.2 contains a NULL value in the column email. In the XML document, an email element is simply omitted if this customer does not have email. Optional XML elements are another form of schema flexibility. Assume you receive information about a customer where, unexpectedly, the name of his assistant is included. The assistant name can easily be accommodated with an optional assistant element in an XML document. However, the relational table in Figure 1.2 does not allow the assistant name to be stored. Next, let’s consider a schema change. 
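Before turning to schema changes, the idea of a less stringent schema can be made concrete. The sketch below is written in Python and is not an XML Schema; it only imitates the two rules mentioned above (Cid is an integer, name has at most 30 characters) and leaves everything else unconstrained. The function check_customer, the customer Jane Roe, the street value, and the element names are all hypothetical.

```python
import xml.etree.ElementTree as ET

def check_customer(xml_text, max_name_len=30):
    """Apply a deliberately partial set of rules to one customer document.

    Only the Cid attribute and the name element are constrained; every other
    data item is left unconstrained. Returns a list of violation messages.
    """
    root = ET.fromstring(xml_text)
    problems = []
    cid = root.get("Cid")
    if cid is None or not cid.isdigit():
        problems.append("Cid must be an integer")
    name = root.findtext("name")
    if name is None or len(name) > max_name_len:
        problems.append("name must be present and at most %d characters" % max_name_len)
    return problems

# Hypothetical customer with a 42-character street name and no email element.
long_street = "1234 Saint Bartholomew Memorial Parkway NW"
doc = (
    '<customerinfo Cid="1004"><name>Jane Roe</name>'
    "<street>" + long_street + "</street><city>Aurora</city></customerinfo>"
)

street_length = len(long_street)
problems = check_customer(doc)            # the long street is not checked
email = ET.fromstring(doc).find("email")  # optional element is simply absent
bad_cid = check_customer('<customerinfo Cid="abc"><name>Jane Roe</name></customerinfo>')

print(street_length, problems, email, bad_cid)
```

The document with the long street passes because the street is unconstrained, and the missing email element needs no NULL placeholder. Whether the undetected length is acceptable depends on the application, as discussed above.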
Due to unforeseen changes in your business, you now need to store multiple phone numbers per customer. Reacting to this change is simple with XML. The document on the left side of Figure 1.3 simply uses multiple occurrences of the phone element. The repeating phone elements represent the new one-to-many relationship between customers and phones. Existing XPath queries that read phone elements do not change. Accommodating
multiple phone numbers per customer in the relational schema requires normalization, which is a drastic schema change. Existing SQL queries must be modified to perform the proper join between the two relational tables. Downtime and service interruptions are likely.

Robert Shoemaker 845 Kean Street Aurora 905-555-7258 416-555-2937

CREATE TABLE phones(cid INTEGER, phone VARCHAR(20))

CID   PHONE
1003  905-555-7258
1003  416-555-2937

CREATE TABLE address(cid INTEGER, name VARCHAR(30), street VARCHAR(40),
                     city VARCHAR(30), email VARCHAR(50), phone VARCHAR(20))

CID   NAME              STREET           CITY    EMAIL  PHONE
1003  Robert Shoemaker  845 Kean Street  Aurora  NULL   905-555-7258
Figure 1.3
A schema change in XML and relational data
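The effect of this schema change on queries can be sketched as follows. The example uses Python's standard library with an in-memory SQLite database in place of DB2. The element names are assumptions, and the address table is reduced to two columns for brevity.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed element names; only the values come from the example customer.
old_doc = ('<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
           '<phone>905-555-7258</phone></customerinfo>')
new_doc = ('<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
           '<phone>905-555-7258</phone><phone>416-555-2937</phone></customerinfo>')

def phones(xml_text):
    # The same path expression serves both document formats.
    return [p.text for p in ET.fromstring(xml_text).findall("phone")]

xml_old = phones(old_doc)
xml_new = phones(new_doc)

# Relational side after normalization (simplified address table).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE address(cid INTEGER, name VARCHAR(30))")
con.execute("CREATE TABLE phones(cid INTEGER, phone VARCHAR(20))")
con.execute("INSERT INTO address VALUES (1003, 'Robert Shoemaker')")
con.executemany("INSERT INTO phones VALUES (?, ?)",
                [(1003, "905-555-7258"), (1003, "416-555-2937")])

# Reading names with phone numbers now needs a join that the original
# single-table query did not need.
rows = con.execute(
    "SELECT a.name, p.phone FROM address a "
    "JOIN phones p ON a.cid = p.cid ORDER BY p.phone"
).fetchall()

print(xml_old, xml_new, rows)
```

The function phones is unchanged between the two document formats, whereas the SQL statement had to be rewritten for the two-table design.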
Some of the key differences between XML and relational data are summarized in Table 1.1. The flexibility of XML implies that examining and interpreting XML data can consume more computing resources than if the same data was stored in relational form. The reason is that information about the structure of the XML data needs to be discovered at runtime because a fixed schema is not always present. The relational data model relies on much more rigid schema definitions than XML. For a relational table in a database, the structure of a row and the size and data types of its columns are known as soon as the table is created. Therefore, data access is more straightforward and can be more efficient than for XML data. As such, relational data can provide very high performance but might fail to meet application requirements for schema flexibility. Table 1.1
Comparison of Relational and XML Data
Relational Data | XML Data
----------------------------------------
Highly structured, highly regular in nature | Semi-structured, can be highly variable in nature
Rows are flat | Data is hierarchical, can be arbitrarily nested
Fixed schema and metadata | Variable schema and metadata
Fixed number of columns per table | No fixed format, flexible number of elements and attributes per document
Fixed data type for all values in a column | Data types are optional and can be variable
Data format defined by DDL, known at query/insert/update compile time | Data format not necessarily predefined, not known until query/insert/update runtime
NULL values represent missing information | Optional elements and attributes can be omitted
Schema changes can be expensive | Schema changes are less expensive
In some cases, the nested and flexible structure of XML can offer performance benefits over relational schemas. Relational databases often require normalization to fit business data into flat, tabular structures. This normalization of complex business data requires transformation when data is stored and retrieved, and often leads to multi-way join queries in relational databases. XML can provide a more natural representation of complex business objects with all relevant relationships represented in a single document. The hierarchies within an XML document are essentially precomputed joins between related data items.
1.3
OVERVIEW OF DB2 PUREXML
This section provides a condensed overview of the DB2 pureXML technology. It summarizes the most important aspects of DB2 pureXML, which are described in more detail in the remainder of this book. At the core of DB2 pureXML is the data type XML, which has been added to the SQL type system in the SQL:2003 standard. Database users can define tables that contain one or multiple columns of type XML. In each row, a column of type XML contains either a well-formed XML document or NULL. A table that contains one or more XML columns can also contain other columns, such as INTEGER, VARCHAR, or DATE columns. Hence, users can define tables that hold both XML data and traditional relational data in each row of the table. The integration of XML and relational data is therefore very easy. It is also possible to create a table that only contains a single column of type XML and no other columns. DB2’s internal XML storage mechanism does not store XML data as text in large objects (LOBs) and does not convert XML to relational format. When you insert or load XML documents into a column of type XML, DB2 stores the XML documents in a parsed hierarchical format. Each XML document is parsed only once; that is, when it is first inserted into an XML column. The parsed storage format allows queries and updates to operate on XML data without XML parsing—a key performance benefit. The maximum XML document size is 2GB. You can use regular SQL statements to insert, delete, and update (replace) full XML documents. XML insert, update, and delete operations are logged by default and XML data is always buffered in the buffer pool. XML data participates in backup, restore, and recovery operations just like traditional relational data in the database. XML data can be compressed, replicated, and
federated, and is allowed in range-partitioned tables, clustered tables (MDC), and partitioned database environments (DPF). Partitioning keys and clustering keys must be relational columns. All the critical database utilities support XML data, such as LOAD, UNLOAD, IMPORT, EXPORT, RUNSTATS, REORG, BACKUP, RESTORE, and others. In DB2 for Linux, UNIX, and Windows, XML columns are also supported by High Availability Disaster Recovery (HADR). An XML Schema can be used to constrain XML documents, but the usage of XML Schemas is optional in DB2. In particular, you do not need to provide an XML Schema to create a column of type XML or to insert XML documents. DB2’s pureXML storage format does not depend on XML Schemas. When you insert, update, or load XML documents, you can choose to validate the documents against one or multiple XML Schemas. If you choose to validate documents, the validation and the association of schemas to documents happens on a per-document basis, not on a per-column basis. DB2 does not require all documents in an XML column to belong to the same XML Schema, although you can enforce that with triggers if you want. Since schema flexibility is often a key reason for using XML, DB2 allows documents for multiple schemas, or multiple versions of a schema, to coexist in a single XML column. XML Schema evolution is seamless and does not require any database downtime. The use of XML Schemas for document validation can help applications ensure XML data quality. However, there is no performance penalty if you store XML documents without validation in DB2. Although XML Schemas can constrain one XML document at a time, there is no standard or XML technology yet to define constraints or referential integrity across XML documents or across XML and relational data. However, when you insert XML documents into a table you can choose to extract selected element or attribute values into relational columns. 
DB2 can perform such value extraction as part of the INSERT statement, but it can also be automated with triggers. Then you can define relational constraints, such as foreign keys and check constraints, on the populated relational columns. In DB2, XML data can be queried with XPath and SQL/XML, and in DB2 for Linux, UNIX, and Windows, also with XQuery. The SQL/XML standard allows XPath and XQuery expressions to be embedded in SQL statements so that XML and relational data can be queried together in a single query. Joins between XML columns or between XML and relational columns are possible. The SQL/XML function XMLTABLE can be used to query XML data and return the result set in relational format. Other SQL/XML functions support the opposite; that is, to query traditional relational tables to construct and return XML documents that contain the data values. To ensure high performance for XML queries, DB2 allows you to create XML indexes on specific XML elements and attributes that you specify with an XPath. Similar to the relational world, it makes sense to index those XML elements and attributes that are frequently used in query predicates and join conditions. Although you can decide to index all elements and all attributes in all documents in an XML column, you are not forced to do so. Indexing selected elements and attributes is often preferred. If you define an XML index on an optional element that, for example, occurs in only 5% of the documents (rows), then the index is quite small because it contains
entries only for those 5% of the documents and rows in the table. In contrast, relational indexes always contain exactly one entry for each row in a table. If a query contains relational predicates and XML predicates, DB2 can use a combination of XML and relational indexes to evaluate the query. DB2’s RUNSTATS utility can collect statistics for XML data, which the DB2 optimizer uses to create efficient query execution plans. Although DB2 uses separate storage formats for XML and relational data, DB2 only has a single processing engine and a single query compiler and optimizer that handle any mix of relational and XML queries. DB2’s EXPLAIN facility can be used to examine the execution plans for XML queries just like for relational queries.

DB2 for Linux, UNIX, and Windows also supports XQuery Updates to modify, insert, delete, or rename individual XML elements and attributes within an XML document. XSLT transformations as well as full-text search over XML data are also supported.

Access control as well as concurrency control (locking) for XML data happen at the level of full documents. Since each XML document belongs to a row in a table, access control and concurrency control for a particular row determine the accessibility of the XML document in that row. Access rights and privileges cannot be defined for individual elements within an XML document.

The XML data type can be used for more than just the definition of XML columns. For example, you can define XML parameters and XML variables in SQL stored procedures and user-defined functions (UDFs). Such procedures and UDFs can contain XQuery or SQL/XML statements to manipulate XML documents while they remain in DB2’s internal parsed format.

Application development for DB2 pureXML is based on existing but enhanced APIs.
The traditional database APIs such as JDBC, ODBC/CLI, ADO.NET, or embedded SQL all support XQuery and SQL/XML statements as well as the exchange of XML data between a DB2 server and a client application. The JDBC 4.0 standard defines a new Java data type SQLXML to match the data type XML defined by the SQL standard. Similarly you can define XML host variables in COBOL, C, PL/1, and Assembler. With DB2 pureXML, applications can often avoid XML parsing, because DB2 stores XML documents in a parsed format. The parsed storage allows you to extract or update document fragments or individual values without having to parse the XML data in your application. Applications send appropriate XML query or update statements to DB2 instead of fetching and parsing full documents. As a result, using DB2 pureXML leads to less application code, reduced application complexity, and higher end-to-end performance. Both the DB2 Control Center and IBM Data Studio support DB2 pureXML through a variety of wizards and visual interfaces. For example, you can view the tree structure of XML documents, create XML indexes with point-and-click into XML documents, design and register XML Schemas, or build XQuery and SQL/XML statements with context assist in Data Studio’s statement editor.
1.4
BENEFITS OF DB2 PUREXML OVER ALTERNATIVE STORAGE OPTIONS FOR XML DATA
Prior to the availability of DB2 pureXML, the two main storage options for XML data in relational databases were LOB storage and shredding:
• The LOB storage approach stores full XML documents in their textual form in character or binary large object columns (CLOB or BLOB). Other columns in the same table typically contain document identification numbers or other information that helps applications to identify specific XML documents for retrieval or replacement. The main problem of this approach is that the XML documents are stored as if they were arbitrary pieces of text. The XML structure is ignored and not immediately visible. Therefore any operation that needs to access individual elements or attributes in a document requires XML parsing. For example, any query that extracts element values requires XML parsing at runtime. The resulting parsing overhead for query and update execution is a major performance problem that renders LOB storage inadequate for most XML applications.
• Shredding (decomposing) XML documents into relational tables converts XML data into relational format. Shredding first requires a design stage where an administrator maps XML elements and attributes to relational columns. When XML documents are inserted, they are parsed, broken up, and only their atomic data values are retained (see Figure 1.4). These values are inserted into the relational target tables by a series of INSERT statements. After an XML document has been shredded, its values are stored in these tables without the original XML tags. Depending on the complexity of the XML documents, shredding can require dozens or hundreds of relational tables to represent all the hierarchical relationships among the original XML elements and attributes. In many real-world XML applications this complexity is staggering such that even the mapping task is considered prohibitively expensive or unfeasible. Queries over decomposed XML data often require multi-way SQL joins that tend to be difficult to develop and tune. Changes or variability in the XML input format often break the mapping to the relational database schema, which incurs time-consuming maintenance. A fixed schema mapping that is costly to change negates the flexibility for which XML is typically used.

DB2 pureXML has been designed to overcome the problems that are inherent in LOB storage and shredding. The advantages of DB2 pureXML and its native XML storage format include:
• Retaining awareness of the internal structure of the XML data: Contrary to LOB storage, DB2 pureXML stores XML in a parsed tree format that explicitly represents the structure of each XML document. As a result, applications can query and update XML data using XQuery, XPath, and SQL/XML without XML parsing at runtime. This is a critical performance benefit. Additionally, query performance can be enhanced by creating indexes on specific elements and attributes in the XML documents.
Figure 1.4
DB2 pureXML and alternative XML storage options (three panels: LOB storage stores XML as text in a CLOB column; shredding converts XML to relational format, with a shredder and a schema mapping feeding regular relational tables; DB2 pureXML stores XML as XML in an XML column with an XML index)
• Keeping business objects intact: DB2 pureXML stores each XML document as a cohesive unit that belongs to one row in a table, providing a very intuitive storage and processing model for the application developer. In contrast, XML shredding scatters the values of each XML document over a number of tables. Hence, shredding can result in an unwieldy relational schema that is difficult to understand and inefficient for queries and the reconstruction of XML documents.
• Schema flexibility: While shredding requires all XML documents to adhere to a single XML Schema that is mapped to relational tables, DB2 pureXML can store documents for variable or evolving schemas in the same XML column. The cost of schema evolution is much lower for DB2 pureXML than for a shredding approach.
• Faster application development: Because DB2 pureXML does not require any schema mapping and uses a single XML column instead of complex relational schema, prototyping and designing applications can be much simpler with DB2 pureXML than with shredding.
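The parsing argument in the comparison above can be illustrated with a toy model. The two Python classes below are hypothetical and say nothing about how DB2 actually stores data; they only count how often documents are parsed when text is stored as-is versus when it is parsed once at insert time.

```python
import xml.etree.ElementTree as ET

class LobStore:
    """Keeps documents as text; every query has to parse every document."""
    def __init__(self):
        self.rows = []
        self.parse_count = 0

    def insert(self, xml_text):
        self.rows.append(xml_text)

    def query(self, path):
        values = []
        for text in self.rows:
            self.parse_count += 1  # parsing happens at query time
            values.extend(e.text for e in ET.fromstring(text).findall(path))
        return values

class ParsedStore:
    """Parses each document once at insert time and keeps the tree."""
    def __init__(self):
        self.rows = []
        self.parse_count = 0

    def insert(self, xml_text):
        self.parse_count += 1  # parsing happens once, at insert time
        self.rows.append(ET.fromstring(xml_text))

    def query(self, path):
        return [e.text for root in self.rows for e in root.findall(path)]

docs = ['<customerinfo><phone>905-555-7258</phone></customerinfo>',
        '<customerinfo><phone>416-555-2937</phone></customerinfo>']
lob, parsed = LobStore(), ParsedStore()
for d in docs:
    lob.insert(d)
    parsed.insert(d)

# Run the same query three times against both stores.
for _ in range(3):
    lob_result = lob.query("phone")
    parsed_result = parsed.query("phone")

print(lob.parse_count, parsed.parse_count)
```

Both stores return the same values, but the text-based store parses two documents for each of the three queries, while the parsed store parses each document exactly once.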
1.5
XML SOLUTIONS TO RELATIONAL DATA MODEL PROBLEMS
The data model that you use for your business data should allow for an easy and intuitive representation of your data and should efficiently support the most critical usage and access patterns. If the data being modeled is naturally tabular, it is typically better to represent it in relational format than as XML. However, there are cases where the relational model is not necessarily the best choice and sometimes even a poor choice to hold your data. The following are some situations where an XML representation tends to be more beneficial than the relational format.
1.5.1 When the Schema Is Volatile
Problem with relational data: If the schema of the data changes often, then a relational representation of the data is subject to costly relational schema changes. Although some forms of schema modification are relatively painless in relational databases, such as adding a new column to a table, other forms are more involved, such as dropping a column or changing the type of a column. Still other forms of schema modification are extremely difficult, such as normalizing one table into multiple tables. Changing the tables means that the SQL statements in the applications that access them must also be changed.
Solution with XML data: Portions of the schema that are volatile can be expressed as a single XML column. The self-describing and extensible nature of XML allows seamless handling of schema variability and evolution. Changes in the XML document format are accommodated without changing tables or columns in the database and typically without breaking existing XML queries.
1.5.2 When Data Is Inherently Hierarchical in Nature
Problem with relational data: Data that is inherently hierarchical or recursive is often difficult to represent in relational schemas. Examples include a bill of materials, engineering objects, or biological data. A bill of materials explosion can be stored in a relational database but reconstructing it in parts or in full might require recursive SQL.
Solution with XML data: Since XML is a hierarchical data model, it is a much more natural fit for inherently hierarchical business data. Using XML allows simple, navigational data access to replace complex set operations, which would be required if the same data was represented in tabular format.
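A small bill of materials shows the contrast between navigational access and recursive SQL. This is a sketch under stated assumptions: it uses Python's standard library with SQLite instead of DB2, and the bicycle parts, the part element, and the bom table are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical bill of materials; nesting expresses the containment hierarchy.
bom_xml = """
<part name="bicycle">
  <part name="frame"/>
  <part name="wheel">
    <part name="rim"/>
    <part name="spoke"/>
  </part>
</part>
"""

# Navigational access: walk the tree in document order.
root = ET.fromstring(bom_xml)
xml_parts = [p.get("name") for p in root.iter("part")]

# Relational access: the hierarchy is flattened into (part, parent) rows and
# has to be reassembled with a recursive query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bom(part TEXT, parent TEXT)")
con.executemany("INSERT INTO bom VALUES (?, ?)", [
    ("bicycle", None), ("frame", "bicycle"), ("wheel", "bicycle"),
    ("rim", "wheel"), ("spoke", "wheel"),
])
sql_parts = [row[0] for row in con.execute("""
    WITH RECURSIVE explosion(part) AS (
        SELECT part FROM bom WHERE parent IS NULL
        UNION ALL
        SELECT b.part FROM bom b JOIN explosion e ON b.parent = e.part
    )
    SELECT part FROM explosion
""")]

print(xml_parts, sorted(sql_parts))
```

Both approaches find the same five parts. The XML version gets the nesting for free from the document, whereas the relational version must rebuild it from parent references.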
1.5.3 When Data Represents Business Objects
Problem with relational data: If application data represents business objects, such as insurance claim forms, then it is often beneficial to keep the data items that comprise a particular claim together, instead of spreading them over a set of tables. This benefit is particularly important when the individual data items of a claim form have no valid business meaning by themselves and can only be interpreted in the context of the complete form. Normalizing the claims across dozens of relational tables means that applications deal with a complex and unnatural fragmentation of their business data. Such normalization can increase complexity and the chance for errors.
Solution with XML data: XML enables you to represent even complex business objects as cohesive and distinct documents while still capturing all the relationships between the data items that comprise the business object. Representing each claim form (business object) as a single XML document in a single row of a table provides a very intuitive storage model for the application developer and enables rapid application development.
1.5.4 When Objects Have Sparse Attributes
Problem with relational data: Some applications have a large number of possible attributes, most of which are sparse; that is, they apply to very few objects. A classic example is a product catalog where the number of different product attributes can be huge, including size, color, weight, length, height, material, style, weave, voltage, resolution, water resistance, and a near endless list of other properties. For any given product, only a subset of these attributes is relevant. One possible relational schema is to have one column per attribute, which means a very large percentage of the cells in the table contain NULL values. Large numbers of NULLs are undesirable and can be inefficient. A different relational approach for such sparse data is a three-column table that stores several name/value pairs for each product ID. In this name/value pair approach, the attribute names are not column names but values in a VARCHAR column. This design prevents relational database systems from accurately estimating constraint selectivity and generating efficient query plans. Finally, defining and enforcing constraints, such as uniqueness for a certain attribute, is extremely difficult. Hence, data quality and integrity suffer.
Solution with XML data: The beauty of XML is that elements and attributes can be optional, so they are simply omitted if they don’t apply for a specific product. Neither NULL values nor name/value pairs are needed. The XML Schema can define a very large number of optional elements, but only a few of them are used for any given object. While every row in a relational table has to have the exact same columns, XML documents in an XML column can have different elements from one row to the next. Also, an XML index for an optional element is very small if this element appears only in a small percentage of the documents (rows). This is a clear advantage over relational indexes which have exactly one entry per row.
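The point about small indexes on optional elements can be sketched with a plain dictionary. The code below is a conceptual illustration in Python, not DB2's index implementation; the four catalog entries, their element names, and the helper build_index are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: each document carries only the properties that apply.
catalog = {
    1: '<product><name>shirt</name><size>M</size><color>blue</color></product>',
    2: '<product><name>drill</name><voltage>18</voltage></product>',
    3: '<product><name>monitor</name><resolution>1920x1080</resolution></product>',
    4: '<product><name>lamp</name><voltage>230</voltage></product>',
}

def build_index(rows, path):
    """Map each value found at `path` to the ids of the rows that contain it."""
    index = {}
    for row_id, text in rows.items():
        for elem in ET.fromstring(text).findall(path):
            index.setdefault(elem.text, []).append(row_id)
    return index

voltage_index = build_index(catalog, "voltage")  # optional, sparse element
name_index = build_index(catalog, "name")        # present in every document

voltage_entries = sum(len(ids) for ids in voltage_index.values())
name_entries = sum(len(ids) for ids in name_index.values())
print(voltage_index, voltage_entries, name_entries)
```

The index on voltage holds two entries for a four-row catalog because only two documents contain that element, while the index on name holds one entry per row.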
1.5.5 When Data Needs to be Exchanged
Problem with relational data: If you export a set of rows from a relational table and send them to another application or organization, the recipient cannot interpret the data without additional metadata that describes the columns. This separation of data from metadata in the relational world poses a particular problem if your relational schema has changed since the last time you sent data.
Solution with XML data: XML data is self-describing. The XML tags are metadata and describe the values that they enclose. The nesting of XML elements further defines the relationship between data items.
1.6
SUMMARY
XML, the extensible markup language, acts as a flexible and self-describing data format for data exchange, web services, and service-oriented architectures. XML is also a hierarchical data model that is inherently different from the relational model. While relational data processing is
based on rigorous and predefined schemas that allow for limited flexibility, XML is well-suited to represent data with variable or evolving schemas. XML is also commonly used as a data format for semi-structured data or to integrate structured and unstructured data. Depending on the performance and flexibility requirements of particular applications, you will find that in some cases XML is a better choice than a relational schema, and in other cases relational data has advantages over XML. Many scenarios also exist in which a hybrid approach, that is, a mix of XML and relational data, is the best solution. Considerations for hybrid data models are discussed further in the next chapter. DB2 pureXML provides sophisticated capabilities for storing, indexing, querying, updating, and validating XML documents. The pureXML technology and its native XML storage format provide significantly higher performance and flexibility than alternative storage options for XML data, such as LOBs or shredding. DB2 pureXML also enables seamless integration of XML and relational data.
CHAPTER 2
Designing XML Data and Applications
This chapter looks at several design issues in the world of XML documents. Sometimes you might get involved in the design of a specific format for your XML documents and you will find that the design decisions made at this point can have a big impact on how your application processes XML. Therefore, this is the first stage of XML application design. In many other cases, the format of the XML documents that you need to process may have already been designed and decided by the time you get involved. Many vertical industries and consortia define specific XML Schemas to standardize the XML document formats that are used to exchange and process information within a particular industry. Some of them are discussed in Chapter 16, Managing XML Schemas. Even if you work with a predefined XML format, there are still decisions to be made, such as the most suitable granularity in which you should store XML documents or document fragments.
In this chapter you learn
• How to choose between XML elements and attributes (section 2.1)
• How to represent data as XML values and metadata as XML tags (section 2.2)
• How to design documents with an appropriate size and scope (section 2.3)
• How to decide on a “good” mix of XML and relational data (section 2.4)
2.1
CHOOSING BETWEEN XML ELEMENTS AND XML ATTRIBUTES
A common question is when to use attributes and when to use elements, and whether this choice affects performance. It turns out that this is much more of a data modeling question than a performance question. As such, this question is as old as SGML, the precursor of XML, and has been
hotly debated with no universally accepted consensus. However, a key thing to remember is that XML elements are more flexible than attributes because they can be repeated and nested. Table 2.1 shows an example of an XML document with and without attributes. Both documents logically represent the same business data. They contain information about a book called “Database Systems”, written by authors “John Doe” and “Peter Pan” who have id numbers 47 and 58 respectively, and the price of the book is 29, but there is no information in either document about the currency of the price. In the document on the left of Table 2.1, price and title are child elements of the element book, and the author id is a child element of the element author. This approach is certainly a decent way of modeling the data. Alternatively, the document on the right has price and title as attributes of the element book, and id as an attribute of the element author. In general, both versions of the document, with and without attributes, can be reasonable choices. There is no immediate way to decide whether one of the two document formats is “better” than the other. Table 2.1
An XML Document with and without Attributes
XML document without attributes:
XML document with attributes:
47 John Doe 58 Peter Pan Database systems 29 SQL relational
John Doe Peter Pan SQL relational
The document with attributes might be appealing because it is shorter. It contains 200 nonwhitespace characters as opposed to 248 in the document without attributes. An XML parser needs to look at every single character of a document, which generally means that shorter documents can be parsed faster. This reduction in parsing times may matter if you are designing an XML message format for very high-volume processing with near real-time performance requirements and throughput targets such as thousands of messages per second. However, many XML applications do not fall into this category and performance should be a secondary concern during XML modeling.
More important is the flexibility and extensibility of the XML format, which is usually why XML is chosen to begin with. In the example in Table 2.1, chances are that the format of the price information eventually needs to be extended. This extension is easy in the document on the left where price is an element. For example, you can add an attribute currency to the price element to make it more descriptive. Also, as the business expands to international markets, you can easily repeat the price element multiple times to reflect the price of the book for different countries (see Figure 2.1).

47 John Doe 58 Peter Pan Database systems 29 5735 35.80 SQL relational
Figure 2.1 Document with multiple price elements
This extension of the price element has the very desirable property that XPath queries that worked for the old document format continue to work without changes for the new format. For example, the XPath /book/price returns the single price element from the document on the left in Table 2.1, but also all three price elements with their currency information from the new document format in Figure 2.1. This property helps to ensure seamless operation of applications during such a schema evolution. In the document on the right side of Table 2.1, where price is an attribute, such an extension is a lot harder to make if you want to keep using attributes. The existing price attribute cannot be extended to contain another nested attribute, and an attribute by the name of price can only occur once for the book element. You could certainly remove the existing price attribute and use price elements instead. This change implies that for older documents the XPath to the price information is /book/@price whereas for newer books it is /book/price. Thus, this change is invasive and indicates that you probably should have used elements to begin with. In such a situation you should not use multiple price attributes with different names, as shown in Figure 2.2. This design has a variety of undesirable consequences. First of all, XPath queries need
Chapter 2
Designing XML Data and Applications
to be changed each time you introduce a new currency to your business. Second, this design makes it more complicated to retrieve all price information with a single query. Third, if your queries use search conditions on the price attributes, then you will have to define a separate XML index for each currency, instead of just two indexes (one for price and one for currency). These problems stem from the fact that the currency information is part of your business data, not part of the metadata. Hence, the currency should be a value and not part of a tag name. The use of tags and values is discussed further in section 2.2.
Figure 2.2 Bad XML design with different names for price attributes
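The difference between the two designs can be sketched in a few lines of Python. This is an illustration only: it uses the standard library's ElementTree rather than DB2, the tag names follow the text, and all prices and the attribute names price_USD, price_EUR, and price_GBP are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: tag names follow the text, all values are invented.
OLD_DOC = "<book><title>Database systems</title><price>29</price></book>"
NEW_DOC = (
    "<book><title>Database systems</title>"
    '<price currency="USD">29</price>'
    '<price currency="EUR">25</price>'
    '<price currency="GBP">22</price></book>'
)
# The design the text warns against: the currency is part of the attribute name.
BAD_DOC = '<book price_USD="29" price_EUR="25" price_GBP="22"/>'


def prices_from_elements(xml_text):
    """Counterpart of the XPath /book/price; works for one or many prices."""
    book = ET.fromstring(xml_text)
    return [(p.get("currency"), p.text) for p in book.findall("price")]


def prices_from_attributes(xml_text):
    """With currencies in attribute names, the code must inspect the names."""
    book = ET.fromstring(xml_text)
    prefix = "price_"
    return sorted(
        (name[len(prefix):], value)
        for name, value in book.attrib.items()
        if name.startswith(prefix)
    )


print(prices_from_elements(OLD_DOC))
print(prices_from_elements(NEW_DOC))
print(prices_from_attributes(BAD_DOC))
```

The same path expression serves the old and the new element-based format, whereas the attribute-based variant has to parse attribute names to recover the currency.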
Also note that the XML standard specifies that elements are ordered while attributes are unordered. For example, the three price elements in Figure 2.1 are in a fixed order, and this order is guaranteed when the document is parsed, stored, queried, or otherwise processed. In contrast, the three price attributes in Figure 2.2 do not have a significant order within the book element. They could appear in a different order and the document would still be considered “the same.” Hence, if the relative order among your data items is important, use elements instead of attributes. Although you could model all data without attributes, they can be a very intuitive choice for data items that are known in advance to never repeat (per element) nor have any subfields. Attributes contribute to somewhat shorter XML because they have only a single tag as opposed to elements, which have a start tag and an end tag. Shorter attribute tags are at most a minor performance bonus rather than an incentive to convert elements to attributes, especially when data modeling considerations actually call for elements. In DB2, attributes can be used in queries, updates, predicates, and index definitions just as easily as elements. There is generally no significant performance difference between accessing or updating elements versus attributes when XML documents are stored in DB2. Both elements and attributes can be defined as mandatory or optional in an XML Schema. As another example, let’s look at the XML document in Figure 2.3, which contains information about a department with two employees. The document uses attributes for the department and employee identifiers. This approach seems to make sense because each employee and department will always have just one ID value. Furthermore, an element is used for the employee telephone information, which allows an employee to have multiple occurrences of the phone element if needed. 
It is also extensible in case you later need to break telephone numbers into fragments. For example, the phone element could have child elements for country code, area code, and extension, which would not be possible if phone was an attribute.
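The point about ordering can be checked with a small sketch. It assumes Python 3.8 or later, whose standard library offers ET.canonicalize (Canonical XML 2.0), and it uses invented sample documents. Canonicalization writes attributes in a normalized order but leaves the order of child elements untouched.

```python
import xml.etree.ElementTree as ET


def same_document(doc_a, doc_b):
    """Compare two XML texts after canonicalization (C14N 2.0)."""
    return ET.canonicalize(doc_a) == ET.canonicalize(doc_b)


# Invented samples: the same attributes and elements in a different order.
ATTRS_1 = '<book price_USD="29" price_EUR="25"/>'
ATTRS_2 = '<book price_EUR="25" price_USD="29"/>'
ELEMS_1 = ('<book><price currency="USD">29</price>'
           '<price currency="EUR">25</price></book>')
ELEMS_2 = ('<book><price currency="EUR">25</price>'
           '<price currency="USD">29</price></book>')

print(same_document(ATTRS_1, ATTRS_2))  # attribute order is not significant
print(same_document(ELEMS_1, ELEMS_2))  # element order is significant
```

If the relative order of data items matters to your application, only the element-based representation preserves it.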
Figure 2.3 A sample XML document
The XML document in Figure 2.3 also raises another design question, which we discuss in section 2.3: Is it better to keep the information for all employees of a department in one document, or is it better to have one XML document per employee?
2.2
XML TAGS VERSUS VALUES
The idea of XML as an extensible markup language is that the markup, which consists of all the element and attribute tags, describes the enclosed data values. The ability to use custom tags for markup makes XML a self-describing data format. The XML tags can also be considered metadata. Hence, XML documents conveniently combine data and metadata in a universally accepted format.
An important aspect of designing XML documents is to distinguish clearly between data and metadata. The metadata should be represented as element and attribute names, the data as element and attribute values. This approach is analogous to relational modeling, where table and column names are metadata, and the values in the columns are the actual data. In XML it’s almost always a bad idea to represent metadata as values instead of tags, or actual data as tags instead of values.
Let’s look at the examples in Table 2.2 and Table 2.3. The document on the left side of Table 2.2 contains information about the brand, price, and year of a car. The brand is Honda, the price is 5000, and the year is 1996. The terms “brand”, “price”, and “year” constitute meta information for the values Honda, 5000, and 1996. Hence, Honda is a data value, not metadata. Therefore it should be an XML element value, not an element name. The XML document on the right side of Table 2.2 is a better representation of the same data. There the term “brand” is used as an element name (meta information) for the value Honda. Imagine yourself modeling the same data in a relational table. You would not use Honda as a column name in a table.
Avoiding business data in tag names has several advantages:
• If you are using an XML Schema, you don’t need to add new element definitions to your XML Schema each time your business handles a new brand of car.
• You can always use the XPath /car/brand to retrieve the brand from a particular car document. Otherwise, if brand names are tags, many different or more complicated XPath expressions are necessary.
• If you search for cars by brand, then you can use XML indexes in a much simpler and more intuitive manner if the brand names are element or attribute values rather than tag names.
Table 2.2 Business Data as Tags Versus Values
Business data as element name (not recommended): 5000 1996
Business data as element value (recommended): Honda 5000 1996
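To make the contrast concrete, here is a minimal Python sketch using the standard library's ElementTree. Both documents are reconstructed from the description in the text, so the exact nesting of the not-recommended variant (a Honda element under car) is an assumption, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reconstructed from the prose; the nesting of BRAND_AS_TAG is an assumption.
BRAND_AS_TAG = "<car><Honda><price>5000</price><year>1996</year></Honda></car>"
BRAND_AS_VALUE = ("<car><brand>Honda</brand><price>5000</price>"
                  "<year>1996</year></car>")


def brand_from_value(xml_text):
    """Counterpart of the XPath /car/brand: one fixed path for every brand."""
    return ET.fromstring(xml_text).findtext("brand")


def brand_from_tag(xml_text, known_brands):
    """When the brand is a tag name, the caller must supply a list of brands."""
    car = ET.fromstring(xml_text)
    for child in car:
        if child.tag in known_brands:
            return child.tag
    return None  # a brand not in the list goes unnoticed


print(brand_from_value(BRAND_AS_VALUE))
print(brand_from_tag(BRAND_AS_TAG, {"Honda", "Toyota"}))
print(brand_from_tag(BRAND_AS_TAG, {"Toyota"}))
```

The second helper needs to be told every brand in advance, which mirrors the schema and query maintenance burden described in the list above.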
What happens if you use meta information, such as the terms “brand”, “price”, and “year”, as values rather than element or attribute names? This is shown in the left side of Table 2.3 where the XML document consists of very generic tag names, such as object, type, field, name, and value. These tags are not very descriptive, which is contrary to the concept of XML as a self-describing data format. You see that the brand, price, and year of the car are represented by pairs, which consist of a name and a value. However, the names are actually XML attribute values, not descriptive tag names. This approach is commonly referred to as Name/Value Pairs (NVP), Key-Value Pairs (KVP), or the Entity-Attribute-Value (EAV) model.
Table 2.3 Name/Value Pairs (Metadata as Tags Versus Values)
Metadata as values, aka Name/Value Pairs (often bad):
Metadata as element names (good): Honda 5000 1996
The Name/Value Pair approach to data modeling also sometimes appears in the relational world when a table with three columns (id, name, and value) is used. This approach may seem attractive when dealing with entities that can have hundreds or thousands of attributes, but only a small number of them apply to any individual entity. If you were to represent each possible attribute by a column in a relational table, you might exceed the maximum row length or the maximum number of columns in a table. Nevertheless, the Name/Value Pairs approach has very significant and inherent drawbacks, which are similar for XML and relational data. In particular:
• Defining business rules and constraints for Name/Value Pairs is very difficult and often impossible. You cannot define an effective XML Schema to control and constrain this type of XML data. If you use the “better” XML format shown in the right side of Table 2.3, an XML Schema can easily specify that the value of the price element has to be greater than zero, and the value of the year element has to be a four-digit integer between 1950 and 2099. In the Name/Value Pairs in the left column of Table 2.3, price and year are represented by the same XML attribute called value. An XML Schema does not allow you to specify that if there is an attribute called name with the value price then the value of the attribute value in the same field element must be greater than zero.
• Name/Value Pairs handle all data as strings (text). Since the attribute value can contain arbitrary data values, it cannot be typed as INTEGER, DECIMAL, DATE, or TIMESTAMP. Handling all data as strings leads to data quality issues because proper data types cannot be enforced. Another consequence is that any indexes and comparisons have to treat the data values as strings. If you search for cars with a price greater than “5000”, you will also find cars with prices such as “600” or “900” because these strings are greater than the string “5000”. You can solve this problem with appropriate cast operations in your queries, but those usually preclude the use of indexes, which means performance suffers.
• Writing queries against Name/Value Pair data is very complex. As an example, assume that you need to retrieve the years of all Honda cars that have a price greater than 5000. The corresponding XPath expression for the Name/Value Pair data is shown in Figure 2.4, followed by the same query for the “regular” XML data in the right side of Table 2.3. The difference in complexity is striking, and it is even greater for more advanced search queries.
-- XPath query to retrieve the years of all Honda cars with a
-- price greater than 5000 from Name/Value Pair XML data:
/object[@type="car" and
        field[@name = "brand" and @value = "Honda"] and
        field[@name = "price" and @value > "5000"]
       ]/field[@name="year"]/data(@value)

-- Same query for regular XML Data:
/car[brand="Honda" and price > 5000]/year

Figure 2.4 Complexity when querying Name/Value Pairs
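The string-comparison pitfall is easy to reproduce outside a database. The following Python sketch is an illustration under stated assumptions: the Name/Value Pair layout (object, field, name, value) is inferred from the XPath in Figure 2.4, the two sample cars are invented, and the filtering is done in plain Python rather than in XPath.

```python
import xml.etree.ElementTree as ET

# Layout inferred from the XPath in Figure 2.4; the sample cars are invented.
NVP_DOCS = [
    '<object type="car"><field name="brand" value="Honda"/>'
    '<field name="price" value="600"/><field name="year" value="1996"/></object>',
    '<object type="car"><field name="brand" value="Honda"/>'
    '<field name="price" value="7500"/><field name="year" value="2004"/></object>',
]


def fields(xml_text):
    """Collect the name/value pairs of one document; every value is a string."""
    obj = ET.fromstring(xml_text)
    return {f.get("name"): f.get("value") for f in obj.findall("field")}


def years_string_compare(docs, min_price):
    """Compares prices as strings, as an untyped value attribute forces."""
    return [f["year"] for f in map(fields, docs)
            if f["brand"] == "Honda" and f["price"] > min_price]


def years_numeric_compare(docs, min_price):
    """Casts each price before comparing, which gives the intended answer."""
    return [f["year"] for f in map(fields, docs)
            if f["brand"] == "Honda" and float(f["price"]) > min_price]


print(years_string_compare(NVP_DOCS, "5000"))
print(years_numeric_compare(NVP_DOCS, 5000))
```

The string version also returns the car priced at 600, because the string "600" sorts after "5000". The cast fixes the result, at the cost the text describes: a cast in the predicate usually prevents index use.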
2.3
CHOOSING THE RIGHT DOCUMENT GRANULARITY
When you design your XML application, and in particular your XML document structure, you may have a choice as to which business data is kept together in a single XML document. Is it better to keep a lot of data in a large XML document, or is it better to use many small documents instead? The proper scope of any given document is a critical design decision. The general recommendation is to choose an XML document granularity such that one document represents one logical business object from an application point of view. Another guideline is to use an XML document granularity that matches the anticipated predominant granularity of data access or data exchange. Very often the logical business objects match the predominant granularity of data access, so these two guidelines lead to the same result. What constitutes a small, medium, or large XML document? Very roughly, XML documents up to 50KB are typically considered small, documents between 50KB and 1MB are often considered medium, and documents of more than 1MB are considered large. Documents in the range of hundreds of Megabytes or a few Gigabytes are huge, relatively rare, and almost always the result of combining a large number of smaller XML documents. Let’s look at the example in Figure 2.5, which shows three design options to represent data for several orders. Each order has a date, a customer name, and several parts, which have a key, a quantity, and a price. Let’s assume that you have to store and manage these orders for a particular application that treats each individual order as a separate logical business object. It typically receives and processes one order at a time, and a single order is the predominant level of access or transmission. In case (a) on the left, multiple orders are combined in one large document (coarse granularity). This approach can be useful when you need to archive or FTP a certain batch of orders, such as all orders for the past week, for example. 
Storing this large document as-is in a database is only a good idea if this batch in its entirety represents a meaningful business object to your application and users. This is not the case in our example. Since our fictitious application typically reads and writes one order at a time, storing many orders in a single large document would result in suboptimal performance. In general, combining many independent business objects in a single document is not recommended. DB2 uses indexes over XML data to filter on a per-document level. Therefore, the finer the XML document granularity, the higher the potential benefit from index-based access. Although DB2 pureXML helps you avoid a lot of XML parsing in the application layer, some applications might still use a DOM parser to ingest XML documents and run into performance problems or failures if the documents are too large. Many XML design and editing tools also use DOM parsers and are often unable to handle very large XML documents. Therefore, debugging and correcting XML documents is much easier if they are small. In case (b), each order is a separate XML document (medium granularity). This approach matches the nature of the application and not only provides good performance but is also very
intuitive for the application developer. One row in the database contains one business object for the application and no joins are required to retrieve all data for this object. Case (c) on the right represents fine granularity. Each order and each part is stored as a separate XML document. This approach can be a very good choice if the information about each part is itself a separate business object of interest and is often accessed and processed independently from the order it belongs to. In this example, however, part information has no real business meaning on its own and is dependent on an order. For example, the quantity and the price of a part are relevant only for a specific order. A different order can contain the same part with a different price and quantity. Typically, an application always needs to see all parts of an order and would never retrieve a part by itself without order information. Another reason why case (c) might not be useful is that having part and order information in separate documents would require joins between them. These reasons make case (b) desirable because the XML documents already represent this join in their structure.
Figure 2.5 Different document granularities
In a nutshell, choose the XML document granularity with respect to the logical business objects and the anticipated predominant granularity of access. When in doubt, it is usually better to lean towards finer granularity and smaller XML documents.
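If orders arrive as one coarse-grained batch, as in case (a), they can be broken into one document per order before they are stored. The following Python sketch shows the idea with the standard library only. The element and attribute names (orders, order, part, key, quantity) and all values are assumptions made for the example, and the splitting happens in the application rather than in DB2.

```python
import xml.etree.ElementTree as ET

# Invented batch document; names and values are assumptions for illustration.
BATCH = """
<orders>
  <order id="1">
    <date>2009-05-01</date><customer>Doe</customer>
    <part key="5"><quantity>5</quantity><price>5.00</price></part>
    <part key="11"><quantity>1</quantity><price>19.95</price></part>
  </order>
  <order id="2">
    <date>2009-05-02</date><customer>Doe</customer>
    <part key="23"><quantity>2</quantity><price>1.99</price></part>
  </order>
</orders>
"""


def split_orders(batch_xml):
    """Return one serialized XML document per order element in the batch."""
    root = ET.fromstring(batch_xml)
    # strip() drops the whitespace that followed each order inside the batch.
    return [ET.tostring(order, encoding="unicode").strip()
            for order in root.findall("order")]


for doc in split_orders(BATCH):
    order = ET.fromstring(doc)
    print(order.get("id"), len(order.findall("part")), "part(s)")
```

Each resulting string is a complete document for one business object, which matches the medium granularity of case (b): the parts stay inside their order, so no join is needed to reassemble it.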
2.4
USING A HYBRID XML/RELATIONAL APPROACH
XML is not the grand solution for all data management problems. As discussed in Chapter 1, Introduction, XML can provide significant advantages if the structure of your data is highly variable, evolves over time, or is hard to represent in a simple relational schema. Also, if you receive and send business objects in XML format, you can often improve performance and simplify applications if you also store these objects in XML format. Storing XML objects in XML format avoids complex mappings and costly transformations. However, sometimes the best solution is to store some of your data in relational format and some of your data in XML format, which is called hybrid storage. There are no definitive rules that describe precisely how to determine the right mix of XML and relational data. The right mix depends on the specific characteristics and requirements of a given application, or set of applications, that access the data. The following considerations can help you find the right design for your application. It is quite common that business objects such as orders, trades, sales records, customer records, emails, and blog posts consist of a fixed header plus a highly variable body. The header contains certain data fields that are common for all business objects of the same category. The body can be very different from one business object to the next and can contain any of thousands of optional attributes. For example, a financial trade might contain a header with the trade ID, the trading date, and the IDs of the two parties involved in the trade. Although these data items are present for every trade, the elements in the body of the trade depend highly on the exact nature of this particular trade. In this case, you might want to store the header fields in relational columns and the body in an XML column of the same table. Similarly, think of XML documents such as emails, blog posts, or CRM (customer relationship management) records produced in a call center. 
CRM records often contain the customer name and identifier, the date when the customer called in to report a problem, the name or ID of the product or service that the customer needs help with, and most likely a unique identifier of the CRM record itself. This data is very regular and structured with well-defined data types and can easily be stored in relational columns. However, the body of a CRM record typically contains semi-structured information with free text as well as interspersed data fields such as dates and a user ID to track when and by whom new information gets appended. This semi-structured part of the CRM record is better stored as a whole in an XML column.
If a business object arrives as an XML document, DB2 can extract selected element or attribute values from the document as part of the INSERT statement, without any extra XML parsing. This process is explained in more detail in Chapter 11, Converting XML to Relational Data.
The benefits of storing some data fields of a business object in relational format can include the following:
• You can define primary key and foreign key constraints on relational columns, but not on any elements or attributes in an XML column.
• You can define multi-column (composite key) indexes on two or more relational columns, but you cannot define a composite key on two or more elements or attributes in an XML column.
• Relational columns can be used to define range partitioning, hash partitioning, or multidimensional clustering for a table. These cannot be defined based on elements or attributes in an XML column.
• Queries can use regular relational SQL predicates for relational columns, which some people find easier to use than XML predicates.
• If you use WebSphere Replication Server to replicate rows to another database server, you can define filtering conditions on relational columns of the source so that rows are selectively replicated only if they meet the specified condition. Such replication filters cannot be specified on XML columns.
• Relational column values can be referenced in the definition of generated columns and materialized views, but XML columns and individual XML elements and attributes cannot.
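The hybrid layout described above can be sketched with Python's built-in sqlite3 module as a stand-in for a relational database. This is not DB2 syntax: SQLite has no XML column type, so the document is kept as text, and the header fields are extracted in the application rather than inside the INSERT statement. The table name, column names, and the trade document are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical trade document: a fixed header plus a variable body.
TRADE_XML = (
    '<trade id="T100" date="2009-03-15">'
    "<buyer>P1</buyer><seller>P2</seller>"
    "<body><instrument>bond</instrument><coupon>4.5</coupon></body>"
    "</trade>"
)


def store_trade(conn, xml_text):
    """Copy the header fields into columns and keep the whole document."""
    trade = ET.fromstring(xml_text)
    conn.execute(
        "INSERT INTO trades (trade_id, trade_date, buyer_id, seller_id, doc) "
        "VALUES (?, ?, ?, ?, ?)",
        (trade.get("id"), trade.get("date"),
         trade.findtext("buyer"), trade.findtext("seller"), xml_text),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades ("
    " trade_id TEXT PRIMARY KEY,"   # key constraint on a relational column
    " trade_date TEXT NOT NULL,"
    " buyer_id TEXT NOT NULL,"
    " seller_id TEXT NOT NULL,"
    " doc TEXT NOT NULL)"           # stand-in for an XML column
)
# Composite index over two relational columns.
conn.execute("CREATE INDEX trades_buyer_date ON trades (buyer_id, trade_date)")
store_trade(conn, TRADE_XML)

row = conn.execute(
    "SELECT trade_id, doc FROM trades WHERE buyer_id = ? AND trade_date = ?",
    ("P1", "2009-03-15"),
).fetchone()
print(row[0], ET.fromstring(row[1]).findtext("body/instrument"))
```

The primary key and the composite index illustrate the first two benefits in the list: both are defined on the extracted columns, while the variable body stays inside the stored document.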
2.5
SUMMARY
Designing an XML application begins with designing the XML data. The more appropriately you design your XML data for your business needs and application, the easier it will be to process and manage this XML data efficiently. Both your applications and your database will run best if the scope and granularity of your XML documents match the logical business objects of your application as well as the most frequent granularity of data access or data exchange. Try to favor smaller documents rather than larger documents. For the low-level design of your XML documents, keep in mind that XML elements are more flexible than attributes because they can be repeated and nested. You often want to favor XML elements over attributes to ensure future extensibility of your XML data. Also, make sure that meta information that describes your data is represented by XML element and attribute names, not by values. Conversely, the actual data items that your applications need to read and manipulate should be XML element and attribute values, not XML tags. Remember the analogy to the columns in relational tables, where column names represent metadata while the column content is your business data.
Often you do not have the luxury to design your XML document format. Many XML applications are forced to consume and process XML documents in a format that has previously been designed by other parties and cannot be changed. You can still choose to let DB2 split those documents into smaller fragments if that better matches the predominant granularity of access. Additionally, it can be advantageous to extract a few selected elements or attribute values from each document into relational columns. Chapter 5, Moving XML Data and Chapter 11, Converting XML to Relational Data explain DB2’s capabilities for splitting XML documents and hybrid XML/relational storage.
CHAPTER 3
Designing and Managing XML Storage Objects
In this chapter we discuss how to create and configure a database, table spaces, and tables to manage XML data. This discussion includes topics such as hierarchical XML storage structures, XML compression and inlining, monitoring and measuring XML storage consumption, reorganization, and partitioning of tables and databases that contain XML data. The topics in this chapter are organized as follows:
• Understanding XML document trees and their pureXML storage representation. These concepts are platform independent (sections 3.1 and 3.2)
• Managing XML storage in DB2 for Linux, UNIX, and Windows (sections 3.3 through 3.10)
• Managing XML storage in DB2 for z/OS (sections 3.11 and 3.12)
• XML parsing and XML memory options specific to DB2 for z/OS (section 3.13)
When you create a database that will contain XML data, one of the first design choices is to choose a code page. The recommended code page is UTF-8 Unicode. The benefits of Unicode are explained in Chapter 20, Understanding XML Data Encoding. It is also possible to manage XML in a non-Unicode database, which allows you to easily add XML to existing databases that do not use UTF-8. DB2 9 for z/OS allows XML columns in databases and table spaces of any supported encoding. In DB2 9.5 and 9.7 for Linux, UNIX, and Windows, all new databases use UTF-8 as the default code page. However, you can specify a non-Unicode code page in the CREATE DATABASE statement, if you want.
DB2 9.1 for Linux, UNIX, and Windows is slightly more restrictive because pureXML is available only in UTF-8 encoded databases, and you must explicitly set the database code page to UTF-8 in the USING CODESET clause of the CREATE DATABASE statement:
CREATE DATABASE mydb USING CODESET utf-8 TERRITORY us
Before we discuss how XML documents are physically stored in a DB2 database, let’s look at how the XQuery Data Model defines XML document trees.
3.1
UNDERSTANDING XML DOCUMENT TREES
Since XML is a hierarchical data model, every XML document can be represented as a tree of nodes. Any query or update of XML data traverses the hierarchical structure of the XML documents. This traversal can be done most efficiently if the XML documents are physically stored in a hierarchical format. Therefore, DB2 for z/OS and DB2 for Linux, UNIX, and Windows store XML documents as trees of nodes with parent-child relationships between the nodes. These trees are defined by the XQuery Data Model (XDM) and described in this section. Further details of the XQuery Data Model are covered in Chapter 6, Querying XML Data: Introduction and XPath.
Let’s look at the XML document in Figure 3.1 as an example. It is a simple document that contains information about a customer. The outermost element, customerinfo, is called the root element. Its children are the elements name and addr as well as two occurrences of the element phone. The element addr has an attribute country as well as four child elements: street, city, state, and zip. Each phone element has an attribute called type.
<customerinfo>
  <name>Jim Noodle</name>
  <addr country="US">
    <street>555 Bailey Ave</street>
    <city>San Jose</city>
    <state>CA</state>
    <zip>95141</zip>
  </addr>
  <phone type="work">408-289-4136</phone>
  <phone type="cell">408-710-7910</phone>
</customerinfo>
Figure 3.1 Sample XML document
Figure 3.2 shows the same XML document in its tree representation. Such a tree can be constructed by parsing a textual XML document with an XML parser. In general, an XML document tree can have six different types of nodes. Element nodes, attribute nodes, text nodes, and the document node are the most common node kinds. They occur in the tree in Figure 3.2. Occasionally, XML documents can also contain comment nodes and processing instruction nodes.
Every XML element of the document in Figure 3.1 is represented by an element node in the corresponding document tree in Figure 3.2. The element nodes are white and rectangular. The textual value of each element is represented by a separate text node, shown in gray. Attribute nodes are shown with a double border. An attribute node contains all information about an attribute, including its value. The XQuery Data Model also defines that each document tree has a document node, shown in Figure 3.2 as a black circle. It is the topmost node and the parent of the root element. The document node is not visible in the textual representation of an XML document, only in its parsed hierarchical format. You will see later in this book that the document node is sometimes important when you manipulate XML documents. For example, assume you cut off the addr branch from the tree in Figure 3.2. This branch by itself does not have a document node and is therefore not a valid document tree. Hence, inserting it as a document into an XML column would fail unless you construct a new document node. Construction of a document node is shown in Chapter 5 (see section 5.7, Splitting Large XML Documents into Smaller Documents).
(document node)
  customerinfo
    name
      Jim Noodle
    addr
      country=US
      street
        555 Bailey Ave
      city
        San Jose
      state
        CA
      zip
        95141
    phone
      type=work
      408-289-4136
    phone
      type=cell
      408-710-7910
Figure 3.2 XML document tree
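The node kinds described above can be counted with a short Python sketch. It uses the DOM implementation in the standard library (xml.dom.minidom) as an approximation: the DOM is not the XQuery Data Model, but it also exposes a document node, element nodes, text nodes, and attributes. The customer document is written without whitespace between tags so that no whitespace-only text nodes appear.

```python
from collections import Counter
from xml.dom import minidom, Node

# The customer document from the text, without whitespace between tags.
CUSTOMER = (
    "<customerinfo><name>Jim Noodle</name>"
    '<addr country="US"><street>555 Bailey Ave</street><city>San Jose</city>'
    "<state>CA</state><zip>95141</zip></addr>"
    '<phone type="work">408-289-4136</phone>'
    '<phone type="cell">408-710-7910</phone></customerinfo>'
)

KIND = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
}


def count_nodes(node, counts):
    """Walk the tree depth-first and tally each kind of node."""
    counts[KIND.get(node.nodeType, "other")] += 1
    if node.nodeType == Node.ELEMENT_NODE:
        # In the DOM, attributes hang off their element, not in childNodes.
        counts["attribute"] += node.attributes.length
    for child in node.childNodes:
        count_nodes(child, counts)
    return counts


counts = count_nodes(minidom.parseString(CUSTOMER), Counter())
print(dict(counts))
```

The tally shows one document node above the root element, which is the node that exists only in the parsed form and not in the textual document.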
You might wonder why element values reside in separate text nodes while attribute values do not. The main reason is that the child nodes of an element can be a mix of text nodes and other element nodes, which is known as mixed content. An attribute, however, has exactly one value and never any child nodes, which makes attributes less extensible than elements. An element can have multiple text node children but they cannot be adjacent siblings to each other. As an example of mixed content and multiple text node children, consider the following two XML documents, both of which contain a title element. In the first case the title has a single text value and the corresponding tree representation is shown in Figure 3.3(a). The title element in the second document contains some text, “The ” and “ Cookbook” (note the spaces!), as well as a child element bold.
(a) <title>The DB2 pureXML Cookbook</title>
(b) <title>The <bold>DB2 pureXML</bold> Cookbook</title>
Figure 3.3(b) shows that this results in a mixed set of child nodes under the title element: two text nodes and one child element (bold). The two text nodes “The ” and “ Cookbook” are separated by the element bold and are not adjacent children. If they were adjacent they would automatically collapse into a single text node.
(a) title
      The DB2 pureXML Cookbook
(b) title
      The
      bold
        DB2 pureXML
      Cookbook
Figure 3.3 An example of mixed content
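The child nodes of the two title elements can be listed with a brief Python sketch, again using xml.dom.minidom from the standard library as a stand-in for the XQuery Data Model. The two documents are the ones described in the text; the helper name is hypothetical.

```python
from xml.dom import minidom, Node

TITLE_A = "<title>The DB2 pureXML Cookbook</title>"
TITLE_B = "<title>The <bold>DB2 pureXML</bold> Cookbook</title>"


def describe_children(xml_text):
    """List the direct children of the root element as (kind, content)."""
    root = minidom.parseString(xml_text).documentElement
    children = []
    for child in root.childNodes:
        if child.nodeType == Node.TEXT_NODE:
            children.append(("text", child.data))
        elif child.nodeType == Node.ELEMENT_NODE:
            children.append(("element", child.tagName))
    return children


print(describe_children(TITLE_A))
print(describe_children(TITLE_B))
```

For the second document the list contains a text node, the bold element, and another text node, with the spaces preserved inside the text nodes.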
Note that the XQuery Data Model defines the value of an XML element as the concatenation of all text nodes in the subtree under that element. This concatenation is trivial for elements that have only one text node. The value of the element state in Figure 3.2 is “CA”, and the value of title in Figure 3.3(a) is “The DB2 pureXML Cookbook”. At the same time, the value of the title element in Figure 3.3(b) is also “The DB2 pureXML Cookbook”, and the value of the element bold is “DB2 pureXML”. Similarly, the value of the addr element in Figure 3.2 is “555 Bailey AveSan JoseCA95141” (note that there is no space between Ave and San and also no space between Jose and CA and 95141). The addr element is called a non-leaf element, and this example shows that values of non-leaf elements are often not useful.
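The concatenation rule can be tried out with ElementTree from the Python standard library. The sketch below assumes that the documents contain no whitespace between tags, as in the figures; the function name string_value is chosen for this example.

```python
import xml.etree.ElementTree as ET


def string_value(element):
    """Concatenate all text nodes in the subtree, in document order."""
    return "".join(element.itertext())


title = ET.fromstring("<title>The <bold>DB2 pureXML</bold> Cookbook</title>")
addr = ET.fromstring(
    '<addr country="US"><street>555 Bailey Ave</street><city>San Jose</city>'
    "<state>CA</state><zip>95141</zip></addr>"
)

print(string_value(title))
print(string_value(title.find("bold")))
print(string_value(addr))
```

The attribute country does not contribute to the value of addr, because attribute values are not text nodes in the subtree.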
3.2
UNDERSTANDING PUREXML STORAGE
The document tree in Figure 3.2 illustrates the hierarchical format in which XML documents are stored in DB2 (all platforms). When an XML document in its textual format is inserted or loaded into an XML column, the DB2 server parses the XML document to produce the parsed hierarchical format that is stored on pages in a table space. This process is reversed when an application retrieves an XML document from DB2. This reverse process is called serialization; that is, the document tree is converted back into the text format of the XML document. You can think of parsing and serialization as inverse operations.
The exact shape of a document tree in DB2’s storage layer depends on and can vary with each individual instance document. It is not pre-defined based on an XML Schema, which allows DB2 to store documents with widely varying or evolving structures in the same XML column. DB2 performs a variety of optimizations when storing document trees on pages. For example, element and attribute names (also called tag names) are transparently replaced by unique 4-byte integer numbers. Thus, DB2’s internal tree format actually looks more like Figure 3.4 than Figure 3.2. In addition to the integer number, each node can also contain other properties, such as information about namespaces and data types.
(document node)
  100
    101
      Jim Noodle
    102
      103=US
      104
        555 Bailey Ave
      109
        San Jose
      106
        CA
      113
        95141
    116
      110=work
      408-289-4136
    116
      110=cell
      408-710-7910
Figure 3.4 XML document tree with tag names replaced by integer values
The mapping from tag names to the so-called stringIDs is kept in the catalog table sysibm.sysxmlstrings (see Figure 3.5). This mapping is database-wide, where each distinct tag name and each distinct namespace URI has exactly one entry. For example, the phone element occurs twice in the sample document and may occur millions of times across all the XML documents in a database. Each occurrence is replaced with the same unique stringID, which is 116 in this example. Hence, the phone element has only one entry in the mapping table. Consequently, the mapping table is never larger than the number of distinct tag names in the database, which is typically a small number (several hundred to several thousand).
STRING        STRINGID  IS_TEMPORARY
customerinfo  100       N
name          101       N
addr          102       N
country       103       N
street        104       N
city          109       N
state         106       N
zip           113       N
phone         116       N
type          110       N
…             …         …
Figure 3.5 Mapping tag names to integers in sysibm.sysxmlstrings
When a document is inserted and parsed, DB2 checks every tag name to see whether it is already recorded in this mapping table. If it is not, a new entry is added to the mapping table. Otherwise the existing stringID for the tag is used. Hence, inserts into the mapping table are very rare and occur only for new elements that DB2 has never seen before in a given database. For example, if you insert a million documents of similar structure, there is a good chance that only the first document, or the first few documents, actually cause inserts into the sysibm.sysxmlstrings catalog table. Most of the time the mapping table is active as a lookup table and DB2 has a special purpose mechanism and cache to ensure high lookup performance. DB2’s use of the mapping table leads to significant performance benefits. First of all, it reduces the space that is required to represent XML on pages in table spaces or buffer pools. Second, any query evaluation and traversal of XML documents now operate on integers, not on strings, which is much faster. Since the sysibm.sysxmlstrings table never grows very large, DB2 never deletes or updates any entries in this table. This avoids lock contention on this table and enables high performance. Even REORG or LOAD REPLACE of a user table does not reset the mapping table. Remember that the mapping table contains entries for XML documents in the entire database, and not just for XML documents in a single table. Excessive growth of the mapping table is not a concern, because XML applications do not use an unbounded number of distinct tag names.
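The lookup-or-insert behavior of the mapping table can be modeled with a small Python sketch. It is a toy model, not DB2's implementation: the class StringTable, the sequential numbering starting at 100, and the nested-tuple tree format are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


class StringTable:
    """Toy model of a database-wide mapping from tag names to integer IDs."""

    def __init__(self, first_id=100):
        self._ids = {}
        self._next_id = first_id
        self.inserts = 0  # how often a previously unseen name was added

    def string_id(self, name):
        if name not in self._ids:      # rare path: first time this name is seen
            self._ids[name] = self._next_id
            self._next_id += 1
            self.inserts += 1
        return self._ids[name]         # common path: plain lookup


def encode(element, table):
    """Return (tag_id, {attr_id: value}, text, children) for an element."""
    return (
        table.string_id(element.tag),
        {table.string_id(k): v for k, v in element.attrib.items()},
        element.text,
        [encode(child, table) for child in element],
    )


DOC = (
    "<customerinfo><name>Jim Noodle</name>"
    '<phone type="work">408-289-4136</phone>'
    '<phone type="cell">408-710-7910</phone></customerinfo>'
)

table = StringTable()
tree = encode(ET.fromstring(DOC), table)
inserts_after_first = table.inserts
encode(ET.fromstring(DOC), table)      # a second document of the same shape
inserts_after_second = table.inserts
print(tree[0], inserts_after_first, inserts_after_second)
```

Only the first document adds entries to the table. The second one is encoded entirely through lookups, and both phone elements share a single ID.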
The mapping table is really only for DB2’s internal operation and you cannot modify it. You can, however, read this table if you want to get a list of all tag names that ever existed in the database (Figure 3.6). Since version 9.5, DB2 for Linux, UNIX, and Windows stores the tags in a binary format to avoid code page problems in non-Unicode databases. Therefore you need to use the function xmlbit2char to make the strings human-readable.

-- DB2 for z/OS and DB2 9 for Linux, UNIX, Windows:
SELECT * FROM sysibm.sysxmlstrings;

-- DB2 for Linux, UNIX, and Windows, Version 9.5 and higher:
SELECT stringid, substr(sysibm.xmlbit2char(string),1,50), is_temporary
FROM sysibm.sysxmlstrings;
Figure 3.6
Reading XML tag names from sysibm.sysxmlstrings
The column IS_TEMPORARY in sysibm.sysxmlstrings only exists in DB2 for Linux, UNIX, and Windows. It indicates whether a tag name belongs to a document that is stored in an XML column (IS_TEMPORARY = 'N') or to an element or attribute that has been newly constructed as part of a query (IS_TEMPORARY = 'Y'). For example, a query that creates and returns a new element name that has never been seen in the database before also causes a new entry in the string table. However, this happens only upon its very first execution, after which the new tag is registered and known. You cannot delete or update entries in this catalog table.
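As a small illustration of the IS_TEMPORARY column, the following sketch lists only those names that were registered by element or attribute construction in queries. It assumes DB2 for Linux, UNIX, and Windows 9.5 or higher and reuses the xmlbit2char conversion from Figure 3.6; the column alias tagname is chosen here for readability only.

```sql
-- Names that entered the string table through constructed elements
-- or attributes rather than through stored documents
SELECT stringid,
       SUBSTR(sysibm.xmlbit2char(string), 1, 50) AS tagname
FROM sysibm.sysxmlstrings
WHERE is_temporary = 'Y'
ORDER BY stringid;
```

An empty result would simply mean that no query has constructed a name that was not already known from stored documents.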
3.3
XML STORAGE IN DB2 FOR LINUX, UNIX, AND WINDOWS
This and the following sections describe storage objects, such as tables and table spaces, for XML data in DB2 for Linux, UNIX, and Windows. DB2 for z/OS uses similar but slightly different concepts, which are discussed in sections 3.11 through 3.12.
3.3.1
Storage Objects for XML Data
Whenever you define a table, DB2 creates one or multiple storage objects in a table space. For example, a relational table structure is stored in a DAT (data) object. Any kind of index is stored as an INX object. If your table contains a LOB column, DB2 creates a separate LOB object. And, if your table contains one or multiple XML columns, there is an XDA (XML data area) object. For SMS (system-managed space) table spaces, these objects appear as separate files in the file system. For DMS (database-managed space) table spaces, which are the default and recommended, these objects are not visible but nevertheless exist in the DMS containers.
WHAT IS A TABLE SPACE? A table space is a storage structure that can contain relational tables and indexes as well as large objects (LOBs) and XML data. Table spaces enable you to specify where your data is physically stored. They also allow you to assign different types of data to different buffer pools in main memory, or to back up and restore specific parts of your database.
Let’s look at this CREATE TABLE command as an example (note that no XML Schema is required to define a table with a column of type XML. DB2’s XML storage is independent of any particular XML Schema):

CREATE TABLE customer (id INTEGER, info XML)
The storage objects that DB2 creates and maintains for this table are illustrated in Figure 3.7. The table with two columns is maintained in a DAT object. The XML column in this table does not contain the actual XML documents that are inserted, but just logical pointers to them. The reason is that XML documents can easily be too big to fit into a relational row on a single page. This approach is similar to the storage of large objects (LOBs) in DB2. The main difference between XML and LOBs is that XML is buffered in the buffer pool whereas LOBs are not.

By default, XML documents are stored in the XDA object. If a table has multiple XML columns, all of them share the same XDA object. Whenever a document tree does not fit on a single page, DB2 automatically and transparently breaks the tree into multiple subtrees, which are called regions. Each region is then stored on a separate XDA page so a single document can span many pages. Documents that fit on a single page consist of a single region. If documents are much smaller than the page size, multiple regions (documents) can be stored on a single page so that no space is wasted. DB2 allows you to store XML documents up to 2GB in size, which is large enough for just about every application.

One regions index is created automatically by DB2 for each table that contains one or more XML columns. In the catalog view syscat.indexes, every regions index is identified by the value XRGN in the column INDEXTYPE. It is not a user-defined index and you cannot drop it. The regions index contains one entry for each region of a document. If a document consists of multiple regions, then these regions are represented by consecutive regions index entries. An XML document pointer in the XML column in the DAT object points to a regions index entry that in turn points to the “first” region of the corresponding document. This is the region that contains the root node of the document. A short range scan on the regions index then provides pointers to the remaining regions of the document. If a node A in a region has a child node B that is the topmost node of another region, node A contains information that points back into the regions index (not shown in Figure 3.7). It points to the regions index entry that leads to the region with node B.

Also not shown in Figure 3.7 is that DB2 maintains a path index for every XML column. It contains one entry per unique path in the XML data and is therefore very small. More details on the path index can be found in Chapter 13, Defining and Using XML Indexes.
[Figure: a table space containing the DAT object (columns ID (INT) and INFO (XML); rows 1001, 1000, 1003, 1005), the INX object (regions index pages), and the XDA object (pages)]
Figure 3.7
Storage objects involved with an XML column
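To see the system-generated XML indexes for the customer table, a catalog query along the following lines can be used. The value XRGN is the one named in the text. The value XPTH for the path index is not mentioned in this section and is stated here from the DB2 catalog documentation, so treat it as an assumption to verify on your release.

```sql
-- System-generated XML indexes of the CUSTOMER table:
-- XRGN = regions index, XPTH = path index (assumed value, see lead-in)
SELECT indname, indextype
FROM syscat.indexes
WHERE tabname = 'CUSTOMER'
  AND indextype IN ('XRGN', 'XPTH');
```

If several schemas contain a table named CUSTOMER, adding a predicate on TABSCHEMA narrows the result.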
Storing large documents as regions across pages has several advantages. First and foremost, DB2’s proven infrastructure for managing pages works for XML data just like for relational data. This includes table spaces, buffer pools, page cleaning, backup and restore, recovery, HADR, and so on. If a document is large and spans many XDA pages and a query touches only part of the document, DB2 does not necessarily need to bring all pages of the document into the buffer pool. DB2 always strives to split a document into the smallest possible number of regions. The regions for one document are in most cases stored on physically consecutive pages. The way XML documents are broken into regions is completely transparent to the application and to the DBA. You should never attempt to design XML documents with the goal of optimizing any aspect of how DB2 stores the documents. You should model your XML data at the logical level to reflect your business data and focus on the characteristics and requirements of your application, not on how DB2 processes XML. Most applications are best served with large numbers of small documents, where each XML document represents a separate business object.
3.3.2
Defining Columns, Tables, and Table Spaces for XML Data
In DB2 for Linux, UNIX, and Windows, database-managed table spaces (DMS) provide higher performance than system-managed table spaces (SMS) for relational data, and even more so for XML read and write access. Since DB2 9, newly created table spaces are DMS by default. It is also recommended to use DMS table spaces with automatic storage so that they grow as needed without manual intervention.

A key aspect of physical database design is the page size of a table space. Measurements have shown that the lower the number of regions (splits) per XML document the better the performance, especially for XML insert and full-document retrieval. If a document does not fit on a single page, the number of splits per document depends on the page size (4KB, 8KB, 16KB, or 32KB). The larger the page size of the table space the lower the number of regions per document. For example, let’s say a given document gets split into forty regions across forty 4KB pages. Then it might be possible to store the same document on only twenty 8KB pages, or ten 16KB, or five 32KB pages. If the XML documents are significantly smaller than the selected page size, no space is wasted because multiple small documents can be stored on a single page. The impact of the page size on the number of regions per document is illustrated in Figure 3.8. Since each region requires one regions index entry, a larger page size that allows for fewer regions per document also leads to a smaller regions index.

[Figure: the same document stored on 4K pages, on 8K pages, … and on 32K pages]
Figure 3.8
The number of regions per document depends on the page size
NOTE
Most XML applications perform best using 16KB or 32KB pages.
16KB pages can provide good performance if most documents are quite small (for example, less than 4KB) so that several documents fit on a page. Larger documents are better served by 32KB pages. If you prefer to use a single page size for XML and relational data, or for data and indexes, and you find that 32KB pages are too large for efficient access to relational data or indexes, then 16KB pages can be a good compromise.

Let’s look at some examples. Figure 3.9 shows how to define two table spaces, one with 4KB pages and one with 32KB pages. These table spaces are used in the subsequent CREATE TABLE statements and figures.

CREATE BUFFERPOOL bpsmall PAGESIZE 4K;
CREATE BUFFERPOOL bplarge PAGESIZE 32K;

CREATE TABLESPACE tbspace4k PAGESIZE 4K
  MANAGED BY AUTOMATIC STORAGE BUFFERPOOL bpsmall;
CREATE TABLESPACE tbspace32k PAGESIZE 32K
  MANAGED BY AUTOMATIC STORAGE BUFFERPOOL bplarge;
Figure 3.9
Creating table spaces with different page sizes
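After creating the table spaces, you may want to confirm the page size and buffer pool that each one uses. The following sketch reads the catalog views syscat.tablespaces and syscat.bufferpools, which are not discussed in this section, and assumes the object names from Figure 3.9, which DB2 stores in uppercase.

```sql
-- Page size and buffer pool of the table spaces from Figure 3.9
SELECT t.tbspace, t.pagesize, b.bpname
FROM syscat.tablespaces t
JOIN syscat.bufferpools b
  ON b.bufferpoolid = t.bufferpoolid
WHERE t.tbspace IN ('TBSPACE4K', 'TBSPACE32K');
```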
The CREATE TABLE statement shown in Figure 3.10 defines a table with an integer column and an XML column using the table space with 32KB pages. It places XML data and relational data into the same table space (see Figure 3.7). Consequently, they use the same page size and are buffered in the same buffer pool. This default layout provides good performance for most applications.

CREATE TABLE customer(id INTEGER, info XML) IN tbspace32k;
Figure 3.10
Creating a table with an XML column in a named table space
If you have done a performance analysis and find that you need a large page size for XML data but a small page size for relational data or indexes, you can use separate table spaces to achieve this. When you define a table, you can direct “long” data (LOB and XML data) into a separate table space with a different page size. The corresponding table definition and storage objects are shown in Figure 3.11 and Figure 3.12, respectively. In this example, relational data is stored in a table space tbspace4k with page size 4KB and XML data is stored in a table space tbspace32k with page size 32KB. If the table also contained a LOB column, the LOB data would be stored in a separate LOB object in the table space tbspace32k. Pages of the LOB object are not buffered in the buffer pool, whereas pages of the DAT, XDA, and INX objects are buffered.

CREATE TABLE customer(id INTEGER, info XML)
  IN tbspace4k LONG IN tbspace32k;
Figure 3.11
Storing XML and LOBs in a separate table space
[Figure: table space tbspace4k holds the DAT object (columns ID (INT) and INFO (XML); rows 1001, 1000, 1003, 1005) and the INX object (regions index pages); table space tbspace32k holds the pages of the XDA object]
Figure 3.12
Storage objects in separate table spaces
If you had another table space named tbspace4kINX you could also direct the regions index as well as any user-defined indexes into their own table space. This layout is shown in Figure 3.13 and Figure 3.14.
CREATE TABLE customer(id INTEGER, info XML) IN tbspace4k INDEX IN tbspace4kINX LONG IN tbspace32k;
Figure 3.13
Defining separate storage for indexes and XML data
[Figure: table space tbspace4k holds the DAT object (columns ID (INT) and INFO (XML); rows 1001, 1000, 1003, 1005); table space tbspace4kINX holds the INX object (regions index pages); table space tbspace32k holds the pages of the XDA object]
Figure 3.14
Separate table spaces for relational data, XML, and indexes
In general, the fewer distinct page sizes and buffer pools you create the easier it is to tune and maintain your database. Therefore we recommend that you use different page sizes for XML and relational data only if you have evidence that it improves the performance of your workload and if you need this performance gain to meet the performance requirements of your application. Otherwise there is benefit in keeping it simple. Dedicated measurements in a prototype and test workload can help you make such decisions. Since DB2 9, new table spaces are by default large table spaces, in which the number of rows per page is no longer limited to 255. Hence, you don’t need to choose a small page size for relational data to ensure that pages are filled up and space isn’t wasted.
3.3.3
Dropping XML Columns
In DB2 9.1 and DB2 9.5 for Linux, UNIX, and Windows you cannot drop XML columns from a table. To remove an XML column, create a new table without the XML column and use a “load from cursor” to move data from the old table to the new table. Then drop the old table and rename the new table so that it assumes the name of the old table. Alternatively, you can export data from a table and then recreate and reload the table. DB2 9.7 for Linux, UNIX, and Windows allows you to drop XML columns from a table with the ALTER TABLE statement. If a table contains multiple XML columns you can only drop all XML columns at the same time.
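The two approaches can be sketched as follows for the customer table used in this chapter. The names customer_new and c1 are hypothetical, the DECLARE CURSOR and LOAD lines are meant to be issued from the DB2 command line processor in one session, and the REORG after the ALTER TABLE is an assumption based on the general behavior of DROP COLUMN, which usually leaves a table in reorg-pending state. Use one alternative or the other, not both.

```sql
-- Alternative 1 (DB2 9.1 and 9.5): rebuild the table without the XML column
CREATE TABLE customer_new (id INTEGER) IN tbspace4k;
DECLARE c1 CURSOR FOR SELECT id FROM customer;
LOAD FROM c1 OF CURSOR INSERT INTO customer_new;
DROP TABLE customer;
RENAME TABLE customer_new TO customer;

-- Alternative 2 (DB2 9.7 and higher): drop the XML column in place
ALTER TABLE customer DROP COLUMN info;
REORG TABLE customer;
```

With the first alternative, indexes, constraints, and privileges of the old table have to be recreated on the new table separately.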
3.3.4
Improved XML Storage Format in DB2 9.7
DB2 9.7 uses a more optimized tree format for XML storage than prior releases. This improved format is completely transparent to all database operations such as queries, inserts, updates, indexing, and schema validation. The improved XML format is used only in tables that are created in DB2 9.7 or higher. When you migrate a table with XML data from DB2 9 or 9.5 to DB2 9.7, this XML data remains in its previous format and is not changed. Documents that you newly insert or update in such a migrated table continue to be in the format of the previous DB2 release. The previous and the improved storage format are not mixed within the XDA object of a table.

The new storage format has the following benefits:
• It is more compact and can reduce the space consumption of your XML data.
• It allows compression of XML data in the XDA object (see section 3.5).
• It allows you to use the function ADMIN_EST_INLINE_LENGTH to estimate the inline length that would allow an XML document to be inlined (see section 3.4).
• It enables faster redistribution of XML data in a partitioned database; that is, you can use the NOT ROLLFORWARD RECOVERABLE option in the REDISTRIBUTE command to redistribute data in bulk and avoid logging.

If you have migrated a table with XML data from DB2 9 or 9.5 to DB2 9.7 and want to bring the XML data into the new format, you need to create a new table and copy the data from the old to the new table. You can use “load from cursor” for moving data from one table to another efficiently. Then you can drop the old table and rename the new table to the old table name.

Starting with DB2 9.7, copying and renaming a table can be done more elegantly and with minimal downtime by using the procedure SYSPROC.ADMIN_MOVE_TABLE. This procedure performs an online table move, which means that table data is copied to a table object with the same name, but not necessarily the same columns and storage characteristics. When the copying is complete, the source table is briefly taken offline and its name is assigned to the new copy of the table. All indexes of the table are also copied. During the copy phase, any updates, inserts, or deletes on the source table are collected in a staging table and finally applied to the new table. An online table move with XML data requires that the table has at least one unique index and does not participate in foreign key constraints.
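A minimal sketch of such an online move is shown below. It assumes the eleven-parameter form of the procedure as documented for DB2 9.7, a hypothetical schema name MYSCHEMA, and that empty strings leave the table spaces, partitioning, and column definitions unchanged. Check the parameter list against the documentation of your DB2 release before using it.

```sql
-- Move MYSCHEMA.CUSTOMER onto itself so that its XML data is rewritten
-- in the DB2 9.7 storage format (schema name is hypothetical)
CALL SYSPROC.ADMIN_MOVE_TABLE(
  'MYSCHEMA',    -- table schema
  'CUSTOMER',    -- table name
  '', '', '',    -- data, index, and long table space: unchanged
  '', '', '',    -- MDC columns, partitioning key, data partitions: unchanged
  '',            -- column definitions: unchanged
  '',            -- options
  'MOVE');       -- run all phases of the move in one call
```

To also change the storage layout during the move, the third to fifth arguments could name target table spaces such as tbspace4k, tbspace4kINX, and tbspace32k.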
3.4
USING XML BASE TABLE ROW STORAGE (INLINING)
From DB2 9.5 for Linux, UNIX, and Windows onwards, XML documents that are small enough to fit on a single page can be stored on the same page as the relational row that they belong to. This capability is called base table row storage, or inlining. It means that the tree structure of an XML document is no longer stored on a separate XDA page, but next to the relational data inside the DAT object in the table space (Figure 3.16). XML inlining is currently not available in DB2 for z/OS.

Inlining needs to be explicitly enabled as a column option because it may or may not provide performance benefits. Before we discuss the performance trade-offs, Figure 3.15 shows how to create a table with inlined XML storage. You can add a column option INLINE LENGTH to the definition of an XML column. In this example, any XML document that can be stored within 30,000 bytes is inlined. Documents that require more than 30,000 bytes are stored in the regular way (on separate XDA pages). The inlining of some or all documents is handled by the DB2 engine and completely transparent to the application.

DB2’s decision about whether a given document is within the inline length is based on the size of the document in DB2’s internal tree format, after XML parsing. The decision is not based on the length of the textual (serialized) representation of the XML document. Inlined documents can be compressed, but the inlining decision is based on their space requirement prior to compression.

CREATE TABLE customer(id INTEGER, info XML INLINE LENGTH 30000)
  IN tbspace32k;
Figure 3.15
Table definition with inlined XML storage
The maximum allowed value for the inline length depends on the page size of the table space. As a rule of thumb, the inline length has to be less than the page size minus the total length of the other columns in the table and the overhead for the page header, and so on. For example, the maximum possible inline length in the example in Figure 3.15, where the table also contains an integer column and uses 32KB pages, is 32667 bytes.

If an XML document is updated it might become larger or smaller as a result of the update, which affects inlining. The update may cause a previously inlined document to be moved from the DAT object to the XDA object, or vice versa.

Figure 3.16 illustrates the storage objects in the table space when XML inlining is used. Three of the four documents meet the inline length and are now stored as part of the relational rows on pages in the DAT object. They do not have regions index entries. The document that belongs to the second row (id = 1000) is too large to be inlined. It is stored in the XDA object and spans three pages, which are linked from the row in the DAT object via the regions index. Note that inlining makes the DAT object larger, with larger and fewer rows per page. The XDA object has become smaller and the regions index has fewer entries than without inlining.
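[Figure: table space tbspace32k with the DAT object (columns ID (INT) and INFO (XML); rows 1001, 1000, 1003, 1005, with inlined documents stored on the DAT pages), the INX object (regions index pages), and the XDA object (pages of the document for row 1000)]
Figure 3.16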
Storage objects with XML inlining
The CREATE TABLE statement in Figure 3.17 creates the customer table in table space tbspace4k, allows documents up to 3500 bytes to be inlined, and automatically directs larger documents to the table space tbspace32k. In this case the inlining takes precedence over the LONG IN clause. If a document is small enough to be inlined it will be part of the base table row and stored on DAT pages in tbspace4k. Otherwise it is stored on XDA pages in tbspace32k.

CREATE TABLE customer(id INTEGER, info XML INLINE LENGTH 3500)
  IN tbspace4k LONG IN tbspace32k;
Figure 3.17
Another table definition with inlined XML storage
The inline length of an XML column can be changed with an ALTER TABLE statement, as shown in Figure 3.18. This allows you to increase the inline length of an XML column, or to enable inlining for an XML column that wasn’t previously defined with inlining.
ALTER TABLE customer ALTER COLUMN info SET INLINE LENGTH 3600;
Figure 3.18
Changing the inline length of an XML column
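To confirm the inline length that is currently defined for the column, you can look it up in the catalog. This sketch assumes that the catalog view syscat.columns exposes the value in a column named INLINE_LENGTH, which is not stated in this section.

```sql
-- Inline length currently defined for the INFO column of CUSTOMER
SELECT colname, typename, inline_length
FROM syscat.columns
WHERE tabname = 'CUSTOMER'
  AND colname = 'INFO';
```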
The ALTER TABLE operation does not affect existing documents in the table, only documents that are inserted, loaded, or updated after the ALTER TABLE statement has been issued. If you want existing documents to obey the newly set inline length, you need to update them with themselves, as shown in Figure 3.19. Be aware that a bulk update of many XML documents can require a lot of log space. You might have to perform a series of smaller updates and commit frequently to avoid running out of log space.

After you use an UPDATE statement to move XML data from the XDA object to the DAT object, you might want to reorganize the table to reclaim the freed-up space in the XDA object (see section 3.7). However, reorganization by itself does not move XML data from the XDA object to the DAT object.

UPDATE customer SET info = info;
Figure 3.19
Updating existing documents to apply inlining
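The advice to update in smaller units and commit frequently could look like the following sketch. The id ranges are hypothetical. Choose ranges, or any other predicate, so that each batch fits comfortably into your log space.

```sql
-- Apply the new inline length in batches to limit log consumption
UPDATE customer SET info = info WHERE id BETWEEN 1000 AND 1999;
COMMIT;

UPDATE customer SET info = info WHERE id BETWEEN 2000 AND 2999;
COMMIT;
```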
After you have specified an inline length for an XML column, you can only increase the inline length, not reduce it. The only way to “undo” the inlining of XML documents is to copy the documents into a new table without inlining, drop the old table, and rename the new table to the old table name. Starting with DB2 9.7 you can also do this copying with the procedure SYSPROC.ADMIN_MOVE_TABLE.
3.4.1
Monitoring and Configuring XML Inlining
After you have set the inline length for an XML column, any newly inserted or updated document is inlined if DB2’s internal tree representation of the document fits within the specified inline length. The size of an XML document in DB2’s internal tree format depends on the actual document characteristics, such as the length of element names, the length of element values, the presence of namespaces, and other factors. In particular, the space required to store a document in an XML column might be less than or greater than the size of the document in its textual representation. In DB2 9.5 and higher, the space requirement of most XML documents is between 70% and 150% of the space that they occupy in the file system. Therefore predicting whether a particular document will or will not be inlined can be difficult. Similarly, choosing an inline length that allows inlined storage of all or most documents can also be difficult. To address this problem, DB2 9.7 for Linux, UNIX, and Windows introduced the scalar functions ADMIN_IS_INLINED and ADMIN_EST_INLINE_LENGTH.
The function ADMIN_IS_INLINED takes an XML column name as input, and returns
• 1 if the document in the current row of the XML column is inlined.
• 0 if the document in the current row of the XML column is not inlined.
• NULL if the XML column of the current row is NULL.

The query in Figure 3.20 shows how the function ADMIN_IS_INLINED can be used to examine a table with inlining, like the one defined previously in Figure 3.17. The query reveals for every document in the table whether or not it is inlined. The output indicates that the documents with the relational id values 1000 and 1002 are inlined while the other documents are not inlined.

SELECT id, ADMIN_IS_INLINED(info) AS inlined FROM customer;

ID          INLINED
----------- -----------
       1000           1
       1001           0
       1002           1
       1003           0
       1004           0
       1005           0

  6 record(s) selected.
Figure 3.20
Determining which documents are inlined
Since the query in Figure 3.20 can produce a lot of output when applied to a large table, you may want to add a WHERE clause to retrieve the inlining status only for a subset of documents. Figure 3.21 uses the ADMIN_IS_INLINED function to compute the number of documents that are inlined as well as the number of those that are not. The subselect in Figure 3.21 uses the clause FETCH FIRST 1000 ROWS ONLY to obtain inlining information based on at most 1,000 documents. This can be useful if the input table is large and you want to use the first 1,000 documents as a representative sample rather than scanning the entire table. Alternatively, you could use the keywords TABLESAMPLE BERNOULLI(n) in the FROM clause of the subselect to sample n% of all rows in the table.
SELECT COUNT(*) AS doc_count,
       CASE WHEN inlined = 1 THEN 'Yes' ELSE 'No' END AS inlined
FROM (SELECT ADMIN_IS_INLINED(info) AS inlined
      FROM customer
      FETCH FIRST 1000 ROWS ONLY)
GROUP BY inlined;

DOC_COUNT   INLINED
----------- -------
          2 Yes
          4 No

  2 record(s) selected.
Figure 3.21
Obtaining the number of inlined documents
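The sampling alternative mentioned before Figure 3.21 could be written as follows. This is a sketch in which the sampling rate of 10 percent and the correlation name t are arbitrary choices. Because the sample is random, the counts differ from run to run.

```sql
-- Inlining statistics based on a random sample of about 10% of the rows
SELECT COUNT(*) AS doc_count,
       CASE WHEN inlined = 1 THEN 'Yes' ELSE 'No' END AS inlined
FROM (SELECT ADMIN_IS_INLINED(info) AS inlined
      FROM customer TABLESAMPLE BERNOULLI(10)) AS t
GROUP BY inlined;
```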
The result in Figure 3.21 shows that only two out of six examined documents are inlined. This raises the question of how much you would need to increase the inline length so that most or all of the documents can be inlined. Similarly, you might have a table with an XML column for which inlining is not yet enabled. You might wonder which inline length to use so that most or all of the documents in that column get inlined. The function ADMIN_EST_INLINE_LENGTH is designed to answer these questions.

The function ADMIN_EST_INLINE_LENGTH takes an XML column name as input, and returns
• The lowest inline length (in bytes) that would allow the XML document in the current row to be inlined. This is an estimated value.
• -1, if the document in the current row of the XML column is too large to be inlined for the given page size.
• -2, if the required inline length cannot be estimated for the document in the current row. This is the case for any documents that have been inserted and stored prior to DB2 9.7 because DB2 9.7 uses a more optimized XML storage format (see section 3.3.4).
• NULL, if the XML column of the current row is NULL.

Figure 3.22 shows sample output of the function ADMIN_EST_INLINE_LENGTH. The values returned depend on the actual XML data in the table. In this example, the output shows that the first document (relational id = 1000) is already inlined and its actual size in DB2’s internal format is 770 bytes. The second document (id = 1001) is not inlined, but it can be inlined if the inline length is increased to 2345 or larger. The document with id = 1005 cannot be inlined because it is too large to fit on a single page together with the other columns in the table.
SELECT id,
       ADMIN_IS_INLINED(info) AS inlined,
       ADMIN_EST_INLINE_LENGTH(info) AS inline_length
FROM customer;

ID          INLINED     INLINE_LENGTH
----------- ----------- -------------
       1000           1           770
       1001           0          2345
       1002           1           796
       1003           0          1489
       1004           0          1910
       1005           0            -1

  6 record(s) selected.
Figure 3.22
Examining the required inlined length for specific XML documents
For a proposed inline length, such as 1500 bytes, the query in Figure 3.23 tells you how many documents in the column would be inlined if this inline length was used.

SELECT COUNT(*) AS doc_count
FROM customer
WHERE ADMIN_EST_INLINE_LENGTH(info) BETWEEN 0 AND 1500;

DOC_COUNT
-----------
          3

  1 record(s) selected.
Figure 3.23
Estimating the effectiveness of a proposed inline length
Figure 3.24 gives an example of a more comprehensive report on the distribution of document sizes in a table. It shows that two documents require no more than 1000 bytes each, four documents can be stored in at most 2000 bytes each, five fit into 3000 bytes each, no potentially “inlinable” document is larger than 3000 bytes, and one document is too big to be inlined.
SELECT SUM(a) AS "…
[…]
SELECT XMLCAST(XMLQUERY('$BOOKINFO/bookstore/book/title') AS VARCHAR(35)) as title
FROM shelf;

TITLE
-----------------------------------
Helen's story about foxes & rabbits

  1 record(s) selected.

Figure 4.23
Retrieving the title as SQL type VARCHAR
4.7
UNDERSTANDING XML WHITESPACE AND DOCUMENT STORAGE
Most XML documents contain whitespace, and its purpose is typically to improve readability. According to the XML standard, whitespace is any of the following characters and their respective Unicode code points.
• space character (0x20)
• CR, carriage return (0x0D)
• LF, line feed (0x0A)
• tab (0x09)

The XML standard mandates that XML parsers must remove or replace any CR characters (0x0D) that appear in an XML document. Any two-character sequence CR LF is replaced by a single LF, and any CR character that is not followed by LF is also converted to a single LF.

Whitespace can occur at various places in an XML document. For example, the simple document in Figure 4.24 contains whitespace in the following locations:
• Between the element name “a” and the attribute “x”
• On both sides of the “=” character that belongs to the attribute “x”
• Within the double quotes that enclose the value of the attribute “x”
• Between the start tag of element “a” and the start tag of element “b”
• Trailing whitespace within the start and end tag of element “b” and within the end tag of element “a”
• Between the start and end tag of element “b”
• Between the end tag of element “b” and the start tag of element “c”
• Inside the text value of element “c”
• Between the end tag of element “c” and the end tag of element “a”
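Inserting and Retrieving XML Data
[Figure: sample XML document with the elements “a”, “b”, and “c” and the attribute “x”; of the document text only the value 2 of element “c” survives]
Figure 4.24
A sample document with whitespace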
The location of the whitespace matters. Depending on where a whitespace character occurs it is considered one of four types of whitespace:
• Insignificant whitespace (trailing spaces in element or attribute names, spaces around the equality [=] symbol of an attribute, and others)
• Significant whitespace (within attribute and element values)
• Boundary whitespace (between one tag and the next, if no other characters occur there)
• Known whitespace (a single whitespace that precedes an attribute name)

Figure 4.25 shows the same XML document as in Figure 4.24 and identifies the four types of whitespace. Note that the whitespace between the start and end tag of element “b” is considered boundary whitespace and not significant whitespace, because there are no other non-whitespace characters in the text value of element “b”. The whitespace in the text value of element “c” is significant, because there is another non-whitespace character (“2”) adjacent to this whitespace.

[Figure: the sample document from Figure 4.24 with its whitespace labeled as significant, known, insignificant, or boundary]
Figure 4.25
Different types of whitespace
XML parsers always remove all insignificant whitespace, which is not specific to DB2 but required by the XML standard. The XML standard provides no option to preserve insignificant whitespace during XML parsing. On the other hand, significant whitespace is always preserved and there is no option to strip significant whitespace. Known whitespace is a single space (U+0020) that separates an attribute name from a preceding element name or attribute. Known whitespace is removed during XML parsing and not stored with the document. But, it gets reinjected during serialization when you retrieve the XML data in text format. Boundary whitespace can be preserved or removed (stripped). Figure 4.26 shows two versions of the sample document from Figure 4.25. In the first version, all insignificant and boundary whitespace has been stripped from the document. In the second version, insignificant whitespace has been stripped but boundary whitespace has been preserved. In DB2, the default behavior is to strip boundary whitespace, but you can choose to preserve boundary whitespace, if desired.
-- Document with boundary whitespace stripped:
2
-- Document with boundary whitespace preserved:
2
Figure 4.26
Sample document with and without boundary whitespace preserved
NOTE: You can preserve boundary whitespace only if you insert or update documents without validation against an XML Schema. Validation always forces boundary whitespace to be stripped.
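The difference between boundary and significant whitespace can be tried out with a small script. The following is a minimal sketch that imitates the "strip boundary whitespace" behavior using Python's standard XML library; it is not how DB2 implements it. The sample document and the function name strip_boundary_whitespace are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The four characters that XML treats as whitespace.
XML_WS = " \t\r\n"


def strip_boundary_whitespace(elem):
    """Drop text that consists only of whitespace between two tags.

    Text that also contains other characters is left untouched, so
    significant whitespace survives.
    """
    if elem.text is not None and elem.text.strip(XML_WS) == "":
        elem.text = None
    for child in elem:
        strip_boundary_whitespace(child)
        if child.tail is not None and child.tail.strip(XML_WS) == "":
            child.tail = None


# Hypothetical sample document (not the one shown in Figure 4.25).
doc = "<a>\n  <b>   </b>\n  <c> 2 </c>\n</a>"
root = ET.fromstring(doc)
strip_boundary_whitespace(root)
stripped = ET.tostring(root, encoding="unicode")
print(stripped)
```

The indentation, the line breaks, and the spaces inside element b disappear, while the spaces around the 2 in element c are kept because they are adjacent to a non-whitespace character.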
4.7.1 Preserving XML Whitespace
DB2’s default behavior of stripping boundary whitespace is desirable because it saves space on disk and in memory. Additionally, whitespace is typically not meaningful for applications that consume XML data. Hence, this default is likely the right choice for your application. However, if you encounter a case where boundary whitespace has to be preserved, DB2 supports three ways to enable whitespace preservation. Ordered by their precedence, they are:
• The special attribute xml:space inside XML documents
• The explicit strip/preserve whitespace option in the XMLPARSE function
• Changing the DB2 default behavior from “strip” to “preserve” with the CURRENT IMPLICIT XMLPARSE OPTION (see section 4.7.2)
The XML standard defines the optional attribute xml:space that controls the stripping or preservation of whitespace. It can have the values preserve or default, where default means that whitespace is stripped. This attribute can be included in any element in an XML document. It affects the entire subtree under this element, unless it is overridden by other xml:space attributes at a deeper level of the document. If the xml:space attribute appears only in the root element of a document, then it affects all boundary whitespace in the entire document. Any xml:space attributes override any whitespace settings in the XMLPARSE function or the CURRENT IMPLICIT XMLPARSE OPTION.
The drawback of xml:space attributes is that they often do not occur in XML documents, and it can be time consuming to add them to every document before insertion into DB2. Also, when an xml:space attribute is in place, its effect can only be changed by removing or modifying the attribute in each document. Due to this lack of flexibility, it is recommended not to use xml:space attributes. Instead, use the explicit whitespace option in the XMLPARSE function or the CURRENT IMPLICIT XMLPARSE OPTION, which we explain later.
Chapter 4
Inserting and Retrieving XML Data
Let’s look at the four INSERT statements in Table 4.3 through Table 4.6. They all insert a document with whitespace such as indentation and line breaks. The right column in each table shows the document and its whitespace after it has been retrieved from DB2. Run these INSERT statements in the CLP with the -t and the -q options (db2 -t -q). The -t option sets the semicolon as the default statement terminator. The -q option ensures that the CLP, as an application program for DB2, does not remove new line characters or other whitespace when sending statements to the DB2 server.
The INSERT statement in Table 4.3 does not specify any whitespace option, which implies that all boundary whitespace is stripped. Since boundary whitespace includes line breaks, the document after retrieval is a continuous string without line breaks, spilling over multiple lines as needed. Note that significant whitespace in the title element has been preserved; that is, the spaces between the words This, is, a, space, and story.
Table 4.3 Inserting XML without Preserving Whitespace
INSERT statement:
Document after retrieval from DB2:
INSERT INTO shelf VALUES (10, ' 1851586666 This is a space story ')
1851586666This is a space story
The document that is inserted in Table 4.4 carries an xml:space attribute with the value preserve, which means that all boundary whitespace in this document is preserved. Hence, when you retrieve the document from DB2 all line breaks and indentation match the original document.
Table 4.4 Inserting an XML Document with xml:space Attribute
INSERT statement:
Document after retrieval from DB2:
INSERT INTO shelf VALUES (11, ' 1851586666 This is a space story ')
1851586666 This is a space story
The INSERT statement in Table 4.5 wraps the XMLPARSE function with the explicit PRESERVE WHITESPACE clause around the document, which also preserves all boundary whitespace.
Table 4.5 Inserting an XML Document with the XMLPARSE Function
INSERT statement:
Document after retrieval from DB2:
INSERT INTO shelf VALUES (12,XMLPARSE(DOCUMENT ' 1851586666 This is a space story ' PRESERVE WHITESPACE))
1851586666 This is a space story
The INSERT statement in Table 4.6 uses the XMLPARSE function with the STRIP WHITESPACE option, and the document also carries the xml:space attribute in the book element. The effect is that all boundary whitespace is stripped, except within the book element and its child elements. The line breaks and indentation within the book element have been preserved according to the xml:space attribute.
Table 4.6 Interaction between the XMLPARSE Function and xml:space Attribute
INSERT statement:
Document after retrieval from DB2:
INSERT INTO shelf VALUES (13,XMLPARSE(DOCUMENT ' 1851586666 This is a space story ' STRIP WHITESPACE))
1851586666 This is a space story
4.7.2 Changing the Whitespace Default from “Strip” to “Preserve”
If you always need to preserve boundary whitespace you might find it tedious to ensure that all applications always use the XMLPARSE function with the PRESERVE WHITESPACE option. In this case it is easier to change DB2’s default behavior from STRIP WHITESPACE to PRESERVE WHITESPACE and avoid using the XMLPARSE function. In DB2 for Linux, UNIX, and Windows, the default behavior is controlled by a DB2 special register called CURRENT IMPLICIT XMLPARSE OPTION. It enables you to specify the whitespace handling per session (connection). You can change the default in several ways:
• Use the following statement from an application or the DB2 CLP:
SET CURRENT IMPLICIT XMLPARSE OPTION = 'PRESERVE WHITESPACE'
• For CLI applications, add the following entry to the db2cli.ini file:
CurrentImplicitXMLParseOption = 'PRESERVE WHITESPACE'
You can edit this file manually, or issue the UPDATE CLI CONFIGURATION command:
UPDATE CLI CONFIGURATION FOR SECTION USING CurrentImplicitXMLParseOption '"PRESERVE WHITESPACE"'
• In CLI applications you can also use the function SQLSetConnectAttr() to set the connection attribute SQL_ATTR_CURRENT_IMPLICIT_XMLPARSE_OPTION. It can be set before or after establishing a connection.
Remember that the XMLPARSE function can always be used explicitly to override the default.
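The precedence among the three mechanisms (xml:space attribute, explicit XMLPARSE option, special register) can be summarized in a few lines of code. The following is a minimal sketch of the rules as described in this section, not DB2 code. The function name effective_modes, its parameters, and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Name under which Python's parser reports the xml:space attribute.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def effective_modes(xml_text, explicit_option=None, register_value=None):
    """Return (tag, mode) pairs in document order; mode is 'strip' or 'preserve'.

    explicit_option models the option given in XMLPARSE, register_value
    models CURRENT IMPLICIT XMLPARSE OPTION. An xml:space attribute
    overrides both for its element and the subtree below it.
    """
    if explicit_option is not None:
        base = explicit_option
    elif register_value is not None:
        base = register_value
    else:
        base = "strip"  # default behavior when nothing is specified

    result = []

    def walk(elem, inherited):
        attr = elem.get(XML_SPACE)
        if attr == "preserve":
            mode = "preserve"
        elif attr == "default":
            mode = "strip"  # per this section, default means stripping
        else:
            mode = inherited
        result.append((elem.tag, mode))
        for child in elem:
            walk(child, mode)

    walk(ET.fromstring(xml_text), base)
    return result


# Hypothetical document, loosely modeled on the scenario of Table 4.6.
doc = ('<shelf><book xml:space="preserve"><title>This is a space story</title>'
       '</book><magazine/></shelf>')
for tag, mode in effective_modes(doc, explicit_option="strip"):
    print(tag, mode)
```

With an explicit strip option, only the book element and its children keep their boundary whitespace, which matches the interaction shown in Table 4.6.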
4.7.3 Storing XML Documents for Compliance
Many applications have the requirement that once they store an XML document they can get “the same” document back. The key question is how the application defines “the same.” In many cases “the same” means that all element and attribute tags, all element and attribute values, all comments, processing instructions and namespaces, and all significant whitespace have to be returned in the same order and representation as in the original document. This notion of “the same” is sometimes also called Document Object Model fidelity. It means that the structure and data content of your XML documents are always preserved and reproducible, including digital signatures. DB2’s pureXML storage provides this fidelity.
Some applications may take their definition of “the same” one step further. They might require that any XML document that they retrieve from a database is 100% byte-for-byte identical to the one that was inserted, including all insignificant whitespace. To ensure that the documents are byte-for-byte identical you must avoid XML parsing, because the output from an XML parser does not always contain all bytes that were in the original document. This behavior is independent of database storage and inherent in how XML parsing is defined by the XML standard. For example, XML parsers are required by the XML standard to remove insignificant whitespace and normalize line endings. Otherwise they are not compliant.
If you require exact byte-for-byte retention of XML documents, then an XML column, which stores XML in a parsed format, should not be your only storage choice for the documents. You should store a second copy of each document in a BLOB or VARCHAR FOR BIT DATA column in the same row. The parsed XML storage allows efficient querying while the binary copy is for auditing or compliance purposes.
Note that character data types, such as CLOB or VARCHAR, do not guarantee that documents are stored without any byte modifications, because character data can be subject to code page conversion. Code page issues are explained in Chapter 20.
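The effect of parsing on byte-for-byte fidelity is easy to observe outside of any database. The following is a minimal sketch that parses and re-serializes a document with Python's standard XML library and keeps an untouched binary copy next to it. The function name make_row, the dictionary keys, and the sample document are assumptions for illustration; the SHA-256 digest is an optional addition and is not something the text above requires.

```python
import hashlib
import xml.etree.ElementTree as ET


def make_row(doc_bytes):
    """Pair the parsed form of a document with an untouched binary copy."""
    parsed = ET.tostring(ET.fromstring(doc_bytes))  # parse, then serialize again
    return {
        "parsed_xml": parsed,    # what a parsed XML storage would give back
        "raw_copy": doc_bytes,   # what a BLOB-style column would give back
        "sha256": hashlib.sha256(doc_bytes).hexdigest(),  # optional audit digest
    }


# Hypothetical document with insignificant whitespace and Windows line endings.
original = b'<a  x = "1" >\r\n  <b>text</b>\r\n</a>'
row = make_row(original)
print(row["parsed_xml"] == original)
print(row["raw_copy"] == original)
```

The parsed form no longer matches the original bytes because the parser drops the insignificant whitespace inside the start tag and normalizes the line endings, while the binary copy is unchanged.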
4.8 SUMMARY
The basic manipulation of XML documents in DB2 is easy. You can use the familiar SQL statements INSERT, SELECT, and DELETE to add, retrieve, and remove XML documents from an XML column in a DB2 table. UPDATE statements can replace or modify XML documents, which is further discussed in Chapter 12. In INSERT, SELECT, and UPDATE statements, applications can use parameter markers and host variables to exchange XML documents with the DB2 server. Code samples in various programming languages are provided in Chapter 21.
If you include an XML column name in the SELECT list of an SQL query, the column type in the result set is XML and the XML documents are implicitly serialized to their textual representation upon retrieval. Alternatively, the XMLSERIALIZE function allows you to perform explicit serialization. Explicit serialization means that the text form of the XML documents is returned in a non-XML data type of your choosing, such as BLOB, CLOB, or VARCHAR. The XMLSERIALIZE function can be used to force the generation of an XML declaration at the beginning of any document that you retrieve from DB2.
The XML standard defines several reserved characters as well as whitespace characters. Reserved characters, such as the less-than sign (<) [...]
" OFF='1666' LEN='421' />"
Figure 5.3 Content of the delimited format flat file cust_exp.del
The file cust_exp.del.001.xml contains all the XML documents from the exported XML column concatenated together, as shown in Figure 5.4. The second of the six documents is highlighted in bold. As indicated in the DEL file, it begins at byte offset 281 and has a length of 283.
Chapter 5
Moving XML Data
You can actually count the characters in Figure 5.4 to verify that this is true. Also note that this concatenation of documents does not produce a single well-formed document because a single root element is missing.
< name>Kathy Smith5 Rosewood< /street>TorontoOntarioM6W 1E6416-555-1358Kathy Smith25 EastCreekMarkhamOn tarioN9C 3T6905-555-7258Jim Noodle25 EastCreekMarkha mOntarioN9C 3T6905-555-7258 Robert Shoemaker1596 BaselineAuroraOntarioN8X 7F8905-55 5-7258416-555-2937905-555-8743613-555-3278...
Figure 5.4 Content of the XML data file cust_exp.del.001.xml
5.1.2 Exporting XML Documents as Individual Files
In some situations exporting each XML document into a separate file can be desirable. To do this you need to specify the clause MODIFIED BY with the option xmlinsepfiles. This is shown in Figure 5.5.
EXPORT TO c:\mydata\cust_exp.del OF DEL MODIFIED BY xmlinsepfiles SELECT * FROM customer2;
Figure 5.5 Exporting XML documents as separate files
This EXPORT command produces n + 1 files, where n is the number of XML documents in the exported XML column. In our example it produces the following seven files in the directory c:\mydata:
• cust_exp.del
• cust_exp.del.001.xml
• cust_exp.del.002.xml
• cust_exp.del.003.xml
• cust_exp.del.004.xml
• cust_exp.del.005.xml
• cust_exp.del.006.xml
5.1 Exporting XML Data in DB2 for Linux, UNIX, and Windows
The first file is the delimited format flat file that contains the relational data of the exported result set together with pointers to the exported XML documents. These pointers (XML Data Specifiers) look different now because each XML document is exported as a separate file in the file system (see Figure 5.6). Offset and length are no longer required, just the file name of each individual XML document. These file names are derived from the name of the delimited format flat file and extended with an increasing number and the extension .xml. The file numbers start with three digits and additional digits are used as needed when large numbers of documents are exported.
1000,"" />" />"
Figure 5.6 Content of the delimited format flat file cust_exp.del
Remember that the examples in this chapter use the table customer2, which has an INTEGER column and an XML column. The table customer, which is readily available in the DB2 sample database, has an INTEGER column and two XML columns, info and history. Since the history column is initially empty (NULL), exporting all columns from the customer table leads to odd-numbered file names: cust_exp.del.001.xml, cust_exp.del.003.xml, cust_exp.del.005.xml, and so on. The even-numbered file names would be used for the documents in the history column, but it is NULL and so these file names are not used.
The xmlinsepfiles option used in Figure 5.5 is just one of many possible options that can be specified in the MODIFIED BY clause of the EXPORT command. Table 5.1 summarizes other options relevant to XML data.
Table 5.1 XML Relevant Modifiers for the EXPORT Command
Modified by:
Description:
xmlinsepfiles
This option writes each XML document to a separate file. Without this option, all documents are by default concatenated into a single file.
xmlnodeclaration
This option produces XML documents without an XML declaration. Without this option, the default behavior is that each exported XML document carries an XML declaration with an encoding attribute.
xmlchar
This option writes the exported XML documents in the character codepage. The character codepage is the same as the application codepage unless the codepage option of the EXPORT command is specified. Without the xmlchar option, XML documents are by default written out in Unicode. Chapter 20 provides a deeper discussion of code pages and XML document encodings.
xmlgraphic
This option writes the exported XML documents in the UTF-16 code page regardless of the application code page or the codepage modifier.
5.1.3 Exporting XML Documents as Individual Files with Non-Default Names
If you want the exported XML documents to have file names that are not based on the file name of the delimited format flat file, use the XMLFILE clause of the EXPORT command to specify a different file name prefix. The command in Figure 5.7 exports the table customer2 and writes all XML documents to separate files whose names start with custdoc.
EXPORT TO c:\mydata\cust_exp.del OF DEL XMLFILE custdoc MODIFIED BY xmlinsepfiles SELECT * FROM customer2;
Figure 5.7 Exporting XML documents to files with custom file names
This command produces the following files:
• cust_exp.del
• custdoc.001.xml
• custdoc.002.xml
• custdoc.003.xml
• custdoc.004.xml
• custdoc.005.xml
• custdoc.006.xml
The XMLFILE clause can also be used without the xmlinsepfiles option. In that case, all documents are combined into a single file whose name starts with custdoc.
5.1.4 Exporting XML Documents to One or Multiple Dedicated Directories
The EXPORT command allows you to write the exported XML documents to a dedicated directory that is different from the directory to which the delimited format file is written. To achieve this, use the XML TO clause to specify an existing directory, as shown in Figure 5.8. This EXPORT command writes the delimited format flat file cust_exp.del to the directory /mydata, and the six XML documents in six separate files to the directory /mydata/customer.
EXPORT TO /mydata/cust_exp.del OF DEL XML TO /mydata/customer MODIFIED BY xmlinsepfiles SELECT * FROM customer2;
Figure 5.8 Exporting XML documents as individual files to a dedicated directory
If the XML TO clause specifies a list of multiple directories, as in Figure 5.9, the XML documents are distributed evenly among them in a round-robin fashion.
EXPORT TO /mydata/cust_exp.del OF DEL XML TO /mydata/cust1, /mydata/cust2 XMLFILE custdoc MODIFIED BY xmlinsepfiles SELECT * FROM customer2;
Figure 5.9 Exporting XML documents as separate files to multiple directories
This EXPORT command produces the following files:
• /mydata/cust1/custdoc.001.xml
• /mydata/cust1/custdoc.003.xml
• /mydata/cust1/custdoc.005.xml
• /mydata/cust2/custdoc.002.xml
• /mydata/cust2/custdoc.004.xml
• /mydata/cust2/custdoc.006.xml
You can later invoke the IMPORT or LOAD utility with the same two paths, /mydata/cust1 and /mydata/cust2, to have DB2 read the same documents in the same round-robin fashion.
If you specify multiple target directories in the XML TO clause but omit the xmlinsepfiles option, as in Figure 5.10, then the EXPORT utility concatenates the exported XML documents into multiple large files, one per target directory.
EXPORT TO /mydata/cust_exp.del OF DEL XML TO /mydata/cust1, /mydata/cust2 XMLFILE custdoc SELECT * FROM customer2;
Figure 5.10 Exporting XML documents to multiple directories
This EXPORT command produces the following three files:
• The delimited format flat file cust_exp.del in the directory /mydata
• A file called custdoc.001.xml in the directory /mydata/cust1
• A file called custdoc.002.xml in the directory /mydata/cust2
The exported XML documents are evenly distributed across the two files custdoc.001.xml and custdoc.002.xml. The delimited format flat file cust_exp.del contains the rows shown in Figure 5.11. It reveals that the first, third, and fifth documents are stored in the file custdoc.001.xml, while the second, fourth, and sixth documents are stored in custdoc.002.xml. Each document is precisely identified by its offset and length.
1000,"" OFF='563' LEN='412' />" OFF='691' LEN='421' />"
Figure 5.11 Content of the delimited format flat file cust_exp.del
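The round-robin placement described in this section can be expressed in a few lines. The following is a minimal sketch that reproduces the file layouts listed for Figure 5.9 and Figure 5.10; it only computes names and does not call the EXPORT utility. The function name plan_export and its parameters are assumptions for illustration.

```python
def plan_export(num_docs, directories, prefix, separate_files=True):
    """Return (document number, target path) pairs for a round-robin export.

    With separate_files=True every document gets its own numbered file.
    Otherwise all documents sent to the same directory share one file.
    """
    plan = []
    for i in range(num_docs):
        slot = i % len(directories)  # round-robin position of this document
        number = i + 1 if separate_files else slot + 1
        # :03d pads to three digits and grows when more digits are needed
        plan.append((i + 1, f"{directories[slot]}/{prefix}.{number:03d}.xml"))
    return plan


dirs = ["/mydata/cust1", "/mydata/cust2"]
separate = plan_export(6, dirs, "custdoc")
combined = plan_export(6, dirs, "custdoc", separate_files=False)
for doc_number, path in separate:
    print(doc_number, path)
```

In the combined case the plan contains only two distinct paths, which corresponds to the first, third, and fifth documents sharing custdoc.001.xml and the others sharing custdoc.002.xml.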
5.1.5 Exporting Fragments of XML Documents
Up to now we have looked at exporting whole documents. It is also possible to export document fragments that may or may not be well-formed documents. To achieve this you can use the EXPORT command with any XQuery or SQL/XML query, such as the ones that we discuss in Chapters 6 through 9, which cover XML queries. Let’s consider the following examples.
The command in Figure 5.12 exports all phone elements from each of the six XML documents in the info column of the table customer2. It writes six rows to the output files, one for each XML document. Each row contains one or more phone elements, depending on the number of phone elements in the respective document. If a row contains a sequence of multiple phone elements without a common root element, then this value is not a well-formed XML document.
EXPORT TO /mydata/phones.del OF DEL SELECT XMLQUERY('$INFO/customerinfo/phone') FROM customer2;
Figure 5.12 Exporting document fragments
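To make the row structure of such an export concrete, the following minimal sketch groups phone elements in two ways: one row per source document, as the command in Figure 5.12 does, and one row per element for comparison. It uses Python's standard XML library on two made-up documents and does not run any DB2 query. All function names and sample values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical customer documents.
docs = [
    "<customerinfo><name>A</name><phone>111</phone><phone>222</phone></customerinfo>",
    "<customerinfo><name>B</name><phone>333</phone></customerinfo>",
]


def phones(document):
    """Serialize every phone element of one document."""
    root = ET.fromstring(document)
    return [ET.tostring(p, encoding="unicode") for p in root.findall("phone")]


def rows_per_document(documents):
    """One row per document; a row may hold a sequence of elements."""
    return ["".join(phones(d)) for d in documents]


def rows_per_element(documents):
    """One row per phone element, regardless of the source document."""
    return [p for d in documents for p in phones(d)]


def is_well_formed(text):
    """True if text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


per_doc = rows_per_document(docs)
per_elem = rows_per_element(docs)
for row in per_doc:
    print(row, is_well_formed(row))
```

The first row holds two phone elements without a common root and therefore is not a well-formed document, whereas every row in the per-element grouping is.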
The query in the EXPORT command can also be an XPath or XQuery expression, as shown in Figure 5.13. Similar to the previous example in Figure 5.12, this command also exports all phone elements from all six customer documents. However, it writes each phone element to a separate row in the output file, even if multiple phone elements come from the same XML document. This is because XQuery and SQL/XML queries that seem to be equivalent can produce result sets with different cardinalities. For details, please refer to Chapter 8 (see section 8.3.3, Result Set Cardinalities in XQuery and SQL/XML).
EXPORT TO /mydata/phones.del OF DEL XQUERY db2-fn:xmlcolumn("CUSTOMER2.INFO")/customerinfo/phone;
Figure 5.13 Exporting document fragments as well-formed documents
5.1.6 Exporting XML Data with XML Schema Information
An XML column can contain XML documents that have been validated against one or multiple XML Schemas when they were inserted or loaded. When you export validated XML documents, the EXPORT utility can produce information that tells you for each document which XML Schema it belongs to. This is achieved with the XMLSAVESCHEMA option in the EXPORT command. For each exported XML document that was validated against an XML Schema, the fully qualified SQL identifier of that XML Schema is stored as an attribute (SCH) in the corresponding XML Data Specifier (XDS). The SQL identifier of the XML Schema is the name under which you registered the XML Schema in DB2. If the exported document was not validated against an XML Schema or the schema no longer exists in the database, the SCH attribute is not included in the corresponding XDS. Figure 5.14 shows the command to export documents with XML Schema information.
EXPORT TO /mydata/cust_exp.del OF DEL MODIFIED BY xmlinsepfiles XMLSAVESCHEMA SELECT * FROM customer2;
Figure 5.14 Exporting documents specifying the XML Schema
The delimited format flat file produced might look like the one in Figure 5.15. In this example it shows that the first two documents were validated against the XML Schema with the SQL identifier DB2ADMIN.CUSTXSD. The third and the fifth documents were validated against schema DB2ADMIN.CUSTXSD2, while the fourth and the sixth documents are not associated with any XML Schema. This information reflects how documents were validated at insert time, if at all. If you load or import the exported XML documents and use this delimited format flat file as input, the documents can be validated against their respective XML Schemas, if those schemas exist in the database.
1000,"" SCH='DB2ADMIN.CUSTXSD2'/>" />"
Figure 5.15 Content of the delimited format flat file cust_exp.del
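An application that consumes such a delimited file may want to know which XML Schema, if any, each row refers to. The following is a minimal sketch that reads the XDS of each row and reports its SCH attribute. It assumes the XDS is an element of the form <XDS FIL='...' SCH='...' />, where FIL names the file; the two sample rows and the function name parse_del_line are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def parse_del_line(line):
    """Split one delimited row into its integer key and its XDS attributes."""
    fields = next(csv.reader(io.StringIO(line)))  # handles the double quotes
    xds = ET.fromstring(fields[1])                # the XDS is a small XML element
    return int(fields[0]), dict(xds.attrib)


# Hypothetical rows in the style of an export with XMLSAVESCHEMA.
lines = [
    "1000,\"<XDS FIL='cust_exp.del.001.xml' SCH='DB2ADMIN.CUSTXSD' />\"",
    "1003,\"<XDS FIL='cust_exp.del.004.xml' />\"",
]
parsed = [parse_del_line(line) for line in lines]
for key, attrs in parsed:
    # A missing SCH attribute means no schema is associated with the document.
    print(key, attrs["FIL"], attrs.get("SCH"))
```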
5.2 IMPORTING XML DATA IN DB2 FOR LINUX, UNIX, AND WINDOWS
In DB2 9.1 for Linux, UNIX, and Windows you can use the IMPORT utility to move XML data into an XML column. Since DB2 Version 9.5 you can also use the LOAD utility to load XML data. The choice between IMPORT and LOAD is largely dependent on operating considerations, which are similar for XML as for relational data:
• The LOAD utility typically performs better than the IMPORT utility because
  • It operates at the DB2 page level, whereas the IMPORT utility operates at the row level.
  • The data loaded by the LOAD utility is not logged in the transaction log.
  • The LOAD utility automatically parallelizes its workload.
• If you use the IMPORT utility, then the target table can be kept fully accessible to other applications for insert and query operations at all times. In particular, you can start an IMPORT operation while other queries on the table are in progress. The LOAD utility has an online mode that allows queries (but no writes) against the target table while the LOAD is in progress. However, queries that started prior to the LOAD must be quiesced before a LOAD or online LOAD can be started.
• If you have triggers on the target table, then these are fired if the IMPORT utility is used, but are not fired if the LOAD utility is used.
• Both the IMPORT and LOAD utilities can optionally perform XML Schema validation and preserve whitespace in the XML documents.
The IMPORT and LOAD utilities can be viewed as inverse operations to EXPORT. In particular, the IMPORT and LOAD utilities can directly consume the output produced by the EXPORT utility; that is, a delimited format flat file that contains pointers to the XML documents that reside in one or multiple separate files. If you want to IMPORT or LOAD data that wasn’t previously exported with the EXPORT command, you need to produce a delimited format file that looks as if it had been produced by the EXPORT utility.
5.2.1 IMPORT Command and Input Files
Assume you want to use the IMPORT command to add new rows to the table customer2, and that you have a directory c:\mydata in the file system that contains several files with XML documents that you want to import. This directory could contain thousands of files, but in this example let’s assume that you just have two XML files called data2.xml and data3.xml, each containing a single XML document. You can produce a delimited format flat file, such as the file data.del in Figure 5.16, which contains two columns. The first column holds INTEGER values for the first column of the target table, and the second column holds pointers to the XML documents that you want to import into the second column of the target table.
2000,"<XDS FIL='data2.xml' />"
2001,"<XDS FIL='data3.xml' />"
Figure 5.16 Content of the delimited format input file data.del
With this delimited format input file you can execute the IMPORT command shown in Figure 5.17. It assumes that the file data.del as well as the XML documents data2.xml and data3.xml are all located in the current directory. The keywords OF DEL indicate that the input file data.del is of type delimited format.
IMPORT FROM data.del OF DEL INSERT INTO customer2;
Figure 5.17 Importing XML documents
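When a directory holds many XML files, a delimited format input file like data.del is usually generated by a script rather than written by hand. The following is a minimal sketch that lists the .xml files of a directory and writes one row per file. It assumes the XDS form <XDS FIL='...' /> and consecutive INTEGER keys starting at a chosen value; the function name build_del_lines and the temporary directory used for the demonstration are illustrative only.

```python
import os
import tempfile


def build_del_lines(xml_dir, first_id):
    """Return one delimited row per .xml file, in file name order."""
    names = sorted(n for n in os.listdir(xml_dir) if n.endswith(".xml"))
    lines = []
    for i, name in enumerate(names):
        xds = f"<XDS FIL='{name}' />"
        lines.append(f'{first_id + i},"{xds}"')
    return lines


# Demonstration in a temporary directory with two placeholder documents.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data2.xml", "data3.xml"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("<customerinfo/>")
    lines = build_del_lines(tmp, 2000)
    with open(os.path.join(tmp, "data.del"), "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

print("\n".join(lines))
```

Sorting the file names keeps the assignment of keys to documents reproducible between runs.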
If the required files are not located in the local directory then you must provide appropriate paths. For example, if the file data.del is located in the directory c:\mydata, and the XML documents are in the directory c:\mydata\myxml, then the IMPORT command in Figure 5.18 obtains the files from the appropriate locations.
IMPORT FROM c:\mydata\data.del OF DEL XML FROM c:\mydata\myxml INSERT INTO customer2;
Figure 5.18 Importing XML documents from specific locations
NOTE: Incorrect file paths in the IMPORT command are a very common mistake, so pay extra attention to them!
If you need to load XML data that was previously exported to multiple directories, specify the list of directories in the XML FROM clause. This clause corresponds to the XML TO clause of the EXPORT command.
If the two XML documents data2.xml and data3.xml happen to be concatenated as a single file (for example, docs.xml), then the delimited format input file needs to specify offset and length for each document, as in Figure 5.19. The first XML document starts at an offset of 0 bytes into the file and is 281 bytes long. The second XML document starts at offset 281 and is 283 bytes long, and so on for all XML documents that may be in the same file. Since it is tedious to determine the number of bytes of each document, such an input file with offsets and lengths is typically only used if it is available from a previous EXPORT operation or generated by an application.
2000,"<XDS FIL='docs.xml' OFF='0' LEN='281' />"
2001,"<XDS FIL='docs.xml' OFF='281' LEN='283' />"
Figure 5.19 Input file for multiple concatenated documents
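An application that generates such an input file has to track byte offsets, not character counts, while it concatenates the documents. The following is a minimal sketch under the assumption that the XDS has the form <XDS FIL='...' OFF='...' LEN='...' />. The function names and the two sample documents are made up for illustration.

```python
def concatenate_with_offsets(documents, data_file_name, first_id):
    """Concatenate byte strings and build one delimited row per document."""
    blob = b""
    lines = []
    for i, doc in enumerate(documents):
        # OFF is the number of bytes written so far, LEN the size of this document.
        xds = f"<XDS FIL='{data_file_name}' OFF='{len(blob)}' LEN='{len(doc)}' />"
        lines.append(f'{first_id + i},"{xds}"')
        blob += doc
    return blob, lines


def extract(blob, offset, length):
    """Read one document back from the concatenated data."""
    return blob[offset:offset + length]


# Hypothetical documents; the second contains a two-byte UTF-8 character.
documents = [
    b"<name>Kathy Smith</name>",
    "<name>Zo\u00eb</name>".encode("utf-8"),
]
blob, lines = concatenate_with_offsets(documents, "docs.xml", 2000)
print("\n".join(lines))
```

The second document has 16 characters but occupies 17 bytes, which is why the lengths are taken from the encoded byte strings.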
As an aside, what happens if you have more than one XML column in the target table? To populate a table with two XML columns, the delimited format input file has to contain two XML Data Specifiers (XDS) per row, one for each XML column that you want to populate. Such an input file is shown in Figure 5.20.
2000,"",""
2001,"",""
Figure 5.20 Input file to populate an integer column and two XML columns
When you import, insert, or load XML data, boundary whitespace is stripped from the XML documents by default (see section 4.7, Understanding XML Whitespace and Document Storage). If you want to preserve boundary whitespace, specify the XMLPARSE PRESERVE WHITESPACE clause in the IMPORT command (see Figure 5.21).
IMPORT FROM c:\mydata\cust_exp.del OF DEL XML FROM c:\mydatadata XMLPARSE PRESERVE WHITESPACE INSERT INTO customer2;
Figure 5.21 Importing XML data into a table and preserving whitespace
5.2.2 Import/Insert Performance Tips
Several performance guidelines are common to all methods of populating a table with XML data. If you have multiple user-defined XML indexes on a table, it is typically better to define them before populating the table rather than creating them afterwards, because during INSERT, LOAD, or IMPORT each XML document is processed only once to generate index entries for all XML indexes. In contrast, if multiple CREATE INDEX statements are issued after the table has been populated, all documents in the XML column are traversed multiple times, once for each index.
Even if you have not defined any indexes on the target table, DB2’s pureXML storage mechanism transparently maintains regions and path indexes for efficient XML storage access (see Chapter 3, Designing and Managing XML Storage Objects). Take these indexes into account when determining buffer pool sizes.
Just as for relational data, you can issue the ALTER TABLE APPEND ON command, which enables append mode for the table. New data is appended to the end of the table instead of searching for free space on existing pages. This can provide for improved runtime performance of bulk inserts or import.
You can avoid logging if you use the ALTER TABLE ACTIVATE NOT LOGGED INITIALLY command. However, be warned that if there is a statement failure, the table will be marked as inaccessible and must be dropped. This risk often prohibits using the NOT LOGGED INITIALLY (NLI) option for incremental bulk inserts in production systems. The option can be useful for the initial population of an empty table. Beware that NLI prevents concurrent inserts/imports into a target table and that parallelism can yield higher performance than NLI.
If you use the IMPORT command, a small value for the COMMITCOUNT parameter tends to hurt performance. Committing every 100 rows or more will perform better than committing every row. An IMPORT command with an explicit COMMITCOUNT parameter is shown in Figure 5.22.
IMPORT FROM c:\mydata\data.del OF DEL XML FROM c:\mydata COMMITCOUNT 100 INSERT INTO customer2;
Figure 5.22 IMPORT command with COMMITCOUNT parameter
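The batching idea behind COMMITCOUNT also applies to applications that insert documents themselves. The following is a minimal sketch that commits after every commitcount rows and counts the commits. It uses Python's built-in sqlite3 module purely as a stand-in so that the example runs without a DB2 server; the table layout, the function name import_rows, and the placeholder documents are assumptions, and the sketch does not measure or claim any DB2 performance figures.

```python
import sqlite3


def import_rows(conn, rows, commitcount):
    """Insert rows and commit after every `commitcount` rows.

    Returns the number of commits that were issued.
    """
    commits = 0
    pending = 0
    for cid, doc in rows:
        conn.execute("INSERT INTO customer2 (cid, info) VALUES (?, ?)", (cid, doc))
        pending += 1
        if pending == commitcount:
            conn.commit()
            commits += 1
            pending = 0
    if pending:  # commit the last, partially filled batch
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer2 (cid INTEGER, info TEXT)")
rows = [(2000 + i, "<customerinfo/>") for i in range(250)]
commits = import_rows(conn, rows, 100)
row_count = conn.execute("SELECT COUNT(*) FROM customer2").fetchone()[0]
print(commits, row_count)
```

With 250 rows and a commit count of 100, three commits are issued instead of 250, which is the kind of reduction the guideline above aims for.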
To achieve higher performance than provided by the IMPORT utility, consider using the LOAD utility instead, which automatically parallelizes its work.
5.3 LOADING XML DATA IN DB2 FOR LINUX, UNIX, AND WINDOWS
Since DB2 9.5 for Linux, UNIX, and Windows you can use the LOAD utility to move XML documents into a table. The key advantages of the LOAD utility are the same for XML as for relational data. For example, the data is not logged and parallelism is automatically used to increase performance. DB2 determines a default degree of parallelism based on the number of CPUs and table space containers. The syntax for handling XML data in the LOAD command is the same as the XML-specific syntax in the IMPORT command. For example, the only difference between the LOAD command in Figure 5.23 and the IMPORT command in Figure 5.18 is that the keyword IMPORT has been replaced by the keyword LOAD.
LOAD FROM c:\mydata\data.del OF DEL XML FROM c:\mydata\myxml INSERT INTO customer2;
Figure 5.23 Example of a LOAD command
The LOAD command has several optional parameters that can affect performance. DB2 automatically determines suitable values for these parameters, so you can usually obtain good load performance out-of-the-box without setting any parameters. If you want to try to improve load performance, consider the following parameters:
• DATA BUFFER: This parameter specifies the number of 4KB pages (regardless of the degree of parallelism) to use as buffered space for transferring data within the utility. The data buffers use the utility heap, whose size can be modified through the util_heap_sz database configuration parameter. Large degrees of parallelism require a larger util_heap_sz.
• CPU_PARALLELISM: This parameter specifies the number of threads that the LOAD utility uses for parsing, converting, and formatting records.
• DISK_PARALLELISM: This parameter specifies the number of threads that the LOAD utility uses for writing data to the table space.
After a LOAD operation, the loaded table might be in SET INTEGRITY PENDING state in either READ or NO ACCESS mode. This means that the table is only available for read or not available at all. You can check whether the loaded table is in SET INTEGRITY PENDING status (also known as CHECK PENDING status) by looking at the STATUS column of the catalog view SYSCAT.TABLES and checking for a STATUS value equal to "C" (see Figure 5.24). The value "C" means CHECK PENDING.
SELECT SUBSTR(tabschema,1,10) AS tabschema, SUBSTR(tabname,1,10) AS tabname, status
FROM syscat.tables
WHERE status = 'C';

TABSCHEMA  TABNAME    STATUS
---------- ---------- ------
DB2ADMIN   CUSTOMER   C

Figure 5.24 Listing tables that are in CHECK PENDING state
One of the most common reasons why a table is placed in CHECK PENDING state after a LOAD operation is that the table has check constraints or referential integrity constraints defined on it. To take a table out of CHECK PENDING state, issue the SET INTEGRITY command:
SET INTEGRITY FOR db2admin.customer2 IMMEDIATE CHECKED
DB2 performs minimal logging for the LOAD utility, because the operations are performed at the DB2 page level and not the DB2 row level. If you have DB2 archive logging enabled (disabled by default) and use the LOAD command, then the table will be placed in BACKUP PENDING status after the load. After the load operation you have to take a backup of the table space containing the table before you issue the SET INTEGRITY command.
An alternative to taking the backup is to specify the COPY YES option in the LOAD command. This option instructs DB2 to perform a backup of the new data while it is being loaded, which avoids the BACKUP PENDING state. Another alternative is to specify the NONRECOVERABLE option in the LOAD command. This option means the table space is not put in BACKUP PENDING state following the LOAD operation and a copy of the loaded data does not have to be made during the load. However, it is not possible to recover the table by a subsequent roll forward action.
You can also move XML data from one table to another using the “load from cursor” option of the LOAD utility. This option allows you to move data between tables without having to unload the data first. In Figure 5.25 a cursor curs is declared. The subsequent LOAD command uses this cursor to move data from the table customer into the table customer3. Loading XML data from a cursor is supported for tables in the same database but not for moving XML data from one database to another (error SQL1407N).
DECLARE curs CURSOR FOR SELECT cid, info FROM customer;
LOAD FROM curs OF CURSOR INSERT INTO customer3(cid,info);
Figure 5.25
Example of loading data from a cursor
5.4
UNLOADING XML DATA IN DB2 FOR Z/OS
You have two options for unloading data from DB2 for z/OS. You can either use the DSNTIAUL utility or the UNLOAD utility. An example of using the DSNTIAUL utility to unload data from a table called customer is shown in Figure 5.26. The execution of the DSNTIAUL utility in Figure 5.26 produces two output files, pointed to by SYSREC00 and SYSPUNCH. The SYSPUNCH sequential dataset contains the LOAD statement for you to be able to load the unloaded data into a new table. The SYSREC00 sequential dataset contains the unloaded data, including the XML data.
//DSNTIAUL EXEC PGM=IKJEFT01
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSREC00 DD DSN=USER123.DSN8UNLD.SYSREC00,VOL=SER=P8P007,
//            UNIT=SYSDA,SPACE=(32760,(1000,500)),DISP=(,CATLG)
//SYSPUNCH DD DSN=USER123.DSN8UNLD.SYSPUNCH,
//            UNIT=SYSDA,SPACE=(800,(15,15)),DISP=(,CATLG),
//            RECFM=FB,LRECL=120,BLKSIZE=1200,VOL=SER=P8P007
//SYSTSIN  DD *
DSN SYSTEM(ISC9)
RUN PROGRAM(DSNTIAUL) PLAN(DSNTIB91) PARMS('SQL') LIB('ISC910P8.RUNLIB.LOAD')
END
//SYSIN    DD *
SELECT * FROM CUSTOMER;
Figure 5.26
Unloading data using the DSNTIAUL utility
You can also use the UNLOAD utility to unload XML data. Remember that in DB2 for z/OS, the XML data of an XML column always resides in an XML table space, separate from the base table space. In the UNLOAD statement you just need to specify the base table space. You do not have to specify the XML table space. An example is shown in Figure 5.27, where the data is unloaded in delimited format. Once you have determined the table space and database for the table you want to unload, you can plug these values into the unload job as shown in Figure 5.27.
//UNLOAD   EXEC DSNUPROC,PARM='ISC9,IANTEX',COND=(4,LT)
//SORTLIB  DD DSN=SYS1.SORTLIB,DISP=SHR
//SORTOUT  DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//DSNTRACE DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSREC   DD DSN=USER123.UNLOAD.SYSREC,
//            DISP=(MOD,CATLG,CATLG),
//            UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSPUNCH DD DSN=USER123.UNLOAD.SYSPUNCH,
//            DISP=(MOD,CATLG,CATLG),
//            UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSIN    DD *
UNLOAD TABLESPACE DSN00191.CUSTOMER
  DELIMITED CHARDEL X'22' COLDEL X'2C' DECPT X'2E'
  FROM TABLE CUSTOMER
  (CID POSITION(*) INT,
   INFO POSITION(*) XML)
  UNICODE
/*
Figure 5.27
Unloading data using the UNLOAD utility
For maximum portability, you should specify UNICODE in the UNLOAD statement and use Unicode delimiter characters. If XML columns are not being unloaded in UTF-8 CCSID 1208, the unloaded column values are prefixed with a standard XML encoding declaration that specifies the encoding that is used.
If the table that you unload contains XML documents larger than 32KB, you need to use file reference variables (FRV) to unload the XML data to a separate partitioned data set (PDS) or hierarchical file system (HFS) file. Figure 5.28 shows an unload to a PDS.
//SYSIN DD *
TEMPLATE XMLHERE DSN 'USER123.&DB..&TS..UNLOAD'
  DSNTYPE(PDS) UNIT(SYSDA)
UNLOAD DATA
  DELIMITED CHARDEL X'22' COLDEL X'2C' DECPT X'2E'
  FROM TABLE CUSTOMER
  (CID INT,
   INFO VARCHAR(255) CLOBF XMLHERE)
  UNICODE
/*
Figure 5.28
SYSIN cards for unloading XML documents larger than 32KB
Let’s look at how the SYSIN cards in Figure 5.28 are constructed. The first two lines define a template with the name XMLHERE. The template declares the output naming pattern for the XML data files. The variables &DB and &TS take the value of the database and table space where the XML data is unloaded from. The parameter DSNTYPE specifies the type of volume for the unloaded data. If PDS is specified, then this limits the output dataset to a single volume. This is also the default if no DSNTYPE is specified. If the output should use multiple volumes, then you must specify HFS.
Next is the UNLOAD DATA statement. The line starting with DELIMITED defines how the data is to be delimited. The last line specifies that the XML documents that are unloaded from the XML column INFO are represented in the output data by file names of up to 255 characters. The type VARCHAR(255) defines the data type of the XML file names, not of the actual XML data. The keyword CLOBF tells UNLOAD to use File Reference Variables (FRV) and to store the XML documents as CLOB files. You can also specify BLOBF or DBCLOBF as possible output file formats. The template name XMLHERE tells UNLOAD to name the XML files according to the template that was defined in the first line. If you do not specify EBCDIC, ASCII, UNICODE, or CCSID, the encoding scheme of the source data is preserved.
If the output PDS that will contain the XML documents does not exist, the job will create it for you. The names of the output files are stored in the SYSREC data set as strings, as shown in Figure 5.29.
1000.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQCY)
1001.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQDR)
1002.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQEB)
...
Figure 5.29
Contents of SYSREC DS when unloading documents larger than 32KB
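If you want to post-process the SYSREC records outside of DB2, for example to list which PDS member holds the document for a given cid, a small script can take them apart. The following Python sketch is only an illustration and not part of any DB2 tooling. It assumes that each record looks exactly like the lines printed in Figure 5.29, that is, the cid value, a period, the data set name, and the member name in parentheses. The names FileReference and parse_record are made up for this example.

```python
import re
from typing import NamedTuple


class FileReference(NamedTuple):
    """One SYSREC record: the relational cid plus the location of the XML file."""
    cid: int
    data_set: str
    member: str


# Assumed record layout, taken from Figure 5.29: cid "." data set "(" member ")"
RECORD = re.compile(r"^(\d+)\.([A-Z0-9.]+)\(([A-Z0-9]+)\)$")


def parse_record(line):
    """Split one SYSREC record into cid, data set name, and member name."""
    match = RECORD.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected SYSREC record: {line!r}")
    cid, data_set, member = match.groups()
    return FileReference(int(cid), data_set, member)


records = [
    "1000.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQCY)",
    "1001.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQDR)",
    "1002.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQEB)",
]
references = [parse_record(record) for record in records]
for reference in references:
    print(reference.cid, reference.data_set, reference.member)
```

If your unload job writes the records with different delimiters, the regular expression needs to be adjusted accordingly.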
You can see that the value of the relational column cid is the first part of each record. Each of the output files pointed to by the remainder of the record contains an XML document. Note the random member name. If the dataset already contains members when the job is run, then the existing members are not deleted, but new members (again with random names) are added. But the dataset that SYSREC points to is overwritten with the new names.
The dataset pointed to by SYSPUNCH contains the statements that you need to put into a LOAD job, as shown in Figure 5.30. Such a LOAD job is discussed in section 5.5.
LOAD DATA INDDN SYSREC LOG NO RESUME YES
  UNICODE CCSID(01208,01208,01208)
  FORMAT DELIMITED COLDEL X'2C' CHARDEL X'22' DECPT X'2E'
  SORTKEYS 3
  INTO TABLE "USER123"."CUSTOMER"
  ("CID" POSITION(*) INTEGER,
   "INFO" POSITION(*) VARCHAR CLOBF MIXED PRESERVE WHITESPACE)
Figure 5.30
Output SYSPUNCH DS when unloading records larger than 32KB
5.5
LOADING XML DATA IN DB2 FOR Z/OS
To load data into tables you use the LOAD utility, as shown in Figure 5.31. The data that was unloaded in Figure 5.27 is being loaded into a new table called customer2. This table has an INTEGER column and an XML column. Remember that only well-formed XML documents can be loaded into an XML column.
//LOAD01   EXEC DSNUPROC,PARM='ISC9,IANTEX',COND=(4,LT)
//SORTLIB  DD DSN=SYS1.SORTLIB,DISP=SHR
//SORTOUT  DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK01 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK02 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK03 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK04 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//DSNTRACE DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//MYSYSREC DD DSN=USER123.UNLOAD.SYSREC,DISP=SHR
//SYSUT1   DD UNIT=SYSDA,SPACE=(4000,(50,50),,,ROUND)
//SYSERR   DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSDISC  DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSMAP   DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSIN    DD *
LOAD DATA INDDN (MYSYSREC) LOG NO RESUME YES
  UNICODE CCSID(01208,01208,01208)
  FORMAT DELIMITED COLDEL X'2C' CHARDEL X'22' DECPT X'2E'
  SORTKEYS 3
  INTO TABLE "USER123"."CUSTOMER2"
  ( "CID" POSITION(*) INTEGER
  , "INFO" POSITION(*) XML PRESERVE WHITESPACE
  )
/*
Figure 5.31
Example of a DB2 for z/OS LOAD job
Note:
• If you have unloaded the data previously, using the jobs shown in Figure 5.26 or Figure 5.27, then the SYSIN records are the contents of the data set that the SYSPUNCH DD card points to in these jobs.
• The PRESERVE WHITESPACE option has been specified for the XML column. It can be omitted, in which case the default behavior is not to preserve whitespace.
• If you omit the UNICODE CCSID line, then you get the following error: “RECORD (1) WILL BE DISCARDED DUE TO 'CID' CONVERSION ERROR”. The Unicode input data for FORMAT DELIMITED must be UTF-8, which is CCSID 1208.
• The COLDEL parameter specifies the column delimiter that is used in the input file. The default value is a comma (,). For ASCII and UTF-8 data this is X'2C', and for EBCDIC data it is X'6B'. The CHARDEL parameter specifies the character string delimiter that is used in the input file. The default value is a double quotation mark ("). For ASCII and UTF-8 data this is X'22', and for EBCDIC data it is X'7F'. The DECPT parameter specifies the decimal point character that is used in the input file. The default value is a period (.). For ASCII and UTF-8 data this is X'2E', and for EBCDIC data it is X'4B'.
When the XML data is loaded as a part of regular input records, specify XML as the input field type. The target column must be an XML column. The LOAD utility treats XML columns as variable-length data when loading XML directly from input records and expects a two-byte length field preceding the actual XML value.
The internal XML tables are loaded when the base table is loaded. You cannot specify the name of the internal XML table for load. You also cannot directly load the DocID column of the base table space or specify a default value for an XML column.
You can load XML documents from regular input records if the total input record length is less than 32KB. XML documents that don’t fit into 32KB input records must be loaded from separate files. To achieve this you need to replace the SYSIN cards in Figure 5.31 with those in Figure 5.30. The SYSREC input dataset is the dataset you specified in the UNLOAD job in Figure 5.27.
If you have documents larger than 32KB that come from a source other than a previous unload, you can load these into a table as follows. As an example let us use a document called DOC01, which is also the member name in a partitioned dataset called USER123.XMLLOAD. First you need to edit the dataset pointed to by SYSREC and add the relational value for the Cid column of the row, as shown next:
2000.USER123.XMLLOAD(DOC01)
You can now use exactly the same SYSIN cards as before to load this document into the table customer2.
Note that DB2 for z/OS does not compress an XML table space during the LOAD process. If the XML table space is defined with COMPRESS YES, then you have to run a REORG to compress the data.
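The hexadecimal values that you pass to COLDEL, CHARDEL, and DECPT are simply the code points of the delimiter characters in the encoding of the input file. The following Python sketch, which is an illustration and not a DB2 utility, derives them for the three default delimiters. It uses Python's cp037 codec as a stand-in for EBCDIC; that choice is an assumption, and your system may use a different EBCDIC code page. The names DEFAULT_DELIMITERS, hex_literal, and delimiter_table are made up for this example.

```python
# Default delimiter characters as described for the LOAD utility.
DEFAULT_DELIMITERS = {"COLDEL": ",", "CHARDEL": '"', "DECPT": "."}


def hex_literal(character, codec):
    """Return the X'..' literal for a single-byte character in the given codec."""
    encoded = character.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{character!r} is not a single byte in {codec}")
    return "X'{:02X}'".format(encoded[0])


def delimiter_table(codec):
    """Map each delimiter option to its hex literal for one encoding."""
    return {
        option: hex_literal(character, codec)
        for option, character in DEFAULT_DELIMITERS.items()
    }


utf8_values = delimiter_table("utf-8")    # also valid for ASCII
ebcdic_values = delimiter_table("cp037")  # cp037 is an assumed EBCDIC code page

for option in DEFAULT_DELIMITERS:
    print(option, utf8_values[option], ebcdic_values[option])
```

A table like this can help you double-check a LOAD or UNLOAD statement before you move a delimited file between an EBCDIC and a Unicode environment.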
5.6
VALIDATING XML DOCUMENTS DURING LOAD AND INSERT OPERATIONS
When you use the LOAD or IMPORT utilities in DB2 for Linux, UNIX, and Windows to move a large number of XML documents into a table, you can validate these documents against an XML Schema. Simply add the clause XMLVALIDATE USING SCHEMA to the LOAD or IMPORT command, as illustrated in Figure 5.32.
LOAD FROM c:\mydata\load_customer.txt OF DEL XML FROM c:\mydatadata XMLVALIDATE USING SCHEMA db2admin.custxsd INSERT INTO customer;
Figure 5.32
Performing XML Schema validation during LOAD
In DB2 for z/OS there is no XMLVALIDATE option for the LOAD utility but you can validate documents after loading them into a table. This and other validation topics are covered in Chapter 17, Validating XML Documents against XML Schemas.
5.7
SPLITTING LARGE XML DOCUMENTS INTO SMALLER DOCUMENTS
Most programmers find it convenient and efficient to work with an XML document granularity that matches the logical business objects of the application and the predominant granularity of access. For example, a single document per purchase order, per trade, per contract, per tax return, per customer, and so on is usually a good idea. Smaller documents can be manipulated more efficiently than larger ones. Also, indexed access and data retrieval is faster for smaller documents. However, for a bulk transfer of XML data outside the database, such as FTP, it is often not convenient to handle thousands or millions of separate documents. Therefore, it is common to receive large XML documents, often several hundred megabytes per file, which contain many repeating blocks that represent independent objects. Many external XML tools fail, or have severe problems, when you try to open such large XML documents, typically due to document object model (DOM) parsing and memory limitations. DB2 can ingest XML documents up to 2GB. Optionally, you can split them into smaller documents using the XMLTABLE function. The XMLTABLE function is discussed in detail in Chapter 7, Querying XML Data with SQL/XML. Here we show one simple example of how it can split up documents.
Assume you need to manage many XML documents with the following (simplified) structure: 1 Heather 12.34
You may receive many of these documents in one large file that has a root element <accounts>. The root element is required for the file to be a well-formed document. Otherwise it cannot be processed in DB2. The large file looks like this: 1 Heather 12.34 2 Helen 56.78 …
Your first step is to insert, import, or load this document into a staging table that has a column of type XML, such as this one:
CREATE TABLE staging(xcol XML)
When this table contains the large document in a single row, you can read the document from the staging table, split it into the individual account documents, and insert those into the following target table:
CREATE TABLE accounts(acc XML)
To split the large document, use one of the two INSERT statements in Figure 5.33. Both accomplish the same thing; that is, they produce one row (document) in the target table for each account element in the large input document. You must create an XML document node for each newly created account document, either with the SQL/XML function XMLDOCUMENT, or with the XQuery function document{}. The latter is only available in DB2 for Linux, UNIX, and Windows. The first of the two statements in Figure 5.33 is suitable for DB2 for z/OS.
INSERT INTO accounts(acc)
SELECT XMLDOCUMENT(x.val)
FROM staging,
     XMLTABLE('$x/accounts/account' passing xcol as "x"
              COLUMNS val XML PATH '.') AS x;

INSERT INTO accounts(acc)
SELECT x.val
FROM staging,
     XMLTABLE('$XCOL/accounts/account'
              COLUMNS val XML PATH 'document{.}') AS x;
Figure 5.33
Splitting a large document
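To make the effect of the split concrete, here is a minimal Python sketch of the same idea outside the database. It is not how DB2 implements XMLTABLE. It only shows that every account element under the accounts root becomes a document of its own. The child element names id, name, and balance are assumptions made for this illustration, as is the function name split_accounts.

```python
import xml.etree.ElementTree as ET

# A small stand-in for the large input file. The child element names
# (id, name, balance) are hypothetical.
LARGE_DOCUMENT = (
    "<accounts>"
    "<account><id>1</id><name>Heather</name><balance>12.34</balance></account>"
    "<account><id>2</id><name>Helen</name><balance>56.78</balance></account>"
    "</accounts>"
)


def split_accounts(large_document):
    """Return one serialized document per account element under the root."""
    root = ET.fromstring(large_document)
    if root.tag != "accounts":
        raise ValueError("expected a root element named accounts")
    # Corresponds to the row-generating path accounts/account in Figure 5.33.
    return [ET.tostring(account, encoding="unicode")
            for account in root.findall("account")]


small_documents = split_accounts(LARGE_DOCUMENT)
for document in small_documents:
    print(document)
```

Each string in small_documents is a well-formed document on its own, which corresponds to one row in the target table accounts.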
After the insert operation, select the data from accounts to verify that the large input document has been split correctly (see Figure 5.34).
SELECT acc FROM accounts;
1 Heather 12.34
2 Helen 56.78

describe xquery db2-fn:xmlcolumn('CUSTOMER.INFO')
Column Information
Number of columns: 1
SQL type              Type length  Column name            Name length
--------------------  -----------  ---------------------  -----------
988   XML                       0  INFO                             4
db2 =>
Figure 6.19
Describing an XQuery
You can run the query in Figure 6.18 in the DB2 Command Line Processor (CLP) or any other interface, such as the Command Editor that’s part of the DB2 Control Center, IBM Data Studio, or, for example, via JDBC from a Java application. When the XML type data is returned from the DB2 server to any such client it is automatically serialized; that is, converted from DB2’s internal tree format to XML text. The CLP displays at most 4,000 bytes of XML text per row. Any XML column values shorter than this are padded with blanks. Any XML data beyond 4,000 bytes per row is truncated in the CLP display. To avoid truncation and to see the full XML output, you can use the DB2 EXPORT utility (see Chapter 5, Moving XML Data) or a tool such as IBM Data Studio.
The table and column name in the db2-fn:xmlcolumn() function must be enclosed in either single quotes or double quotes. They typically also need to be in uppercase. This is because DB2 table and column names default to uppercase, unless you use quotes in the CREATE TABLE statement to force a lowercase table or column name.
Now that you are familiar with the mechanics of running XPath in DB2, let’s run the XPath expression previously shown in Figure 6.17. Simply append the path /customerinfo/phone to the db2-fn:xmlcolumn() function, as shown in Figure 6.20. The result is exactly the same as in Figure 6.17.
db2 => xquery db2-fn:xmlcolumn('CUSTOMER.INFO')/customerinfo/phone
416-555-3376
5 record(s) selected.
db2 =>
Figure 6.20
Executing the query from Figure 6.17 in the DB2 Command Line Processor
6.5
How to Execute XPath in DB2
Remember that each step in a path expression produces a sequence of so-called context nodes that are input to the next step. In the same manner, the db2-fn:xmlcolumn() function produces a sequence of XML documents that are input to the first step of the XPath expression. Hence, the XPath /customerinfo/phone is evaluated once for each document in the table. The result items from all documents, in this case phone elements, are combined into a single sequence. Each item is returned to the client as a separate row.
DB2 also offers the function db2-fn:sqlquery(), which is similar to db2-fn:xmlcolumn(). While db2-fn:xmlcolumn() takes an XML column name as input and produces the sequence of all documents in that column as output, the function db2-fn:sqlquery() takes an SQL query as input and produces as output the sequence of documents that are returned by that SQL statement. This SQL query can be any query, even with joins and subselects and so on, as long as it returns a single column of type XML. Figure 6.21 is a simple example of a query that returns a sequence of documents that are a subset of the documents in the XML column info.
xquery db2-fn:sqlquery("SELECT info FROM customer WHERE id > 1003")
Figure 6.21
Producing a sequence of documents with an SQL query
The key difference between db2-fn:xmlcolumn() and db2-fn:sqlquery() is that db2-fn:xmlcolumn() takes all documents in an XML column as the input for your XPath expression, while db2-fn:sqlquery() allows you to use relational predicates and so on to pre-filter the set of documents that are input to the XPath query. The embedded SQL statement is parsed by DB2’s SQL parser, which means that table and column names are automatically converted to uppercase.
You can append any path expression to the db2-fn:sqlquery() function to further process the returned documents. In Figure 6.22, the XPath expression /customerinfo/phone is applied to the one XML document that is identified by the embedded SQL statement.
db2 => xquery db2-fn:sqlquery("select info from customer where id = 1003")/customerinfo/phone
905-555-7258
416-555-2937
905-555-8743
3 record(s) selected.
db2 =>
Figure 6.22
Using db2-fn:sqlquery in the DB2 Command Line Processor
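The evaluation model described above can be sketched in a few lines of Python. This is an illustration with the standard library's ElementTree module and not DB2's implementation. The path is applied to every input document in turn, and the matches are concatenated into one sequence. The list ROWS is a hypothetical stand-in for the customer table, and the documents are trimmed to the elements needed here. The work phone number of customer 1004 is a placeholder, since only part of that output is shown in Figure 6.20.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the customer table: (id, info) rows.
ROWS = [
    (1003, '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
           '<phone type="work">905-555-7258</phone>'
           '<phone type="home">416-555-2937</phone>'
           '<phone type="cell">905-555-8743</phone></customerinfo>'),
    (1004, '<customerinfo Cid="1004"><name>Matt Foreman</name>'
           '<phone type="work">905-555-4789</phone>'
           '<phone type="home">416-555-3376</phone>'
           '<assistant><name>Gopher Runner</name>'
           '<phone type="home">416-555-3426</phone></assistant>'
           '</customerinfo>'),
]


def evaluate(documents, path):
    """Apply a path to each document and combine all matches in one sequence."""
    sequence = []
    for text in documents:
        # A synthetic parent plays the role of the document node, so that
        # the path can start with the top-level element name.
        document_node = ET.Element("document-node")
        document_node.append(ET.fromstring(text))
        sequence.extend(document_node.findall(path))
    return sequence


# Similar in spirit to db2-fn:xmlcolumn: all documents are input.
all_documents = [info for _, info in ROWS]
# Similar in spirit to db2-fn:sqlquery: a relational predicate pre-filters.
filtered_documents = [info for cid, info in ROWS if cid == 1003]

all_phones = [e.text for e in evaluate(all_documents, "customerinfo/phone")]
filtered_phones = [e.text for e in evaluate(filtered_documents, "customerinfo/phone")]
print(len(all_phones), filtered_phones)
```

Each entry in the resulting list corresponds to one row that the CLP would display.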
Chapter 6
Querying XML Data: Introduction and XPath
You can run any XPath expression that you see in this chapter simply by appending it to the db2-fn:xmlcolumn() or db2-fn:sqlquery() functions and using the xquery keyword, as illustrated in the preceding figure. In the following sections we explain further features of the XPath language and provide more examples. All of them can be run in DB2 for Linux, UNIX, and Windows just like you see in Figure 6.20 and Figure 6.22.
6.6
WILDCARDS AND DOUBLE SLASHES
XPath allows the use of * as a wildcard character to match any element name, and @* to match any attribute name. The XPath expression in Figure 6.23 uses the wildcard to return all elements that are immediate children of the assistant element. The assistant element occurs only in the second of the two documents and has two child elements, name and phone.
XPath:   /customerinfo/assistant/*
Output:  Gopher Runner
         416-555-3426
Figure 6.23
Using a wildcard to select all child elements of assistant
The wildcard in the XPath expression in Figure 6.24 matches all elements that occur directly under customerinfo. These are the elements name, addr, phone and in the second document also assistant. The sequence of these elements is input to the last step of this XPath, /name. In other words, the XPath then tries to find /customerinfo/name/name, /customerinfo/addr/name, /customerinfo/phone/name, and /customerinfo/assistant/name. The first three of these don’t exist and so only the assistant’s name is returned.
XPath:   /customerinfo/*/name
Output:  Gopher Runner
Figure 6.24
Using a wildcard to match any child element of customerinfo
The query in Figure 6.25 uses two wildcards, one to match any element at the second level of the document hierarchy and one to match any element at the third level. The first wildcard matches name, addr, phone, and assistant, as in the previous example. The next wildcard then matches any child elements of these nodes. Only addr and assistant have child elements and all of those are returned. The last two elements in the result, name and phone, are children of assistant, which exists only for one of the two input documents. Customer phone elements are not included in the result, because they are at the second instead of the third level of the document. The XPath expression /*/*/* would return the same result from the sample data.
XPath:   /customerinfo/*/*
Output:  845 Kean Street
         Aurora
         Ontario
         N8X 7F8
         1596 Baseline
         Toronto
         Ontario
         M3Z 5H9
         Gopher Runner
         416-555-3426
Figure 6.25
Using wildcards to return any element on the third level of the document
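If you want to experiment with the three wildcard expressions outside of DB2, the following Python sketch runs them with the standard library's ElementTree module, which supports a small XPath subset that includes the * wildcard. It is an illustration only. The two documents are reconstructed from the values shown in this chapter's figures; the element names prov-state and pcd-zip, the work phone number of customer 1004, and the type attribute of the assistant's phone are assumptions.

```python
import xml.etree.ElementTree as ET

# Sample documents reconstructed from the figures; some names are assumed.
DOCUMENTS = [
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr country="Canada"><street>845 Kean Street</street>'
    '<city>Aurora</city><prov-state>Ontario</prov-state>'
    '<pcd-zip>N8X 7F8</pcd-zip></addr>'
    '<phone type="work">905-555-7258</phone>'
    '<phone type="home">416-555-2937</phone>'
    '<phone type="cell">905-555-8743</phone></customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name>'
    '<addr country="Canada"><street>1596 Baseline</street>'
    '<city>Toronto</city><prov-state>Ontario</prov-state>'
    '<pcd-zip>M3Z 5H9</pcd-zip></addr>'
    '<phone type="work">905-555-4789</phone>'
    '<phone type="home">416-555-3376</phone>'
    '<assistant><name>Gopher Runner</name>'
    '<phone type="home">416-555-3426</phone></assistant></customerinfo>',
]


def evaluate(path):
    """Return the text of every element that the path selects in any document."""
    values = []
    for text in DOCUMENTS:
        # A synthetic parent stands in for the document node.
        document_node = ET.Element("document-node")
        document_node.append(ET.fromstring(text))
        values.extend(element.text for element in document_node.findall(path))
    return values


children_of_assistant = evaluate("customerinfo/assistant/*")   # Figure 6.23
names_one_level_down = evaluate("customerinfo/*/name")         # Figure 6.24
third_level_elements = evaluate("customerinfo/*/*")            # Figure 6.25
print(children_of_assistant)
print(names_one_level_down)
print(len(third_level_elements))
```

Note that each * stands for exactly one level, so customer phone elements at the second level never appear in third_level_elements.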
While * matches any element name, @* matches any attribute. The XPath in Figure 6.26 is similar to the one in Figure 6.25, but it returns any attribute at the third level of the documents because it uses @* instead of * in the last step of the path expression. Additionally, the data() function is used to return just the value of each attribute node. The sample data contains two attributes on the third level of the document, /customerinfo/addr/@country and /customerinfo/phone/@type. The addr and phone elements are matched by the * in the second step of the XPath, and their attributes are matched by @* in the third step. Attributes of the assistant phone elements are not returned because they are at the fourth level.
XPath:   /customerinfo/*/data(@*)
Output:  Canada
         work
         home
         cell
         Canada
         work
         home
Figure 6.26
Using wildcards to return any attribute on the third level of the document
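Python's ElementTree does not support @* or data(), but the logic of the attribute wildcard is easy to spell out by hand. The sketch below is an illustration and not DB2 behavior. For each child of customerinfo (the * step) it collects the values of all attributes of that child (the @* step followed by data()). The documents are trimmed to the parts that carry attributes, and the type value of the assistant's phone is an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed sample documents; only the attribute-bearing parts matter here.
DOCUMENTS = [
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr country="Canada"><city>Aurora</city></addr>'
    '<phone type="work">905-555-7258</phone>'
    '<phone type="home">416-555-2937</phone>'
    '<phone type="cell">905-555-8743</phone></customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name>'
    '<addr country="Canada"><city>Toronto</city></addr>'
    '<phone type="work">905-555-4789</phone>'
    '<phone type="home">416-555-3376</phone>'
    '<assistant><name>Gopher Runner</name>'
    '<phone type="home">416-555-3426</phone></assistant></customerinfo>',
]


def third_level_attribute_values(documents):
    """Mimic /customerinfo/*/data(@*) for a list of serialized documents."""
    values = []
    for text in documents:
        root = ET.fromstring(text)
        if root.tag != "customerinfo":
            continue
        for child in root:                          # the * step
            values.extend(child.attrib.values())    # the @* step plus data()
    return values


attribute_values = third_level_attribute_values(DOCUMENTS)
print(attribute_values)
```

The Cid attribute of customerinfo is on the second level and the type attribute of the assistant's phone is on the fourth level, so neither shows up in the result.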
The examples clarify that a * is a wildcard for a tag name at a very specific level of the XML documents, and you need to use multiple wildcards to match arbitrary tags at multiple levels. Another XPath construct that makes queries more general is the double slash (//). You can use it to reach descendants at any level in a document tree. An example is shown in Figure 6.27. The difference between a single slash (/) and a double slash (//) is that a / navigates exactly one level further down in the document tree while a // navigates any number of levels down the tree. In other words, a / navigates to an immediate child node while a // navigates to all descendant nodes. Descendant nodes include child nodes, grandchild nodes, great-grandchild nodes, and so on.
The XPath expression in Figure 6.27 consists of two steps. The first step navigates to the top-level element customerinfo. All customerinfo nodes are input (context) for the second step. The second step, //name, looks for name elements at any level in the document tree under a customerinfo node. It finds two name elements at the second level, /customerinfo/name, and one name element at the third level, /customerinfo/assistant/name.
XPath:   /customerinfo//name
Output:  Robert Shoemaker
         Matt Foreman
         Gopher Runner
Figure 6.27
Selecting name elements at any level under customerinfo
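The difference between / and // can also be tried out with Python's ElementTree module, whose XPath subset supports the double slash. The sketch below is an illustration and not DB2's implementation, and the documents are trimmed to the elements that matter for this comparison.

```python
import xml.etree.ElementTree as ET

# Trimmed sample documents: customer names plus the assistant of customer 1004.
DOCUMENTS = [
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr country="Canada"><city>Aurora</city></addr></customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name>'
    '<addr country="Canada"><city>Toronto</city></addr>'
    '<assistant><name>Gopher Runner</name></assistant></customerinfo>',
]


def evaluate(path):
    """Return the text of every element that the path selects in any document."""
    values = []
    for text in DOCUMENTS:
        # A synthetic parent stands in for the document node.
        document_node = ET.Element("document-node")
        document_node.append(ET.fromstring(text))
        values.extend(element.text for element in document_node.findall(path))
    return values


child_names = evaluate("customerinfo/name")        # single slash: children only
descendant_names = evaluate("customerinfo//name")  # double slash: any level
names_anywhere = evaluate(".//name")               # corresponds to //name
print(child_names)
print(descendant_names)
```

Comparing child_names with descendant_names shows the danger discussed next: the double slash also picks up the assistant's name.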
Figure 6.27 shows some of the benefits and some of the dangers of the //. A benefit is that the // allows you to easily navigate to all occurrences of a certain element, even if that element occurs at multiple different levels of a document tree. Another benefit can be that it allows you to find a certain element in the documents even if you do not know its exact position and therefore are unable to write a fully qualified XPath. A danger of the // can be that it might select more data than you actually intended. If the goal of the query in Figure 6.27 was to retrieve customer names only, then the result leads you to believe that there are three customers and that Gopher Runner is one of them. This is incorrect because Gopher Runner is the assistant to Matt Foreman and not a customer himself. Another disadvantage of the // is that it doesn’t specify a direct path to the desired nodes. This causes an XPath processor, such as DB2, to search exhaustively through potentially large portions of a document. For example, the query in Figure 6.27 requires DB2 to navigate into the addr branch of each document and examine each child element of addr to determine whether its element name is name. A fully specified path without // avoids this overhead and yields better performance. The // can also be used at the beginning of a path expression, such as //name, which for the sample data returns the same result as the query in Figure 6.27. The XPath //* returns all elements from all input documents, because // navigates to any level of the document and * matches any element at each of those levels. Similarly //data(@*) returns all attribute values anywhere in the documents, and //text() returns all text nodes. Use such general expressions with caution.
6.7
XPATH PREDICATES
The preceding XPath examples always return all matching nodes from the input documents. In many cases it is desirable to use search conditions (predicates) to filter the data and only return selected items. In XPath, predicates are always enclosed in square brackets and can appear in any
step of the path. In Figure 6.28, a predicate in square brackets is applied to the customerinfo element, which is the first step of the path. Roughly speaking, this query returns the name of the customer(s) whose Cid attribute is 1004. More precisely, the predicate checks for each customerinfo element in the input data, whether the element has an attribute by the name of Cid and whether the value of that attribute is 1004. If such a Cid attribute does not exist or if its value is not 1004, the respective customerinfo element is excluded from further consideration. Based on our input data, only the customerinfo element in the second document passes this test. This element is now the context for the next steps of the navigation, /name/text(), and the value Matt Foreman is returned.
XPath:   /customerinfo[@Cid=1004]/name/text()
Output:  Matt Foreman
Figure 6.28
Numeric predicate in an XPath expression
Instead of the equality comparison you can also use less than (<), greater than (>), less than or equal (<=), greater than or equal (>=), and not equal (!=). More details on comparison operators are provided in section 6.8. In Figure 6.29, the predicate in square brackets is applied to the addr element to return the streets of those customers who live in Toronto. If an addr element has a child element city whose value is Toronto, the addr element is used as the context for the next navigation step, /street.
XPath:   /customerinfo/addr[city="Toronto"]/street
Output:  1596 Baseline
Figure 6.29
String predicate in an XPath expression
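Both kinds of predicate can be approximated with Python's ElementTree module, which is used here purely for illustration. One difference is worth noting: ElementTree only knows string comparisons, so the attribute predicate below is written as [@Cid='1004'] and compares strings, whereas [@Cid=1004] in DB2 is a numeric comparison. The documents are trimmed to the elements that the two queries touch.

```python
import xml.etree.ElementTree as ET

# Trimmed sample documents for the two predicate examples.
DOCUMENTS = [
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr country="Canada"><street>845 Kean Street</street>'
    '<city>Aurora</city></addr></customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name>'
    '<addr country="Canada"><street>1596 Baseline</street>'
    '<city>Toronto</city></addr></customerinfo>',
]


def evaluate(path):
    """Return the text of every element that the path selects in any document."""
    values = []
    for text in DOCUMENTS:
        # A synthetic parent stands in for the document node.
        document_node = ET.Element("document-node")
        document_node.append(ET.fromstring(text))
        values.extend(element.text for element in document_node.findall(path))
    return values


# Predicate on an attribute of the first step (compare Figure 6.28).
names_for_cid = evaluate("customerinfo[@Cid='1004']/name")
# Predicate on a child element of addr (compare Figure 6.29).
streets_in_toronto = evaluate("customerinfo/addr[city='Toronto']/street")
print(names_for_cid, streets_in_toronto)
```

In both cases the step after the closing bracket continues from the element that carries the predicate, not from the element named inside the brackets.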
Remember that the value of an element is defined as the concatenation of all text nodes in the subtree underneath that element (see section 3.1, Understanding XML Document Trees). Since the city element has only a single text node, the predicates [city="Toronto"] and [city/text()="Toronto"] lead to the same result. Hence, in the vast majority of cases you do not need to use /text() in predicates. The relatively rare case in which /text() can be useful in a predicate is when the immediate children of an element are a mix of element and text nodes. Such elements are said to have mixed content (see section 3.1). If you want to return the city element instead of the street element, a possible XPath is /customerinfo/addr[city="Toronto"]/city. The city element is referenced once to evaluate the predicate and then a second time at the end of the path to return it.
NUMERIC VERSUS STRING COMPARISON
Note that the predicate [@Cid=1004] performs a numeric comparison while the predicate [@Cid="1004"], with double quotes around the literal value, performs a string comparison. The difference between numeric and string comparison can lead to different query results. For example, a string comparison would find that the string values “1E3” and “1000” are not equal. But, a numeric comparison would confirm that the numbers 1E3 and 1000 are equal because 1E3 is the exponential notation for 1000. Similarly, the string comparison “2” < “10” is false, but the numeric comparison 2 < 10 is true. Note also that the numeric comparison [@Cid=1004] fails with an error (SQL16061N) at runtime if a document is encountered where the value of the Cid attribute is not a number.
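The same effects can be reproduced in almost any programming language. The following Python sketch is an analogy and not DB2 code. It compares the values from the note once as strings and once after conversion to numbers, and it shows that the conversion fails for a value that is not a number, much like the runtime error mentioned above. The function names are made up for this example.

```python
def string_equal(left, right):
    """Compare two values character by character, as a string predicate does."""
    return left == right


def numeric_equal(left, right):
    """Convert both values to numbers first, as a numeric predicate does."""
    return float(left) == float(right)


def string_less_than(left, right):
    return left < right


def numeric_less_than(left, right):
    return float(left) < float(right)


print(string_equal("1E3", "1000"), numeric_equal("1E3", "1000"))
print(string_less_than("2", "10"), numeric_less_than("2", "10"))

# A value that is not a number cannot take part in a numeric comparison.
try:
    numeric_equal("abc", "1004")
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)
```

The string comparison of "2" and "10" is false because the comparison proceeds character by character and "2" sorts after "1".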
A predicate expression within the square brackets can contain multiple steps to navigate to the element or attribute whose value you want to check. For example, say you want to return the name of all customers in Toronto. To develop this XPath expression from scratch, first start without the predicate and write down just the path to the element that you want to return:
/customerinfo/name
To restrict the result to customers in Toronto, a predicate on the city element is required. The city element is a child of the addr element, which in turn is a child of customerinfo, so this is where you need to apply the predicate:
/customerinfo[addr/city="Toronto"]/name
The predicate [addr/city="Toronto"] checks for each customerinfo element if it has a child element addr that has a child element city whose value is Toronto. The customerinfo nodes that fulfill this condition are then the input for the next step, /name. In other words, the XPath step right after the predicate is /name and it continues navigation based on the element before the predicate (customerinfo) and not based on any element inside the square brackets. This is illustrated in Figure 6.30, where this XPath expression is shown with two branches. The horizontal branch identifies the items that are to be returned (/customerinfo/name), and the branch in the dotted box is the predicate.
[Diagram: a horizontal branch from customerinfo to name identifies the items to return; a second branch from customerinfo to addr to city = "Toronto", drawn in a dotted box, is the predicate.]
Figure 6.30
Visualization of an XPath with a predicate
One XPath can contain multiple predicates, as illustrated in Figure 6.31, which returns the street of the customer whose name is Matt Foreman and whose city is Toronto.
XPath:   /customerinfo[name="Matt Foreman"]/addr[city="Toronto"]/street
Output:  1596 Baseline
Figure 6.31  XPath with two predicates
When writing such a query from scratch, proper placement of the predicates is sometimes not obvious if you are new to XPath. The recommendation is again to first write the XPath without any predicates and only navigate to the element that you want to return (street). This simpler XPath looks like this: /customerinfo/addr/street
Now you can add filtering predicates for name and city. Since name is a child element of customerinfo, insert a pair of square brackets right after customerinfo for the predicate: /customerinfo[name="Matt Foreman"]/addr/street
The city element is a child of addr, so the square brackets for the second predicate come right after addr in the path expression, and this completes the query in Figure 6.31: /customerinfo[name="Matt Foreman"]/addr[city="Toronto"]/street
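The three construction steps can be replayed with Python's standard xml.etree.ElementTree module, which supports a limited XPath subset. This is a sketch under assumptions, not DB2 code: the sample data is a hypothetical reconstruction from the values quoted in this chapter, the docs wrapper element exists only so that both documents can be searched with one relative path, and the helper name texts is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the two sample documents. The <docs> wrapper
# is an assumption of this sketch, not part of the sample data.
DOCS = ET.fromstring("""
<docs>
  <customerinfo>
    <name>Robert Shoemaker</name>
    <addr><street>845 Kean Street</street><city>Aurora</city></addr>
  </customerinfo>
  <customerinfo>
    <name>Matt Foreman</name>
    <addr><street>1596 Baseline</street><city>Toronto</city></addr>
  </customerinfo>
</docs>
""")

def texts(path):
    """Return the text of every element selected by an ElementTree path."""
    return [e.text for e in DOCS.findall(path)]

# Step 1: navigate to the element to return, without any predicate.
step1 = texts("customerinfo/addr/street")
# Step 2: add the predicate on name right after customerinfo.
step2 = texts("customerinfo[name='Matt Foreman']/addr/street")
# Step 3: add the predicate on city right after addr.
step3 = texts("customerinfo[name='Matt Foreman']/addr[city='Toronto']/street")

print(step1, step2, step3)
```

Each predicate sits directly after the element whose children it tests, which mirrors the placement rule described above.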
Again, visualizing this query as a branching expression might be helpful (see Figure 6.32).

[Figure 6.32: Visualization of an XPath with two predicates. Main branch: customerinfo, addr, street. Predicate on customerinfo: name = "Matt Foreman". Predicate on addr: city = "Toronto".]
Note that a predicate expression in square brackets can contain a / or a // but typically never starts with a / or a //. Consider the following XPath expression as an example: /customerinfo[/name="Matt Foreman"]/addr/street
This XPath returns the empty sequence because the predicate [/name="Matt Foreman"] does not use the current customerinfo element as context. That is, it does not look for name elements that are children of customerinfo. Instead, the / inside the square brackets causes it to
restart navigation at the very top of each document, but there is no document in the sample data where the topmost element is name.

Figure 6.33 shows what can happen if you use // right at the beginning of a predicate expression in square brackets. The intention of this query was to return all cell phones by looking at type attributes anywhere under phone. However, the // inside the square brackets causes it to restart navigation at the very top of each document. Hence, the actual meaning of this query is: Retrieve all phone elements from a document if a type attribute with the value “cell” occurs anywhere in the document. In other words, return all phone elements if one of them is a cell phone.

XPath:   /customerinfo/phone[//@type="cell"]
Output:  905-555-7258
         416-555-2937
         905-555-8743
Figure 6.33  Incorrect use of // in a predicate
If you know that the type attribute is a child of phone, you could simply remove the // from the beginning of the predicate expression. Otherwise you can use a dot to force the // to only search within the subtree (within the current context) of the respective phone element (see Figure 6.34). The current context is explained in more detail in section 6.10.

XPath:   /customerinfo/phone[.//@type="cell"]
Output:  905-555-8743
Figure 6.34  Correct use of // in a predicate
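The contrast between //@type and .//@type inside a predicate can be modeled in a few lines of Python. The sketch below does not run XPath at all; it imitates the two starting points of the search with plain loops. The sample phone elements are a hypothetical reconstruction, the type values of Matt Foreman's phones are assumed, and has_cell is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the phone elements in the two sample
# documents; the type values for the second document are assumed.
DOCUMENTS = [ET.fromstring(text) for text in (
    """<customerinfo>
         <phone type="work">905-555-7258</phone>
         <phone type="home">416-555-2937</phone>
         <phone type="cell">905-555-8743</phone>
       </customerinfo>""",
    """<customerinfo>
         <phone type="work">905-555-4789</phone>
         <phone type="home">416-555-3376</phone>
       </customerinfo>""",
)]

def has_cell(start):
    """True if start or any of its descendants carries type="cell"."""
    return any(e.get("type") == "cell" for e in start.iter())

# Models phone[//@type="cell"]: the test restarts at the document root,
# so its outcome is the same for every phone in a document.
from_root = [p.text for doc in DOCUMENTS
             for p in doc.findall("phone") if has_cell(doc)]

# Models phone[.//@type="cell"]: the test starts at the current phone.
from_context = [p.text for doc in DOCUMENTS
                for p in doc.findall("phone") if has_cell(p)]

print(from_root)
print(from_context)
```

Searching from the root returns all three phones of the document that contains a cell phone, while searching from the current phone returns only the cell phone itself.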
Also note that the opening square bracket of a predicate can never follow immediately after a / or a //. For example, the XPath /customerinfo/[name="Matt Foreman"] would fail with an error (SQL16002N). A / starts a new step, which cannot begin with a predicate. A predicate always has to be preceded by a context node (such as an element name) to which it is applied.

And finally, look at Figure 6.35, which uses an equality comparison without square brackets. This is just a Boolean expression of the form A = B that returns either true or false. It is not a useful predicate to select specific parts of the customer data. In particular, this query does not return the customer whose name is Matt Foreman. The query examines a sequence of name elements and returns true if at least one of them is equal to Matt Foreman. This is called existential semantics and is explained in the next section.

XPath:   /customerinfo/name="Matt Foreman"
Output:  true
Figure 6.35  A Boolean expression, not a filtering predicate
6.8
EXISTENTIAL SEMANTICS
When you use XPath, existential semantics (also known as existential quantification) is applied automatically all the time. Roughly speaking, existential semantics means that the existence of at least one matching node is sufficient for a predicate to evaluate to true.

Let’s look at the query in Figure 6.36 as an example. This query returns the name of those customers whose phone number is 416-555-2937. But, both of the input documents contain several occurrences of the phone element. Existential semantics means that the query in Figure 6.36 returns name elements that are children of customerinfo elements that contain at least one child element phone whose value is 416-555-2937. The existence of at least one matching phone element is sufficient to fulfill the predicate. Existential semantics is a useful concept for querying XML data, because it defines how to evaluate predicates on repeating elements (or more generally, on sequences of two or more items).

XPath:   /customerinfo[phone="416-555-2937"]/name
Output:  Robert Shoemaker
Figure 6.36  At least one phone element must match, not all of them
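The "at least one" rule can be observed with Python's standard xml.etree.ElementTree module, whose limited XPath subset treats a child-value predicate the same way. The sketch is an illustration under assumptions rather than DB2 code: the data is a hypothetical reconstruction from the phone numbers quoted in this chapter, and the docs wrapper element is introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the sample data; the <docs> wrapper is an
# assumption of this sketch.
DOCS = ET.fromstring("""
<docs>
  <customerinfo>
    <name>Robert Shoemaker</name>
    <phone>905-555-7258</phone>
    <phone>416-555-2937</phone>
    <phone>905-555-8743</phone>
  </customerinfo>
  <customerinfo>
    <name>Matt Foreman</name>
    <phone>905-555-4789</phone>
    <phone>416-555-3376</phone>
  </customerinfo>
</docs>
""")
customers = DOCS.findall("customerinfo")

# The path predicate keeps a customerinfo element as soon as one of its
# phone children has the requested value.
by_path = [n.text for n in
           DOCS.findall("customerinfo[phone='416-555-2937']/name")]

# The same test spelled out: "at least one phone equals the value".
by_any = [c.findtext("name") for c in customers
          if any(p.text == "416-555-2937" for p in c.findall("phone"))]

# For contrast: requiring all phones to match selects nobody here.
by_all = [c.findtext("name") for c in customers
          if all(p.text == "416-555-2937" for p in c.findall("phone"))]

print(by_path, by_any, by_all)
```

The any() formulation is the closest plain-Python reading of existential semantics, and the all() variant shows what the predicate does not mean.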
Figure 6.37 shows another example of existential semantics. It includes a predicate that contains nothing but the element name assistant. The predicate evaluates to true if this element exists at the indicated position in the document tree; that is, as a child of the customerinfo element. As a result, this query returns the name of those customers who have an assistant, no matter what the assistant name or phone number is. The mere existence of an assistant element is what this predicate is looking for. Such a predicate is called a structural predicate as opposed to a value predicate, which performs a value comparison.

XPath:   /customerinfo[assistant]/name
Output:  Matt Foreman
Figure 6.37  A structural predicate
Similarly, you can check for the existence of an attribute. The query in Figure 6.38 retrieves the names of all customers who have a country attribute in the addr element.

XPath:   /customerinfo[addr/@country]/name
Output:  Robert Shoemaker
         Matt Foreman
Figure 6.38  Return the name if a country attribute exists
Yet another example of existential semantics is illustrated in Figure 6.39 where the right side of the predicate is a sequence of two atomic values. This predicate is true if there is at least one value in this sequence that is equal to the value of the city element. If you are familiar with IN-list queries in SQL, this is how you can do the same in XPath.
XPath:   /customerinfo[addr/city = ("Toronto","Aurora")]/name
Output:  Robert Shoemaker
         Matt Foreman
Figure 6.39  Predicate is true if at least one of the values matches
What if a customer has several addresses so that addr/city evaluates to a sequence of multiple city elements? In this case, existential semantics defines that the predicate is true if at least one of these city elements is equal to at least one of the values on the right side.

Let’s look at the two sequences (1,2,3,4) and (7,8,2). The comparison (1,2,3,4) = (7,8,2) evaluates to true because there is at least one item in the first sequence that is equal to at least one item in the second sequence. This item is the number 2. What might seem counterintuitive at first is that the predicate (1,2,3,4) != (7,8,2) also evaluates to true! This is again due to existential semantics, because there is at least one item in the first sequence that is not equal to at least one item in the second sequence.

Figure 6.40 shows the corresponding behavior for the sample data. Remember that Robert Shoemaker lives in Aurora and Matt Foreman lives in Toronto (see Figure 6.7). The XPath in Figure 6.40 returns Robert Shoemaker’s name because his city (Aurora) is not equal to at least one item in the sequence on the right (Toronto). The same applies to Matt Foreman whose city (Toronto) is not equal to Aurora.

XPath:   /customerinfo[addr/city != ("Toronto","Aurora")]/name
Output:  Robert Shoemaker
         Matt Foreman
Figure 6.40  Predicate is true if at least one of the values does not match
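The behavior of = and != on sequences can be written down as a small Python model. This is a sketch of the rule as described in this section, not DB2 code; the function names general_eq and general_ne are hypothetical, and the city values are the ones stated in the text for the two sample customers.

```python
from itertools import product

def general_eq(left, right):
    """Model of '=': true if at least one pair of items is equal."""
    return any(a == b for a, b in product(left, right))

def general_ne(left, right):
    """Model of '!=': true if at least one pair of items is unequal."""
    return any(a != b for a, b in product(left, right))

eq_result = general_eq((1, 2, 3, 4), (7, 8, 2))  # the pair (2, 2) is equal
ne_result = general_ne((1, 2, 3, 4), (7, 8, 2))  # the pair (1, 7) is unequal

# City values as stated in the text for the two sample customers.
cities = {"Robert Shoemaker": ["Aurora"], "Matt Foreman": ["Toronto"]}

# != against a sequence of two values: both customers qualify.
ne_two_values = [name for name, city in cities.items()
                 if general_ne(city, ["Toronto", "Aurora"])]

# != against a single value: only the customer outside Toronto qualifies.
ne_one_value = [name for name, city in cities.items()
                if general_ne(city, ["Toronto"])]

print(eq_result, ne_result, ne_two_values, ne_one_value)
```

Because both functions quantify over pairs, an empty sequence on either side makes both of them false in this model, which is why = and != are not each other's negation.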
The lesson here is that XPath’s existential semantics is not only applied to equality predicates but also to range and inequality predicates, for which the behavior is not immediately intuitive if the left side or the right side evaluates to a sequence of more than one item. By contrast, the predicate in Figure 6.41 only involves sequences of exactly one item on either side of the != operator. The behavior is intuitive and only Robert Shoemaker’s name is returned because he is the only customer in our sample who does not live in Toronto.

XPath:   /customerinfo[addr/city != "Toronto"]/name
Output:  Robert Shoemaker
Figure 6.41  Not-equal predicate on single items

6.9
LOGICAL EXPRESSIONS WITH AND, OR, NOT()
Similarly to SQL, XPath allows you to build more complex predicates with and, or, and not(). While and and or are logical operators, not() is a function that reverses the Boolean value of its argument. XPath and XQuery are case-sensitive languages and all operators and functions have to be written in lowercase.
The query in Figure 6.42 uses the or operator to check whether there is an addr with a city element that has the value Toronto, or if there is an addr with a city element whose value is Aurora. For the sample data, this returns the same result as in Figure 6.39. Note that when we say “if there is” or “if there exists” we are hinting at the fact that existential semantics is always at play.

XPath:   /customerinfo[addr/city = "Toronto" or addr/city ="Aurora"]/name
Output:  Robert Shoemaker
         Matt Foreman
Figure 6.42  Disjunction of predicates (or-’ing)
The and operator is used in Figure 6.43 to select the names of customers whose city is Aurora and whose country is Canada.

XPath:   /customerinfo[addr/city = "Aurora" and addr/@country = "Canada"]/name
Output:  Robert Shoemaker
Figure 6.43  Conjunction of predicates (and-’ing)
The predicate in Figure 6.43 checks whether there is an addr element with a city child that has the value Aurora, and whether there is also an addr element with a country attribute whose value is Canada. In this case, both conditions are fulfilled by one and the same addr element. In general, however, they could be fulfilled by two different addr elements; for example, if a customer had two addresses.

This alludes to the next interesting example. You might write the query in Figure 6.44 to find a customer whose work phone number is 416-555-2937. Such a customer does not exist in our sample data, because 416-555-2937 is Robert Shoemaker’s home phone number, not his work phone number. The predicate restricts the value of the phone element to 416-555-2937, and the type attribute of the phone element to work. Still, the name Robert Shoemaker is returned. This is because existential semantics applies to both parts of the predicate. The first part of the predicate, phone = "416-555-2937", is true because there is a phone element whose value is 416-555-2937. The second part of the predicate, phone/@type = "work", is also true because there also is a phone element whose type is work. But, these two phone elements are not the same. The query result in Figure 6.44 is perfectly correct according to the existential semantics of XPath, but probably not what you wanted to achieve with this query.

XPath:   /customerinfo[phone = "416-555-2937" and phone/@type = "work"]/name
Output:  Robert Shoemaker
Figure 6.44  Two predicates matched by different phone elements!
To solve this issue you need to express the predicate such that both conditions are applied to the same phone element. One way of doing this is shown in Figure 6.45 where nested square brackets are used. The outer square brackets describe a predicate that is applied to the customerinfo elements. This predicate says that a customerinfo element should only be considered if a certain phone element exists among its children. The inner square brackets are used to further constrain these phone elements by applying a predicate to them. The inner predicate [text() = "416-555-2937" and @type = "work"] says that the text value of the phone element has to be 416-555-2937 and the type of the same phone element is work. Both parts of this inner predicate are always applied together to the same phone element. Since no such customer exists in our sample data, the correct result of the query is empty.

XPath:   /customerinfo[phone[text() = "416-555-2937" and @type = "work"]]/name
Output:  (empty)
Figure 6.45  Nested predicates
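The difference between the flat predicate and the nested predicate comes down to where the "at least one" test is applied. The following Python sketch models both readings with any(); it is not DB2 code, and the phone elements are a hypothetical reconstruction of Robert Shoemaker's document based on the values and types mentioned in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of Robert Shoemaker's phone elements.
customer = ET.fromstring("""
<customerinfo>
  <name>Robert Shoemaker</name>
  <phone type="work">905-555-7258</phone>
  <phone type="home">416-555-2937</phone>
  <phone type="cell">905-555-8743</phone>
</customerinfo>
""")
phones = customer.findall("phone")

# Models [phone = "416-555-2937" and phone/@type = "work"]:
# each condition is tested existentially on its own, so two different
# phone elements can satisfy the two halves.
independent = (any(p.text == "416-555-2937" for p in phones)
               and any(p.get("type") == "work" for p in phones))

# Models [phone[text() = "416-555-2937" and @type = "work"]]:
# both conditions must hold for one and the same phone element.
same_element = any(p.text == "416-555-2937" and p.get("type") == "work"
                   for p in phones)

print(independent, same_element)
```

In the first form there are two independent any() tests, in the second a single any() over a combined condition, which is what the nested square brackets express.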
Figure 6.46 provides another example of the use of the or operator. It returns the names of the customers who have an assistant or a cell phone. Both of the customers are returned because one of them has a cell phone and the other has an assistant.

XPath:   /customerinfo[assistant or phone/@type="cell"]/name
Output:  Robert Shoemaker
         Matt Foreman
Figure 6.46  A structural predicate and a value predicate
The XPath expression in Figure 6.47 lists the names of those customers who don’t have an assistant. The not() function is used in the predicate to qualify the customerinfo elements that do not have a child element with the name assistant.

XPath:   /customerinfo[not(assistant)]/name
Output:  Robert Shoemaker
Figure 6.47  Checking for the non-existence of an element
Next, let’s look at the following pair of queries (see Figure 6.48 and Figure 6.49) to clarify the difference between using the not() function and the “not equal” comparison operator (!=). Due to existential semantics, the query in Figure 6.48 returns the names of both customers. This is because both of them have at least one phone number that is not equal to 416-555-2937. One such non-matching phone element is enough to fulfill the predicate, even if other phone elements exist that do match this number.
The query in Figure 6.49 returns a result that might be more desirable: the name of the customer who does not have any phone element with the value 416-555-2937. The equality predicate inside the not() function is subject to existential semantics; that is, at least one phone element with this specific number has to exist. The outcome of this test is then negated with the not() function. In other words, the two queries differ because:
• The query in Figure 6.48 checks whether there is at least one phone that is not equal to 416-555-2937 (even if other phone elements are equal to this value).
• The query in Figure 6.49 checks whether there is not at least one phone that is equal to 416-555-2937 (that is, there is no phone that is equal to this value).

XPath:   /customerinfo[phone != "416-555-2937"]/name
Output:  Robert Shoemaker
         Matt Foreman
Figure 6.48  Predicate is true if at least one phone element does not match

XPath:   /customerinfo[not(phone = "416-555-2937")]/name
Output:  Matt Foreman
Figure 6.49  Predicate is true if none of the phone elements match

6.10
THE CURRENT CONTEXT AND THE PARENT STEP
You probably know that in a file system the dot (.) denotes the current location in the file system, and two dots (..) refer to the parent directory. The same notation exists in XPath to refer to the current node when navigating a document tree, or to the parent of the current node. This is illustrated in Figure 6.50, which shows four versions of an XPath expression. All of them return the same result from our input data; that is, the name element of the customers who live in Aurora. For the discussion of these four XPath expressions you may want to refer to the document tree shown in section 3.1, Understanding XML Document Trees. Also, remember that the node name right before the square brackets of a predicate determines the input to the predicate and to the step that immediately follows the predicate. For example, XPath (a) in Figure 6.50 first produces a sequence of customerinfo elements. For each of these customerinfo elements the predicate checks whether there is an addr element that has a child element city whose value is Aurora. If so, the respective customerinfo element is input to the final step, /name, which returns the child element name. XPath (b) is different because the predicate is applied to addr, not to customerinfo. Hence, this XPath first produces a sequence of addr elements, which are input to the predicate. Any addr element that has a child element city with value Aurora is then input to the subsequent
step after the predicate. Since we want to return name elements, we need to navigate from addr to name, which are siblings in our documents. Because an XML document tree has no direct links between siblings, we use the parent step (..) to go one level up in the tree to their common parent, and from there to name.
XPath (a): /customerinfo[addr/city = "Aurora"]/name
XPath (b): /customerinfo/addr[city = "Aurora"]/../name
XPath (c): /customerinfo/addr/city[. = "Aurora"]/../../name
XPath (d): /customerinfo/name[../addr/city = "Aurora"]
Output:    Robert Shoemaker
Figure 6.50  Four different ways to write a predicate and return the name element
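Expressions (b) and (c) can be tried with Python's standard xml.etree.ElementTree module, which understands the parent step (..) and, from Python 3.7 on, the [.='text'] form. This is a sketch under assumptions, not DB2 code: the data is a hypothetical reconstruction, the docs wrapper element is introduced only for this example, and expression (a) is modeled with a plain loop because, as far as we know, the ElementTree subset does not accept a multi-step path such as addr/city inside a predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the sample data; the <docs> wrapper is an
# assumption of this sketch.
DOCS = ET.fromstring("""
<docs>
  <customerinfo>
    <name>Robert Shoemaker</name>
    <addr><street>845 Kean Street</street><city>Aurora</city></addr>
  </customerinfo>
  <customerinfo>
    <name>Matt Foreman</name>
    <addr><street>1596 Baseline</street><city>Toronto</city></addr>
  </customerinfo>
</docs>
""")

def texts(path):
    """Return the text of every element selected by an ElementTree path."""
    return [e.text for e in DOCS.findall(path)]

# (b) predicate on addr, then one parent step up to customerinfo.
xpath_b = texts("customerinfo/addr[city='Aurora']/../name")

# (c) predicate on city itself, then two parent steps up.
xpath_c = texts("customerinfo/addr/city[.='Aurora']/../../name")

# (a) modeled by hand: filter customerinfo, then navigate down to name.
xpath_a = [c.findtext("name") for c in DOCS.findall("customerinfo")
           if any(city.text == "Aurora" for city in c.findall("addr/city"))]

print(xpath_a, xpath_b, xpath_c)
```

All three variants select the same name, but only (a) gets there without climbing back up the tree.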
In XPath (c) the predicate in square brackets is applied to city, which means that this XPath first produces a sequence of city elements, which are used as input (as context nodes) to the predicate. The predicate [. = "Aurora"] uses the dot to refer to the current context, which in this case is always a city element. Any city element for which the predicate is true is then input (context) for the subsequent navigation after the predicate. If you want to return name elements, you need to navigate from city to name, which are in different branches of the document. Hence you need to navigate via the nearest common ancestor, which is customerinfo. Since city is a grandchild of customerinfo, you need to go two levels up in the tree (/../..) before you can reach the name element (/name). XPath (d) is different from (a), (b), and (c) because there is no /name step after the predicate. Instead, XPath (d) first navigates from customerinfo to name to produce a sequence of name elements. The square brackets are applied to name, to filter the names that get returned. The predicate [../addr/city = "Aurora"] means that a name element is returned only if it has a parent that has a child element addr that has a child element city whose value is Aurora. XPath (a) is the most preferable path expression among the four options in Figure 6.50, because it avoids parent steps completely. Avoiding parent steps is good for performance and keeps queries easy to understand. Figure 6.51 shows four more XPath expressions. All of them return empty results because their navigation doesn’t correspond to the structure of the sample data. The parent step in XPath (a) is incorrect for the sample data because it navigates from customerinfo to name with an intermediate parent step as if name was a sibling of customerinfo, which is not the case. XPath (b) tries to return name elements that are children of addr. But, no such name elements exist. 
Similarly, XPath (c) tries to return name elements that are children of the parent of city (that is, children of addr). Again, no such name elements exist. XPath (d) intends to return name elements that have a child element addr with a city whose value is Aurora. But, this predicate is always false for the sample data because addr is not a child of name.
XPath (a): /customerinfo[addr/city = "Aurora"]/../name
XPath (b): /customerinfo/addr[city = "Aurora"]/name
XPath (c): /customerinfo/addr/city[. = "Aurora"]/../name
XPath (d): /customerinfo/name[addr/city = "Aurora"]
Output:    (empty)
Figure 6.51  Four different XPath expressions that don’t match the sample data

6.11
POSITIONAL PREDICATES
So far you have used value predicates and structural predicates. Value predicates compare an element or attribute to a literal value such as a string or a number. Structural predicates don’t look at values but at the structure of an XML document by checking for the existence of an element or attribute by name.

Positional predicates can be used to select nodes based on the order in which they appear in a document or, more generally, in a sequence. As shown in Figure 6.52, a positional predicate is simply an integer number in square brackets. Both documents in the sample data contain multiple phone elements, but this query only returns the first phone element from each document.

XPath:   /customerinfo/phone[1]
Output:  905-555-7258
         905-555-4789
Figure 6.52  Positional predicate to select the first phone element
Similarly, the XPath in Figure 6.53 selects the third phone element under each customerinfo element. In the sample data, the customer Robert Shoemaker has three phone numbers but Matt Foreman has only two phones. Hence, the result only contains Robert’s third phone number and none of Matt’s phone numbers.

XPath:   /customerinfo/phone[3]
Output:  905-555-8743
Figure 6.53  Positional predicate to select the third phone element
To obtain the last phone element from each document irrespective of the number of phone elements in any given document, use the function last() in the predicate. This function takes no arguments but serves as an index to the last item in a sequence (see Figure 6.54).
XPath:   /customerinfo/phone[last()]
Output:  905-555-8743
         416-555-3376
Figure 6.54  Positional predicate to select the last phone element
Related to positional predicates is the function position(). It takes no arguments but returns the position of the context item in the sequence that is being processed. For example, the positional predicate [3] is the same as the predicate [position() = 3].
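Positional predicates are among the XPath features that Python's standard xml.etree.ElementTree module also supports, so the three queries above can be replayed outside of DB2. The sketch rests on assumptions: the data is a hypothetical reconstruction from the phone numbers shown in the figures, the docs wrapper element is introduced only for this example, and position() is imitated with enumerate() because ElementTree does not offer that function.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the sample data; the <docs> wrapper is an
# assumption of this sketch.
DOCS = ET.fromstring("""
<docs>
  <customerinfo>
    <phone>905-555-7258</phone>
    <phone>416-555-2937</phone>
    <phone>905-555-8743</phone>
  </customerinfo>
  <customerinfo>
    <phone>905-555-4789</phone>
    <phone>416-555-3376</phone>
  </customerinfo>
</docs>
""")

def texts(path):
    """Return the text of every element selected by an ElementTree path."""
    return [e.text for e in DOCS.findall(path)]

first = texts("customerinfo/phone[1]")        # first phone per document
third = texts("customerinfo/phone[3]")        # third phone, where one exists
last = texts("customerinfo/phone[last()]")    # last phone per document

# [position() = 3] imitated by hand: positions start at 1, not 0.
third_by_position = [
    phone.text
    for customer in DOCS.findall("customerinfo")
    for position, phone in enumerate(customer.findall("phone"), start=1)
    if position == 3
]

print(first, third, last, third_by_position)
```

Note that the position is counted per parent element, so each document contributes its own first and last phone.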
6.12
UNION AND CONSTRUCTION OF SEQUENCES
Most of the XPath examples so far have returned one type of element, such as phone numbers or names. Sometimes it is desirable to obtain multiple different elements or attributes from each document. This can be achieved with the union operator, which is either written as the union keyword or the pipe character: |. The XPath in Figure 6.55 uses the union operator in the last step of the XPath, to combine the street and city elements into a single sequence. The result contains four elements, street and city from each of the two customers in the sample data. You will later use SQL/XML to return the street and city in two separate columns, which can be a more desirable return format (see Chapter 7, Querying XML Data with SQL/XML).

XPath:   /customerinfo/addr/(street|city)
Output:  845 Kean Street
         Aurora
         1596 Baseline
         Toronto
Figure 6.55  XPath with a union operator
The union of sequences is similar to the construction of sequences. The comma is a sequence constructor and in many cases it produces the same result as a union. For example, the XPath /customerinfo/addr/(street,city) returns the same result as the union in Figure 6.55.

However, there are a couple of differences between union and construction of sequences. First, the comma operator allows you to construct sequences from atomic values. The | operator cannot take atomic values as input; it has to take sequences of element or attribute nodes as input. Second, the union removes duplicate nodes while the comma operator does not. The de-duplicating of the union is based on node identities, not on node values. This means that two elements are not necessarily considered duplicates just because they have the same element name and value. They are considered duplicates only if they are indeed the same element from the same document.
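The two differences can be illustrated with a small Python model. This is a simplified sketch, not DB2 code: the functions comma and union are hypothetical, Python object identity stands in for XML node identity, and the document-order sorting that a real union performs is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the address data of the two sample documents.
DOCS = ET.fromstring("""
<docs>
  <customerinfo>
    <addr><street>845 Kean Street</street><city>Aurora</city></addr>
  </customerinfo>
  <customerinfo>
    <addr><street>1596 Baseline</street><city>Toronto</city></addr>
  </customerinfo>
</docs>
""")

def comma(*sequences):
    """Model of the comma operator: concatenate and keep duplicates."""
    return [item for sequence in sequences for item in sequence]

def union(*sequences):
    """Model of union: keep each node once, judged by identity.

    Simplification: document-order sorting is not modeled here.
    """
    seen, result = set(), []
    for node in comma(*sequences):
        if id(node) not in seen:
            seen.add(id(node))
            result.append(node)
    return result

streets = DOCS.findall("customerinfo/addr/street")

with_duplicates = comma(streets, streets)     # every street appears twice
without_duplicates = union(streets, streets)  # every street appears once

# Same name and same value, but two distinct nodes: not duplicates.
twin_a = ET.fromstring("<city>Toronto</city>")
twin_b = ET.fromstring("<city>Toronto</city>")
twins = union([twin_a], [twin_b])

# The comma operator also accepts atomic values.
atomic = comma((1, 2), (2, 3))

print(len(with_duplicates), len(without_duplicates), len(twins), atomic)
```

The twin city elements survive the union because de-duplication looks at which node it is, not at what the node contains.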
In addition to the union operator there is also an intersect and an except operator. The intersect operator produces the nodes that occur in both sequences, and the except operator returns the nodes that are in the first but not the second sequence.
6.13
XPATH FUNCTIONS
If you look back at Figure 6.1 at the beginning of this chapter, you see that XPath and XQuery do not only share the same data model but also a common set of functions and operators. Throughout this chapter we have used some of these functions such as data(), string(), and not(). XPath and XQuery provide a large number of built-in functions. These include aggregate functions such as count() and sum(), string functions such as contains() and substring(), as well as numeric and other functions.

Figure 6.56, Figure 6.57, and Figure 6.58 provide examples of how to use functions in XPath expressions. The count() function returns the number of nodes produced by the expression that is provided as the function argument. Remember that Robert Shoemaker has three phone numbers and Matt Foreman has two. Other functions such as upper-case() and concat() behave in intuitive ways.

XPath:   /customerinfo/count(phone)
Output:  3
         2
Figure 6.56  Return the number of phone elements per document

XPath:   /customerinfo/upper-case(name)
Output:  ROBERT SHOEMAKER
         MATT FOREMAN
Figure 6.57  Convert the customer names to upper case

XPath:   /customerinfo/concat(name," - ", addr/city)
Output:  Robert Shoemaker - Aurora
         Matt Foreman - Toronto
Figure 6.58  Concatenate the customer name and city
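For readers who think in a general-purpose language, the three function calls have close Python counterparts. The sketch below is an analogy under assumptions, not DB2 code: the data is a hypothetical reconstruction from the values shown in the figures, and len(), str.upper(), and string concatenation merely play the roles of count(), upper-case(), and concat().

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the sample data; the <docs> wrapper is an
# assumption of this sketch.
DOCS = ET.fromstring("""
<docs>
  <customerinfo>
    <name>Robert Shoemaker</name>
    <addr><city>Aurora</city></addr>
    <phone>905-555-7258</phone>
    <phone>416-555-2937</phone>
    <phone>905-555-8743</phone>
  </customerinfo>
  <customerinfo>
    <name>Matt Foreman</name>
    <addr><city>Toronto</city></addr>
    <phone>905-555-4789</phone>
    <phone>416-555-3376</phone>
  </customerinfo>
</docs>
""")
customers = DOCS.findall("customerinfo")

# count(phone): number of phone children per customerinfo element.
phone_counts = [len(c.findall("phone")) for c in customers]

# upper-case(name): the customer name in upper case.
upper_names = [c.findtext("name").upper() for c in customers]

# concat(name, " - ", addr/city): name and city joined by a separator.
name_and_city = [c.findtext("name") + " - " + c.findtext("addr/city")
                 for c in customers]

print(phone_counts, upper_names, name_and_city)
```

As in the XPath versions, each expression is evaluated once per customerinfo element, so every list has one entry per document.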
Section 8.7, XQuery Functions, contains a more extensive discussion of XPath and XQuery functions. Additionally, Appendix C provides pointers to the complete reference of all supported XPath and XQuery functions in DB2 for z/OS and DB2 for Linux, UNIX, and Windows.
6.14
GENERAL AND VALUE COMPARISONS
All the comparison operators that you have used so far (=, !=, <, >, <=, >=) are called general comparisons because they allow you to compare sequences of zero, one, or multiple items. This is based on existential semantics, as discussed in section 6.8. General comparisons provide a lot of flexibility and serve you well in the vast majority of cases.

There are also value comparison operators, such as eq (equal), lt (less than), le (less than or equal), gt (greater than), ge (greater or equal), and ne (not equal). Value comparisons are different from general comparisons because they can only compare single items. For example, /customerinfo/addr[city eq "Toronto"] is a valid value comparison as long as there is only one city element per addr. The query /customerinfo[phone eq "408-463-4963"] will fail at runtime because the sample data contains multiple phone elements per customerinfo. The DB2 error message is:

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "item()" is expected in the context.
The “( item(), item()+ )” is a regular expression that denotes a sequence of one item followed by one or more items. In total that’s two or more items. So this message is a very formal way of saying that there is a sequence of multiple items (that is, multiple phone elements) when only a single item was allowed. In many cases you can work around this error by writing the XPath expression as /customerinfo/phone[. eq "408-463-4963"] because the dot always refers to exactly one of the phone elements at a time. Another solution is to simply use a general comparison instead: /customerinfo[phone = "408-463-4963"]. Another issue with value comparisons is that they perform string comparisons by default. For example, the XPath /customerinfo/addr[pcode-zip lt 95123] will fail with the following message because it tries to use the lt operator with a numeric value (95123), instead of a string value (“95123”). SQL16003N An expression of data type "xs:integer" cannot be used when the data type "xs:string" is expected in the context. SQLSTATE=10507
You can avoid this error by casting the pcode-zip element to xs:integer, such as [xs:integer(pcode-zip) lt 95123], or by using a general comparison instead.

Value comparisons have one property that general comparisons do not have, and that is transitivity. If x eq y and y eq z then you are safe to conclude that x eq z is also true. This is not possible with the existential semantics of general comparisons for sequences. For example, (1,2,3) = (3,4,5) and (3,4,5) = (5,6,7) are both true, but (1,2,3) = (5,6,7) is false because there is no item in (1,2,3) that is equal to any item in (5,6,7).
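The single-item rule and the transitivity argument can both be checked with a short Python model. This is a sketch of the rules as described in this section, not DB2 code: value_eq and general_eq are hypothetical function names, sequences are represented as Python lists or tuples, and the TypeError merely stands in for the DB2 runtime error.

```python
def general_eq(left, right):
    """Model of '=': true if at least one pair of items is equal."""
    return any(a == b for a in left for b in right)

def value_eq(left, right):
    """Model of 'eq': each operand may hold at most one item."""
    if len(left) > 1 or len(right) > 1:
        raise TypeError("value comparison needs a single item on each side")
    if not left or not right:
        return None  # an empty operand yields an empty result
    return left[0] == right[0]

# A single city per addr works with eq.
single_city = value_eq(["Toronto"], ["Toronto"])

# Several phones per customerinfo make eq fail, while '=' still works.
phones = ["905-555-7258", "416-555-2937", "905-555-8743"]
try:
    value_eq(phones, ["416-555-2937"])
    eq_failed = False
except TypeError:
    eq_failed = True
general_result = general_eq(phones, ["416-555-2937"])

# General comparisons are not transitive on sequences.
first_link = general_eq((1, 2, 3), (3, 4, 5))    # share the item 3
second_link = general_eq((3, 4, 5), (5, 6, 7))   # share the item 5
conclusion = general_eq((1, 2, 3), (5, 6, 7))    # share nothing

print(single_city, eq_failed, general_result)
print(first_link, second_link, conclusion)
```

The last three results show two true comparisons whose expected consequence is false, which cannot happen when every operand is a single item.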
In summary, the use of value comparisons opens up various opportunities for errors but in most cases provides little gain. Most applications do not require transitivity and are well-served with general comparisons. One potential benefit of value comparisons is that you can force errors if you want to be alerted when data types or element occurrences are different than what you expect.
6.15
XPATH AXES AND UNABBREVIATED SYNTAX
We have introduced XPath through a series of practical examples. In a more formal introduction you might read about XPath axes. An axis is the direction of movement when navigating through a document. DB2 supports the child axis, the descendant axis, the attribute axis, the self axis, the parent axis, and the descendant-or-self axis. We have used all of these axes in the examples in the previous sections of this chapter. For example, the path /customerinfo/addr/@country uses the child axis to navigate from customerinfo to its child element addr, and the attribute axis to navigate from addr to its attribute country.

All XPath examples in this book use the so-called abbreviated XPath syntax, because it is simple, easy to understand, and recommended. XPath also offers an unabbreviated syntax, which means that the axes are spelled out explicitly in each step of an XPath. This is rarely used. For example:

Abbreviated:   /customerinfo/addr/@country
Unabbreviated: /child::customerinfo/child::addr/attribute::country

Abbreviated:   /customerinfo//phone
Unabbreviated: /child::customerinfo/descendant-or-self::node()/child::phone

In a nutshell, the unabbreviated XPath syntax is verbose, clumsy, and not used much in practice. We recommend that you do not use it. We have explained it here merely so that you recognize it if it ever crosses your path (no pun intended).
6.16
SUMMARY
XPath is the fundamental language for traversing XML documents, evaluating XML predicates, and retrieving XML values. A thorough understanding of XPath is a prerequisite for querying XML data in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Both SQL/XML and XQuery involve XPath. Understanding XPath begins with understanding the XQuery and XPath data model. This data model is inherently different from the relational model. The better you understand the XQuery data model the easier it is for you to write XML queries. Every value in the XQuery and XPath data model is a sequence of zero, one, or multiple items. An item is either an atomic value or a node. Commonly used nodes include document nodes, element nodes, attribute nodes, and text nodes. Element nodes can include child nodes to form hierarchies
of nodes, such as XML documents. Hence, a sequence of zero, one, or multiple XML documents is a value in the XQuery and XPath data model. A sequence of individual elements, a sequence of integer numbers, and so on are also values in the data model. Every XQuery or XPath query takes a value of this data model as input and produces another value of the data model as output. Most commonly an XPath expression consists of one or multiple steps, separated by a slash (/), where each step is an element name or wildcard. This allows you to navigate into an XML document tree to select specific elements. If you want to select attribute nodes then the last step in a path must be an attribute name that’s preceded by the @ sign. Since an XML document can contain elements that occur multiple times, a single XPath expression may select multiple nodes. At each step an XPath can contain a predicate to restrict the search in the document. XPath predicates must be enclosed in square brackets. The evaluation of XPath expressions and predicates is always based on existential semantics. Roughly speaking, existential semantics means that the existence of at least one matching item is sufficient for a predicate to evaluate to true. This is of particular importance when you query XML documents with repeating elements. Repeating XML elements and existential semantics are some of the most profound differences between the XML world and relational world. In the following chapters you learn how to use XPath in SQL/XML and XQuery.
CHAPTER 7
Querying XML Data with SQL/XML
The SQL language standard includes a variety of functions and features to process XML data. This functionality is commonly referred to as SQL/XML. The SQL/XML functions that allow you to embed XPath and XQuery expressions in SQL are of particular interest. These functions enable you to use familiar SQL statements enriched with XPath expressions to query XML data in a DB2 database. They also facilitate the simultaneous processing of XML and relational data in the same query. This marriage of two worlds, XML and relational, is extremely powerful and versatile.
Although SQL/XML allows the integration of SQL and XQuery, this chapter focuses on the integration of SQL and XPath, which is supported in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. The discussion of SQL/XML in this chapter assumes that you have a good understanding of XPath (see Chapter 6, Querying XML Data: Introduction and XPath). The examples in this chapter also use the same two sample documents that were used throughout Chapter 6. Please refer to Figure 6.7 in section 6.3, Sample Data for XPath, SQL/XML, and XQuery. All examples are based on the following customer table:
CREATE TABLE customer(id INTEGER, info XML)
We assume that this table contains two rows with values 1003 and 1004 in the id column, and the two documents from Figure 6.7 in the XML column info. The remainder of this chapter is structured as follows:
• An overview of SQL/XML is given in section 7.1.
• The core SQL/XML functionality for extracting selected information from XML documents and defining XML predicates is covered in sections 7.2, 7.3, and 7.4.
• Common mistakes with SQL/XML predicates are highlighted in section 7.5.
• Parameter markers, dynamically computed XPath, sorting of XML data, and handling of binary data are discussed in sections 7.6 through 7.9.
7.1 OVERVIEW OF SQL/XML
The term SQL/XML refers to the XML-specific features and functions in the SQL:2003 and SQL:2006 standards. SQL/XML defines the following:
• The XML data type, which is a regular SQL type just like INTEGER or CHAR for example. SQL/XML defines the semantics of this type, not its storage format.
• Functions that convert XML type values to and from non-XML data types, such as CHAR, VARCHAR, CLOB, and others. These functions are XMLSERIALIZE, XMLPARSE, and XMLCAST.
• The function XMLVALIDATE for XML Schema validation and the predicate IS VALIDATED, which checks the validation status of an XML document or fragment.
• XML publishing functions, also sometimes called constructor functions, such as XMLELEMENT, XMLATTRIBUTES, and XMLAGG, which allow you to construct new XML documents or fragments. The input data for such XML construction can come from relational columns, from XML columns, or both. This topic is covered in Chapter 10, Producing XML from Relational Data.
• Functions to embed XPath and XQuery in SQL statements. These functions are XMLQUERY, XMLTABLE, and the XMLEXISTS predicate.
All of these SQL/XML functions are supported in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. In this chapter we focus on the following:
• XMLQUERY: A scalar function that is typically used in the SELECT clause of an SQL query to extract XML fragments or values from an XML document.
• XMLTABLE: A table function that is used in the FROM clause of an SQL statement. It reads one or multiple values from an XML document and returns them as a set of rows.
• XMLEXISTS: A predicate that is commonly used in the WHERE clause of an SQL statement to express predicates over XML data.
• XMLCAST: A function that converts individual XML values to SQL data types.
Now, let’s turn to examples to see how these functions work.
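Before looking at the query functions, the conversion functions from the list above can be sketched briefly. This is a minimal illustration, not part of the original example set. The CLOB lengths, the column aliases, and the small note document are arbitrary choices, and SYSIBM.SYSDUMMY1 is used only as a one-row helper table.

```sql
-- XML to character data: serialize the stored document of customer 1003
SELECT XMLSERIALIZE(info AS CLOB(1M)) AS info_text
FROM customer
WHERE id = 1003;

-- Character data to XML and back again; the <note> document is made up
SELECT XMLSERIALIZE(
         XMLPARSE(DOCUMENT '<note>hypothetical</note>') AS CLOB(1K)) AS round_trip
FROM sysibm.sysdummy1;
```

Neither statement changes the customer table, so the two sample rows assumed throughout this chapter remain as they are.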
7.2 RETRIEVING XML DOCUMENTS OR DOCUMENT FRAGMENTS WITH XMLQUERY
The simplest way of retrieving XML data with SQL is to include an XML column name in the SELECT list of an SQL query. For example, the SQL statement in Figure 7.1 returns a single column of type XML (info) and two rows, one row for each of our two sample documents in the customer table. Below the SQL statement in Figure 7.1 you see a corresponding XQuery that returns the same result.
--SQL:
SELECT info FROM customer;
--XQuery:
xquery db2-fn:xmlcolumn('CUSTOMER.INFO');
Figure 7.1 Retrieve all documents from the table
You can extend the SQL query in Figure 7.1 with other features of the SQL language, such as a WHERE clause to select only specific rows (documents) from the table. This is shown in Figure 7.2, together with an equivalent XQuery for comparison.
--SQL:
SELECT info FROM customer WHERE id = 1003;
--XQuery:
xquery db2-fn:sqlquery('SELECT info FROM customer WHERE id = 1003');
Figure 7.2 Retrieve selected documents from the table
In many situations it is desirable not to retrieve full documents from the database, but just specific XML elements, attributes, or fragments that are of interest. For example, if you only need to retrieve the customer names, you can use the XMLQUERY function in the SELECT clause to extract just that element (see Figure 7.3). The argument of the XMLQUERY function can be any XQuery or XPath expression. This expression needs to know which column to operate on, because a table could have multiple XML columns. The solution is to prefix the XPath with $INFO, a reference to the XML column in our sample table. This reference has to be in uppercase and must start with the $ sign (see section 7.2.1 for details). The SQL/XML statement in Figure 7.3 uses SQL as the top-level language and has an embedded XPath expression. Below it you see a corresponding XQuery that executes the same XPath expression without the use of any SQL. The query result and performance are the same. Note that the return type of the XMLQUERY function is always XML. We will later discuss cases where SQL/XML can have advantages over XQuery and vice versa.
--SQL/XML:
SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer;
--XQuery:
xquery db2-fn:xmlcolumn('CUSTOMER.INFO')/customerinfo/name;
--Output:
<name>Robert Shoemaker</name>
<name>Matt Foreman</name>
2 record(s) selected.
Figure 7.3 Extracting one element from each document
The XMLQUERY function in Figure 7.3 is a scalar function, which means that it takes one value as input and produces one value as output. The XMLQUERY function is applied to one row at a time and so its input value is always the XML document of the current row. The XMLQUERY function typically does not process XML documents from multiple rows at the same time. Its output value is the result of the XPath expression applied to the current document. This result is always a sequence of zero, one, or more items. Such a sequence represents a single value (instance) of the XQuery Data Model.
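The following sketch illustrates the "one output value per input row" behavior. It relies on the sample data, in which only Matt Foreman's document contains an assistant element; the column alias is an arbitrary choice, and the $INFO notation assumes DB2 for Linux, UNIX, and Windows 9.5 or higher.

```sql
-- XMLQUERY is evaluated once per row. If the path selects nothing in a document,
-- the value for that row is an empty sequence, but the row is still returned.
SELECT id, XMLQUERY('$INFO/customerinfo/assistant/name') AS assistant
FROM customer;
```

With the two sample rows, this should return two result rows, one of which carries an empty XML value for customer 1003.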
7.2.1 Referencing XML Columns in SQL/XML Functions
Figure 7.3 shows only one of three ways in which the XML column can be referenced inside the XMLQUERY function. Here are all three ways in more detail:
• Direct reference of the XML column name as $INFO. This $INFO is an XQuery variable that is implicitly bound to an XML column of the same name. This is only supported in DB2 for Linux, UNIX, and Windows version 9.5 and higher. It only works if the XML column name is unique across all tables that are referenced in the FROM clause. For brevity we will use this notation in most of the examples in this chapter.
SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer;
• Explicit assignment of the XML column name to an alias of your choice, which is then used as the context at the beginning of the XPath expression. This assignment is done in the passing clause of the XMLQUERY function. It also allows you to qualify the column name with its table name (passing customer.info AS "i") to avoid ambiguity. The variable name $i has to be unique within each SQL/XML function, not across all functions. You will later see that this passing clause also allows you to pass parameter markers or expressions into the embedded XQuery. This is supported since version 9 of DB2 for z/OS and DB2 for Linux, UNIX, and Windows.
SELECT XMLQUERY('$i/customerinfo/name' passing info as "i") FROM customer;
-- query with two tables, both have an XML column "info":
SELECT XMLQUERY('$i/customerinfo/name' passing c1.info as "i"),
       XMLQUERY('$i/customerinfo/name' passing c2.info as "i")
FROM customer c1, customer2 c2;
• No XQuery variable at the beginning of the XPath expression. Instead, the XML column name is identified in the passing clause without assignment to a variable. This is only supported in DB2 for z/OS.
SELECT XMLQUERY('/customerinfo/name' passing info) FROM customer;
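The passing clause is not limited to the XML column. As a hedged sketch of the second notation, the following query also passes the relational column id into the XPath predicate. The variable names i and c are arbitrary choices, and the query assumes that the Cid attribute of each sample document matches the id value of its row.

```sql
-- Pass both the XML column and a relational column into the embedded XPath.
-- $i is bound to the document, $c to the INTEGER value of the id column.
SELECT XMLQUERY('$i/customerinfo/name' PASSING info AS "i")
FROM customer
WHERE XMLEXISTS('$i/customerinfo[@Cid = $c]'
                PASSING info AS "i", id AS "c");
```

Under that assumption both sample rows should qualify. A row whose id differs from the Cid attribute of its document would be eliminated by the XMLEXISTS predicate.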
7.2.2 Retrieving Element Values Without XML Tags
There are several ways in which you can return the customer names without the element tags around them. One option is to use /text() in the XPath expression to only return the text node of the name element, as in Figure 7.4 (a). The column in the query result set is still of type XML. Alternatively, you can wrap the function XMLCAST() around the XMLQUERY function to convert the XML result to a non-XML type, as in Figure 7.4 (b). XMLCAST() automatically removes the tags from the returned elements. The output is the same as from Figure 7.4 (a), except that the return type is VARCHAR(25) instead of XML.
--(a) SQL/XML:
SELECT XMLQUERY('$INFO/customerinfo/name/text()') FROM customer;
--(b) SQL/XML:
SELECT XMLCAST( XMLQUERY('$INFO/customerinfo/name') AS VARCHAR(25))
FROM customer;
--Output:
Robert Shoemaker
Matt Foreman
2 record(s) selected.
Figure 7.4 Returning element values without tags
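XMLCAST is not restricted to character types. The following sketch, which is an illustration and not one of the original figures, casts the Cid attribute to an SQL INTEGER; the column alias cid is an arbitrary choice.

```sql
-- Convert the Cid attribute value of each document to an SQL INTEGER.
-- XMLCAST expects a sequence of at most one item per row.
SELECT XMLCAST(XMLQUERY('$INFO/customerinfo/@Cid') AS INTEGER) AS cid
FROM customer;
```

Because XMLCAST accepts at most one item, the same pattern applied to a repeating element such as phone can be expected to fail for documents that contain more than one phone element.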
A common requirement is to retrieve multiple values from a document, such as the customers’ street and city, and to return them in separate columns of the same result row. Separate columns can be produced by using multiple XMLQUERY functions in the SELECT clause (see Figure 7.5). The same can be achieved with the XMLTABLE function, which is discussed later. Figure 7.5 also shows that you can return a mix of relational columns and XML values.
SELECT id,
       XMLQUERY('$INFO/customerinfo/addr/street/text()'),
       XMLQUERY('$INFO/customerinfo/addr/city/text()')
FROM customer;
1003 845 Kean Street Aurora
1004 1596 Baseline   Toronto
2 record(s) selected.
Figure 7.5 Returning multiple element values in separate columns
7.2.3 Retrieving Repeating Elements with XMLQUERY
The SQL/XML query in Figure 7.6 uses the path expression /customerinfo/phone, which you know returns multiple elements from each of the two input documents. This SELECT statement produces one result row for each of the two input rows. Each result row contains the sequence of phone numbers from the corresponding input document. Each of these two sequences is returned as a string, which the consuming application then needs to break down. However, such a sequence of two or more phone elements is not a well-formed XML document, because a single common root element is missing. Hence, if your application uses an XML parser to process this non-well-formed query result, it will fail with an error.
SELECT id, XMLQUERY('$INFO/customerinfo/phone') FROM customer;
1003 905-555-7258416-555-2937905-555-8743
1004 905-555-4789416-555-3376
2 record(s) selected.
Figure 7.6 Returning a sequence of elements from each document
Figure 7.7 shows the same query with /text(), and you see that the result values in each sequence are simply concatenated.
SELECT id, XMLQUERY('$INFO/customerinfo/phone/text()') FROM customer;
1003 905-555-7258416-555-2937905-555-8743
1004 905-555-4789416-555-3376
2 record(s) selected.
Figure 7.7 Returning a sequence of text nodes from each document
The conclusion is that the XMLQUERY function is typically not very useful to return repeating elements. As a solution, use the XMLTABLE function, which is explained in the next section.
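If one XML value per row is nevertheless required, one possible workaround is to wrap the repeating elements in a single root element with an XQuery element constructor. This is an assumption-laden sketch and not the approach recommended in the text: it requires XQuery support inside XMLQUERY, which is available in DB2 for Linux, UNIX, and Windows, and the element name phones is an arbitrary choice.

```sql
-- Wrap all phone elements of a document in one constructed root element
-- so that the value returned for each row is a well-formed fragment.
SELECT id, XMLQUERY('<phones>{$INFO/customerinfo/phone}</phones>') AS phones
FROM customer;
```

This still returns one row per document. To obtain one row per phone element, a table function is needed.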
7.3 RETRIEVING XML VALUES IN RELATIONAL FORMAT WITH XMLTABLE
The XMLTABLE function is very versatile and one of the most powerful SQL/XML functions. Let’s start with some simple examples of the XMLTABLE function and then get back to returning the repeating phone elements in a more suitable format.
7.3.1 Generating Rows and Columns from XML Data
The query in Figure 7.8 uses the XMLTABLE function in the FROM clause. The XMLTABLE function references the info column and is therefore implicitly joined with the table customer.
SELECT T.*
FROM customer,
     XMLTABLE('$INFO/customerinfo'
       COLUMNS
         custID   INTEGER     PATH '@Cid',
         custname VARCHAR(20) PATH 'name',
         street   VARCHAR(20) PATH 'addr/street',
         city     VARCHAR(16) PATH 'addr/city') AS T;
CUSTID CUSTNAME             STREET               CITY
------ -------------------- -------------------- ----------------
  1003 Robert Shoemaker     845 Kean Street      Aurora
  1004 Matt Foreman         1596 Baseline        Toronto
2 record(s) selected.
Figure 7.8 Using XMLTABLE to return XML values in relational columns
In DB2 for z/OS the XMLTABLE function must contain a PASSING clause to define the reference to the XML column, like this:
XMLTABLE('$i/customerinfo' PASSING info AS "i"
The XMLTABLE function contains one row-generating XQuery expression and, in the COLUMNS clause, multiple column-generating expressions. The row-generating expression is the XPath $INFO/customerinfo. It is applied to each XML document in the XML column and can produce zero, one, or multiple rows per document. In this example it produces one customerinfo element (fragment) per document, and the output of the XMLTABLE function contains one row for each of these customerinfo elements. The number of elements produced by the row-generating XQuery expression determines the number of rows produced by the XMLTABLE function.
The COLUMNS clause transforms XML data into relational format. Each of the entries in this clause defines a column with a column name and an SQL data type. In Figure 7.8, the returned rows have four columns named custID, custname, street, and city. The values for each column are extracted from the customerinfo fragments that are produced by the row-generating expression, and then cast to the SQL data types. For example, the path addr/city is applied to each customerinfo element to obtain the value for the column city.
The row-generating expression provides the context for the column-generating expressions. This means that the column-generating expressions are not absolute paths, but relative to the row-generating expression. You can typically append the column-generating expressions to the row-generating expression to get an intuitive idea of what a given XMLTABLE function returns in its columns.
The result set of the XMLTABLE query can be treated like any SQL table. You can query and manipulate it much like you use regular row sets or views. The column definitions in the COLUMNS clause can use any SQL data type, such as INTEGER, DECIMAL, CHAR, DATE, and so on. If an extracted XML value cannot be cast to the assigned SQL type, the query fails with an error message.
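As a small illustration of treating the XMLTABLE result like a regular table, the following sketch applies an ordinary WHERE clause and ORDER BY clause to the generated columns. It reuses the column definitions of Figure 7.8; the filter value Toronto comes from the sample data, and the $INFO notation assumes DB2 for Linux, UNIX, and Windows 9.5 or higher.

```sql
-- The columns produced by XMLTABLE can be filtered and sorted with plain SQL.
SELECT T.custname, T.city
FROM customer,
     XMLTABLE('$INFO/customerinfo'
       COLUMNS
         custname VARCHAR(20) PATH 'name',
         city     VARCHAR(16) PATH 'addr/city') AS T
WHERE T.city = 'Toronto'
ORDER BY T.custname;
```

With the sample data this should return only the row for Matt Foreman. Note that a relational predicate of this kind is applied to the extracted values, which is different from an XML predicate inside the row-generating expression.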
DB2 for Linux, UNIX, and Windows also allows you to use the db2-fn:xmlcolumn() or db2-fn:sqlquery() functions in the row-generating expression of the XMLTABLE function (see Figure 7.9). In this case you omit the table name customer from the FROM clause. The query result is the same as in Figure 7.8. (This is not available in DB2 for z/OS.)
SELECT T.*
FROM XMLTABLE('db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo'
       COLUMNS
         custID   INTEGER     PATH '@Cid',
         custname VARCHAR(20) PATH 'name',
         street   VARCHAR(20) PATH 'addr/street',
         city     VARCHAR(16) PATH 'addr/city') AS T;
Figure 7.9 Alternative syntax in DB2 for Linux, UNIX, and Windows
7.3.2 Dealing with Missing Elements
XML data can contain optional elements that are not present in all documents. For example, in our sample data you can see that Robert Shoemaker does not have an assistant element. What happens if the optional element assistant is referenced in the row-generating or a column-generating expression, respectively? Let’s look at these two cases separately. In Figure 7.10 the optional assistant element is referenced in the row-generating expression of the XMLTABLE function. The query seeks to return the name and phone number of all assistants in our customer data. Since the XMLTABLE function returns exactly one row for each node that is produced by the row-generating expression, it does not return any rows for the documents that do not contain an assistant element. Therefore, the query in Figure 7.10 returns the name and phone number of Matt Foreman’s assistant, but no information from Robert Shoemaker’s XML document where no assistant element is present. We will revisit this situation at the end of section 7.3 in a more complex scenario.
SELECT T.*
FROM customer,
     XMLTABLE('$i/customerinfo/assistant' PASSING info AS "i"
       COLUMNS
         a_name  VARCHAR(20) PATH 'name',
         a_phone VARCHAR(20) PATH 'phone') AS T;
A_NAME               A_PHONE
-------------------- --------------------
Gopher Runner        416-555-3426
1 record(s) selected.
Figure 7.10 Optional element in the row-generating expression
In Figure 7.11 the optional assistant element is referenced in a column-generating expression of the XMLTABLE function. This query intends to return the customer name and the assistant name from each document. For each document where the assistant element does not exist, the column expression assistant/name produces an empty sequence, which is automatically converted to a NULL value.
SELECT T.*
FROM customer,
     XMLTABLE('$i/customerinfo' PASSING info AS "i"
       COLUMNS
         c_name VARCHAR(20) PATH 'name',
         a_name VARCHAR(20) PATH 'assistant/name') AS T;
C_NAME               A_NAME
-------------------- --------------------
Robert Shoemaker     NULL
Matt Foreman         Gopher Runner
2 record(s) selected.
Figure 7.11 Optional element in a column-generating expression
If you prefer to generate a default value for missing elements instead of NULL values, use the default clause to define a default value other than NULL. This is done in Figure 7.12.
SELECT T.*
FROM customer,
     XMLTABLE('$i/customerinfo' PASSING info AS "i"
       COLUMNS
         c_name VARCHAR(20) PATH 'name',
         a_name VARCHAR(20) default 'none' PATH 'assistant/name') AS T;
C_NAME               A_NAME
-------------------- --------------------
Robert Shoemaker     none
Matt Foreman         Gopher Runner
2 record(s) selected.
Figure 7.12 Defining a default value for missing elements
7.3.3 Avoiding Type Errors
Be aware that every expression in the COLUMNS clause must return a value that can be cast to the specified data type. Otherwise the XMLTABLE execution fails. Consider the following cases:
• Incompatible data types. For example, the query in Figure 7.8 fails when it encounters an XML document where the Cid attribute has a non-numeric value, which cannot be cast to INTEGER.
• String length. If the XMLTABLE function defines a column of type CHAR(n) or VARCHAR(n), and the column-generating expression produces a string value that’s longer than n, then one of two things happens:
  • The value is truncated to n bytes, without warning or error. This truncation is mandated by the latest SQL/XML standard and implemented in DB2 for z/OS.
  • The query fails with error SQL16061N. This behavior was allowed by a previous version of the SQL/XML standard and is still effective in DB2 for Linux, UNIX, and Windows.
The following examples show how such cases can be handled. In Figure 7.13, the definition of the custID column uses the XQuery if-then-else and castable expressions to check whether the Cid attribute can indeed be cast to INTEGER, and returns -1 if not. The value for the column custname is produced by the substring function so that only the first 20 characters of the actual name are used. The column-generating expression for the city uses if-then-else and the string-length function to test the length of the city value and returns an error flag if it is too long. Such techniques can be useful if strict data types are not enforced with XML Schema validation.
SELECT T.*
FROM customer,
     XMLTABLE('$INFO/customerinfo'
       COLUMNS
         custID   INTEGER     PATH '(if (@Cid castable as xs:integer) then @Cid else -1)',
         custname VARCHAR(20) PATH 'name/substring(.,1,20)',
         street   VARCHAR(20) PATH 'addr/street',
         city     VARCHAR(16) PATH 'addr/city/(if (string-length(.)
db2 => xquery for $i in (1,5,3) return <result>{$i}</result>;
<result>1</result>
<result>5</result>
<result>3</result>
3 record(s) selected.
db2 => xquery let $j := (1,5,3) return <result>{$j}</result>;
<result>1 5 3</result>
1 record(s) selected.
Figure 8.5 The difference between for and let
8.2.3 Understanding the where and order by Clauses
Figure 8.6 shows two more versions of the previous query with the for clause. The first version has an additional where clause to restrict the result set to values greater than 2. The second query in Figure 8.6 adds an order by clause to return the result items in ascending order. Both the where and the order by clause use the variable $i that is introduced in the for clause.
db2 => xquery for $i in (1,5,3) where $i > 2 return <result>{$i}</result>;
<result>5</result>
<result>3</result>
2 record(s) selected.
db2 => xquery for $i in (1,5,3) where $i > 2 order by $i return <result>{$i}</result>;
<result>3</result>
<result>5</result>
2 record(s) selected.
Figure 8.6 The effect of the where and order by clauses
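As a small variation for experimenting, the sort direction can be reversed with the descending keyword of the order by clause. This sketch is not one of the original figures. It returns the atomic values directly instead of constructing elements.

```sql
-- Same filter as in Figure 8.6, but sorted from the highest to the lowest value
xquery for $i in (1,5,3) where $i > 2 order by $i descending return $i;
```

This should return the value 5 followed by the value 3.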
8.2.4 FLWOR Expressions with Multiple for and let Clauses
An XQuery FLWOR expression can contain multiple for or let clauses. Figure 8.7 shows two nested for clauses that act similarly to nested loops in a programming language. The outer for clause iterates over the sequence (1,5,3) and the inner for iterates over the sequence ("a","b"). For each iteration of the outer for clause, the inner for clause iterates over all the items in its sequence. This generates the full Cartesian product between the input sequences. An analogy in the SQL world is a SELECT statement with two tables in the FROM clause and no join predicate.
db2 => xquery for $i in (1,5,3) for $j in ("a","b") return <result>{$i,$j}</result>;
<result>1 a</result>
<result>1 b</result>
<result>5 a</result>
<result>5 b</result>
<result>3 a</result>
<result>3 b</result>
6 record(s) selected.
Figure 8.7 Two nested for clauses produce a Cartesian product
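The SQL analogy mentioned above can be sketched without creating any tables by using VALUES clauses as inline tables. This is an illustration for DB2 for Linux, UNIX, and Windows; the correlation names a and b and the column names i and j are arbitrary choices.

```sql
-- Two row sources in the FROM clause and no join predicate:
-- every row of a is combined with every row of b (3 x 2 = 6 rows).
SELECT a.i, b.j
FROM (VALUES 1, 5, 3) AS a(i),
     (VALUES 'a', 'b') AS b(j);
```

One difference to keep in mind is ordering. The XQuery in Figure 8.7 iterates over ordered sequences, whereas the order of the SQL result rows is not defined without an ORDER BY clause.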
The XQuery in Figure 8.8 also contains two nested for clauses. Their input sequences contain a common item, the atomic value 5, which is identified by a join predicate in the where clause. This is analogous to an SQL join. The difference is that SQL operates on sets of relational rows while XQuery operates on sequences of items. In these examples the items are just atomic values to allow for an easy introduction of the language. In the following sections we return to the customer sample data where the items are XML nodes, including elements, attributes, and full documents.
db2 => xquery for $i in (1,5,3) for $j in (7,5) where $i = $j return <result>{$i,$j}</result>;
<result>5 5</result>
1 record(s) selected.
Figure 8.8 Two nested for clauses with a join predicate
Since the XQuery let clause does not iterate, it does not contribute to the generation of a Cartesian product of sequences. For example, the query in Figure 8.9 contains a for clause and two let clauses. Each iteration of the for clause leads to one item in the query result. The return clause constructs result elements. The value of each result element is the sequence of the values of the variables $i, $j, and $k.
db2 => xquery for $i in (1,5,3) let $j := ("a","b") let $k := $i * 2 return <result>{$i,$j,$k}</result>;
<result>1 a b 2</result>
<result>5 a b 10</result>
<result>3 a b 6</result>
3 record(s) selected.
Figure 8.9 A FLWOR expression with for and let clauses
All variable names in XQuery have to be preceded by the dollar sign ($). The XQuery standard allows one or multiple spaces between the dollar sign and the beginning of the actual variable name, so that both $var and $ var are valid variable names. However, for readability and to avoid confusion it’s best to not use spaces. The same applies to hyphens. Note that $a-b and $ a-b are valid variable names that happen to contain a hyphen. But a - b is interpreted as an arithmetic operation because there are spaces between the hyphen and the characters a and b.
LEARNING XQUERY
When it comes to learning a new language there is no better way than learning by doing. We suggest that you download and install the latest version of DB2 Express-C, which is free, so that you can run the XQuery examples in this section hands-on. The examples show that you can explore the behavior of XQuery even without any tables in the database. We encourage you to extend and modify these examples and to try other combinations of for, let, where, order by, and return clauses. You may find that XQuery becomes intuitive quite quickly.
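One such experiment concerns the rule about hyphens in variable names described above. The following sketch uses three made-up variables, $a-b, $a, and $b, with arbitrary values.

```sql
-- $a-b is a single variable whose name contains a hyphen.
-- $a - $b, written with spaces around the hyphen, is a subtraction.
xquery let $a-b := 10 let $a := 7 let $b := 2 return ($a-b, $a - $b);
```

The query should return the two values 10 and 5, which shows that the first expression reads the variable $a-b while the second one subtracts $b from $a.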
8.3 COMPARING FLWOR EXPRESSIONS, XPATH EXPRESSIONS, AND SQL/XML
This section compares and examines XPath, FLWOR, and SQL/XML queries in several ways. We look at traversing XML documents to extract specific elements, coding and placing XML predicates, result set cardinalities, and the integration of FLWOR expressions in SQL statements. We discuss several examples of how “the same” query can be written in several different ways. By “the same” we mean that the same result is returned from the sample data. The examples are not exhaustive; that is, they do not show all possible ways in which a certain query can be written.
8.3.1 Traversing XML Documents
Figure 8.10 illustrates five different ways to retrieve the customer name elements. There is no significant performance difference between them, but for readability and maintainability it is a good idea to use as simple a syntax as possible to express a query. Hence, options (4) and (5) are good choices in Figure 8.10.
The first FLWOR expression in Figure 8.10 iterates over the customerinfo elements and binds them to the variable $c, one at a time. The return clause then uses $c as the context to navigate to the name element. The second FLWOR expression iterates directly over the name elements and binds them to the variable $n, one at a time. The return clause then only emits the values of $n. The navigation to the name element has shifted from the return clause to the for clause. The third FLWOR expression iterates over the customer documents; that is, over the document nodes that are at the top of each document tree. The return clause then navigates from these document nodes, represented by $i, to the customerinfo/name elements. You will see shortly that the decision of what to iterate over in the for clause makes a difference as soon as you add predicates to the query. The fourth expression is a simple XPath that returns the sequence of all name elements. The fifth query is an SQL/XML statement that uses the XMLQUERY function to extract the name elements.
--(1)
xquery for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return $c/name;
--(2)
xquery for $n in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name
return $n;
--(3)
xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")
return $i/customerinfo/name;
--(4)
xquery db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name;
--(5)
SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer;
Figure 8.10 Five different ways to retrieve the customer name elements
8.3.2 Using XML Predicates
Figure 8.11 extends the sample queries of Figure 8.10 by adding a predicate to only return the name of the customer whose Cid attribute has the value 1003. All five queries return the same result.
Again, the first two FLWOR expressions in Figure 8.11 differ in whether the step to the name element happens in the for or the return clause. This difference affects the where clause, which uses the variable from the for clause. If the for clause assigns the variable $i to customerinfo elements, then the where clause can simply use the XPath $i/@Cid to access the Cid attribute. This is because Cid is an attribute of the customerinfo element. The second FLWOR expression, however, binds the variable $i to name elements. This forces the where clause to use a parent step to navigate from $i to the Cid attribute. This is an extra navigation step, which makes the second FLWOR expression slightly more expensive. The third FLWOR expression shows that filtering predicates can not only be located in the where clause but also in the XPath expression of the for clause. In fact, the entire query can again be expressed as a single XPath, which is the fourth query. And finally, the fifth query is an SQL/XML statement, which uses the XMLEXISTS predicate to properly include the filtering condition.
--(1)
xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/@Cid = 1003
return $i/name;
--(2)
xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name
where $i/../@Cid = 1003
return $i;
--(3)
xquery for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid = 1003]
return $c/name;
--(4)
xquery db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid = 1003]/name;
--(5)
SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');
Figure 8.11 Five different ways to apply a predicate
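To see why query (5) places the condition in XMLEXISTS, it can help to compare it with a variant that puts the predicate only into the XMLQUERY function. The first statement below is a deliberately flawed sketch for comparison; it is not one of the original queries.

```sql
-- Predicate only in the SELECT list: XMLQUERY is a scalar function,
-- so one value is still returned for every row of the table.
SELECT XMLQUERY('$INFO/customerinfo[@Cid = 1003]/name') FROM customer;

-- Predicate in XMLEXISTS, as in query (5): rows without a match are eliminated.
SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');
```

With the two sample rows, the first statement should return two rows, one of them with an empty XML value, while the second returns only the row for customer 1003.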
The next example (Figure 8.12) shows four different queries that return phone elements whose attribute type has the value cell. The first FLWOR expression uses two nested for clauses. The outer for clause iterates over the customerinfo elements and assigns them to the variable $c. The inner for clause uses the path $c/phone to iterate over the phone elements of the current customer. For each such phone element, the where clause checks whether the type attribute has the value cell. If so, the return clause returns that phone element.
The second FLWOR expression shows that the same query result can be achieved without nested for clauses. It uses only a single for clause to iterate directly over the phone elements. The predicate could be applied in the where clause, but this query adds the predicate to the return clause. You will see later that predicates in the return clause can lead to different query results if element construction is involved. The third query is a simple XPath without any FLWOR clauses. The last query is an SQL/XML statement that uses the XMLTABLE function to produce one result row per cell phone, just like the other queries.
--(1)
xquery for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
for $p in $c/phone
where $p/@type = "cell"
return $p;
--(2)
xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone
return $i[@type = "cell"];
--(3)
xquery db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone[@type="cell"];
--(4)
SELECT T.phone
FROM customer,
     XMLTABLE('$INFO/customerinfo/phone[@type="cell"]'
       COLUMNS phone XML PATH '.') as T;
Figure 8.12 Four different queries that return the same phone elements
NOTE
An advantage of SQL/XML queries is that they can contain parameter markers and host variables in their predicates, as discussed in section 7.6. This is not possible when you use XQuery without SQL.
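As a brief, hedged sketch of the point made in the note, the following SQL/XML statement passes a parameter marker into the XPath predicate. The variable name c is an arbitrary choice, and the parameter marker is given an explicit type with CAST so that DB2 knows which data type to expect. The statement has to be prepared and executed by an application that supplies the value.

```sql
-- The value for the parameter marker (?) is supplied at execution time
-- and is visible inside the XPath expression as the variable $c.
SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = $c]'
                PASSING CAST(? AS INTEGER) AS "c");
```

Executing the prepared statement with the value 1003 should select the same row as the queries in Figure 8.11.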
8.3.3 Result Set Cardinalities in XQuery and SQL/XML
Let’s look at result set cardinalities using the three queries in Figure 8.13 as examples. Each of the three queries returns all five customer phone numbers, three from one of our sample documents and two from the other. The first query is an XPath expression that produces a sequence of five text nodes, and each item in that sequence is returned as a separate result row. The second query uses the XMLQUERY function and returns the same five phone numbers in two result rows. The reason is that XMLQUERY is a scalar function in an SQL statement, and scalar functions produce one value for each input row. In our example there are two input rows (documents) and for each of them XMLQUERY produces one sequence of phone numbers. You can turn the items in these sequences into separate rows only if you use a table function (as opposed to a scalar function), which generates a set of rows. This is what the XMLTABLE function does.
xquery db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone/text();
905-555-7258
416-555-2937
905-555-8743
905-555-4789
416-555-3376
5 record(s) selected.
SELECT XMLQUERY('$INFO/customerinfo/phone/text()') FROM customer;
905-555-7258416-555-2937905-555-8743
905-555-4789416-555-3376
2 record(s) selected.
SELECT T.phone
FROM customer,
     XMLTABLE('$INFO/customerinfo/phone'
       COLUMNS phone VARCHAR(20) PATH '.') as T;
905-555-7258
416-555-2937
905-555-8743
905-555-4789
416-555-3376
5 record(s) selected.
Figure 8.13 Three different queries that return the same five phone numbers
A key difference between XPath or XQuery expressions on the one hand and SQL/XML statements on the other is that XPath and XQuery expressions always return a single column of type XML. XQuery cannot return multiple columns in a result set or data types other than XML. SQL/XML statements can read values from XML documents and return them as relational result sets that have multiple columns and traditional SQL data types (see section 7.3, Retrieving XML Values in Relational Format with XMLTABLE).
NOTE
The examples in this section have shown that many simple queries do not require XQuery FLWOR expressions but can be written more simply as plain XPath expressions. Indeed, many applications are well served by combining XPath and SQL and do not necessarily require the extra power of XQuery. However, XQuery has very valuable features that XPath alone does not provide. For example, construction of XML data and joins across multiple XML documents are not possible with XPath alone. Section 8.4 and Chapter 9, Querying XML Data: Advanced Queries and Troubleshooting, provide examples.
8.3.4
Using FLWOR Expressions in SQL/XML
Note that SQL/XML and XQuery are not mutually exclusive. Chapter 7 focused on examples that combine XPath and SQL, which is supported both in DB2 for Linux, UNIX, and Windows and DB2 for z/OS. In DB2 for Linux, UNIX, and Windows, the same SQL/XML functions can also take more complex XQuery expressions as input, such as FLWOR expressions. Figure 8.14 shows an example. It returns the name of the customer whose Cid attribute has the value 1003. Remember that the XMLEXISTS predicate is truly an existence check. If the XQuery or XPath expression in the XMLEXISTS returns an empty sequence, then XMLEXISTS evaluates to FALSE and the current row is eliminated.

SELECT XMLQUERY('for $i in $INFO/customerinfo/name
                 return $i/text()')
FROM customer
WHERE XMLEXISTS('let $i := $INFO/customerinfo
                 where $i/@Cid = 1003
                 return $i');
Figure 8.14
Return the name of the customer whose Cid is 1003
If the same result can be achieved with simple XPath then for simplicity it is recommended to avoid FLWOR expressions in SQL/XML functions. For example, the query in Figure 8.15 is simpler than the query in Figure 8.14 and returns an identical result set.
SELECT XMLQUERY('$INFO/customerinfo/name/text()')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');
Figure 8.15
A simpler query to return the same result as Figure 8.14
Figure 8.16 provides an example of how you should not integrate XQuery in SQL. The problem with this query is that the predicate on the Cid attribute is included in the FLWOR expression in the SELECT clause of the SQL statement. In this location, the predicate does not eliminate any rows from the customer table. To work as expected, the predicate needs to be in the WHERE clause of the SQL statement, using XMLEXISTS. This issue has been discussed in section 7.5, Common Mistakes with SQL/XML Predicates.

SELECT XMLQUERY('for $i in $INFO/customerinfo/name
                 where $i/@Cid = 1003
                 return $i/text()')
FROM customer;
Figure 8.16
Do not place row-filtering predicates in the SELECT clause!
8.4
CONSTRUCTING XML DATA
Constructing XML data in XQuery is easy. You can simply type regular XML tags as part of your XQuery. This method is called direct XML construction. For example, an XML element or document just by itself is already a valid XQuery expression. Figure 8.17 is a simple example where the XQuery consists of nothing but a direct element constructor. The name of the constructed element is title and its value is the literal string Hello. The result of the XQuery is the constructed element itself. This cannot be done with XPath alone.

db2 => xquery <title>Hello</title>;

<title>Hello</title>

  1 record(s) selected.

db2 =>
Figure 8.17
Constructing the element title with the value "Hello"
8.4.1
Constructing Elements with Computed Values
It is often desirable to generate XML elements whose values are dynamically computed during query execution. Constructed elements can have computed values if they contain XQuery variables or other dynamic expressions. Such expressions must be enclosed in curly brackets and are often used in the return clause of a FLWOR expression. For example, the query in Figure 8.18 retrieves the name and city values, and returns this information in a newly constructed XML document.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/@Cid = 1003
return {$i/name/text()} {$i/addr/city/text()} ;

Robert ShoemakerAurora

  1 record(s) selected.
Figure 8.18
Construction of an XML document with dynamic values
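As a minimal sketch of a constructor with computed values, the following query wraps the customer name and city in newly constructed elements. The element names customer, fullname, and location are hypothetical and chosen only for illustration; any names that fit your target format would do.

```xquery
-- Element names customer, fullname, and location are illustrative assumptions.
xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/@Cid = 1003
return <customer>
         <fullname>{$i/name/text()}</fullname>
         <location>{$i/addr/city/text()}</location>
       </customer>;
```

Each expression in curly brackets is evaluated for the current customer, and its result becomes the content of the enclosing constructed element.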
Several things are noteworthy about Figure 8.18. The returned XML data uses XML element names that do not exist in the XML documents that are stored in the table. In other words, the query reads one XML format but returns another. This performs a transformation of the data. Although XQuery is not always a substitute for XSLT (Extensible Stylesheet Language Transformations), it can carry out many transformations easily and efficiently. In contrast to Figure 8.17, the values of the constructed elements in Figure 8.18 are not provided as literal strings but computed by XPath expressions. These XPath expressions must be enclosed in curly brackets to indicate that they are to be evaluated and not used as literal string values. If you forget the curly brackets, the query result contains the actual path expressions, which is not useful: $i/name/text()

10 return $i;

SQL16061N The value "Unshipped" cannot be constructed as, or cast (using an implicit or explicit cast) to the data type "xs:double". Error QName=err:FORG0001. SQLSTATE=10608
Figure 8.33
Cannot compare xs:string to xs:double!
What if you have some documents where the Status attribute contains numeric values and some documents where it contains alphanumeric string values? In that case you might still want to use the query in Figure 8.33 to find all orders whose Status has a numeric value greater than 10. You can use the XQuery expression castable together with the if-then-else expression to apply the numeric predicate only if the Status attribute of a given document is a valid integer number. For all other documents the value false is produced to exclude them from the result set.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/(
        if (@Status castable as xs:integer)
        then (@Status > 10)
        else false() )
return $i;
Figure 8.34
XQuery with the expression castable
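The castable expression can also be tried in isolation in the Command Line Processor. The following sketch uses two literal strings as stand-ins for Status values; "Unshipped" is taken from the error message above, while "15" is a made-up value for illustration.

```xquery
-- Returns false: the string is not a valid integer.
xquery "Unshipped" castable as xs:integer;

-- Returns true: the string can be cast to the integer 15.
xquery "15" castable as xs:integer;
```

Because castable only tests whether a cast would succeed, it never raises the conversion error itself, which is what makes it suitable as a guard in front of a numeric predicate.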
The SQL/XML statement in Figure 8.35 intends to read all purchase orders where the first item in the order is less expensive than the second item. Clearly, the purchase order in Figure 8.28 should be in the result set because the price of its first item is 9.99 while the price of the second item is 49.99. But, contrary to what you might expect, the predicate in Figure 8.35 does not select the purchase order in Figure 8.28. Let’s examine why that is. First of all, note that the predicate [item[1]/price < item[2]/price] does not include any literal value that could provide an indication of the data type of the comparison. Hence, according to the XQuery standard, DB2 simply performs a string comparison, and the string “9.99” is greater than the string “49.99”. In summary, the query in Figure 8.35 runs, but does not work the way you want.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[
                   item[1]/price < item[2]/price]');
Figure 8.35
String comparison between two elements in a document
The solution is to cast either the left side of the predicate, or the right side, or both to xs:double, as shown in Figure 8.36. If at least one of the two operands is cast to a specific data type, then this determines the data type of the comparison operation and DB2 tries to cast the other operand to the same data type. Consequently, the query in Figure 8.36 performs a numeric comparison of the two price elements and therefore includes the purchase order in Figure 8.28 in the result set, as expected.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[
                   item[1]/xs:double(price) < item[2]/price]');
Figure 8.36
Numeric comparison between two elements in a document
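The difference between the two comparisons can be reproduced with literal values in the Command Line Processor. This is only a sketch: string literals have the type xs:string rather than being untyped like the stored price elements, so here both operands are cast explicitly for the numeric case.

```xquery
-- String comparison: the character "9" sorts after "4", so this returns false.
xquery "9.99" < "49.99";

-- Numeric comparison after casting both literals: returns true.
xquery xs:double("9.99") < xs:double("49.99");
```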
Note that the casting functions, which are actually called type constructors, can cast at most one item at a time. The following expression would fail because one purchase order contains multiple item elements, and a sequence of two or more items cannot be cast to a double value.

xs:double($i/PurchaseOrder/item/price)

To cast all items in the sequence, use the type constructor at the end of the XPath expression, as in either of the following:

$i/PurchaseOrder/item/xs:double(price)
$i/PurchaseOrder/item/price/xs:double(.)
8.6
ARITHMETIC EXPRESSIONS
XQuery provides arithmetic operators for addition (+), subtraction (-), multiplication (*), division (div), integer division (idiv), and modulus (mod). A subtraction operator must be preceded by whitespace if it could otherwise be interpreted as part of a variable or tag name. For example, price-discount will be interpreted as a single name, but price -discount and price - discount will be interpreted as arithmetic expressions between two separate items. Arithmetic operators can be used with elements, attributes, or a mix of both.
Figure 8.37 provides two examples, one in SQL/XML and one in XQuery notation. Both multiply the quantity and the price of each item in the purchase order that has PoNum=5000. Note that the for clause of the XQuery iterates over item elements and computes the value of each item separately.

SELECT T.id, T.itemvalue
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         id        VARCHAR(15)  PATH 'partid',
         itemvalue DECIMAL(9,2) PATH 'quantity * price') as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum= 5000]');

ID              ITEMVALUE
--------------- -----------
100-100-01            29.97
100-103-01           249.95

  2 record(s) selected.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
          /PurchaseOrder[@PoNum= 5000]/item
let $q := $i/quantity
let $p := $i/price
return {$q * $p};

29.97
249.95

  2 record(s) selected.
Figure 8.37
SQL/XML and XQuery with arithmetic expression
The first step in evaluating an arithmetic expression is to evaluate its operands. If one of the operands is an empty sequence, the result of the arithmetic expression is also an empty sequence. If one of the operands is a sequence of more than one item, a type error is raised. This happens in Figure 8.38. This query iterates over purchase orders, not over items. Since a purchase order typically has multiple items, the let clauses bind a sequence of multiple quantity elements to $q and a sequence of multiple price elements to $p. This leads to an error in the multiplication in the return clause.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
          /PurchaseOrder[@PoNum= 5000]
let $q := $i/item/quantity
let $p := $i/item/price
return {$q * $p};

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "item()" is expected.
Figure 8.38
Operands in an arithmetic expression must be zero or one item
An error is also raised if one of the operands cannot be cast to xs:double. For example, if a quantity element contains the string value “five” then the arithmetic expression fails at runtime. XQuery provides a division operator (div) and an integer division operator (idiv). The latter discards the fractional part of the quotient and returns a value of type xs:integer. For example, the expression 5 div 2 returns the value 2.5, whereas the expression 5 idiv 2 produces the value 2. Like a cast to xs:integer, the idiv operator truncates rather than rounds, so for positive operands the result is always rounded down to the next lower integer value. For testing purposes you can run XQuery expressions with cast and arithmetic operations in the DB2 Command Line Processor, such as in Figure 8.39.

xquery xs:integer(3.9);

3

  1 record(s) selected.

xquery 10 + 100 idiv 9;

21

  1 record(s) selected.
Figure 8.39
Testing XQuery expressions in the CLP
8.7
XQUERY FUNCTIONS
The XQuery language provides a large number of built-in functions. These include aggregate functions such as count and sum, string functions such as contains and starts-with, functions to manipulate date and timestamp values, numeric functions, and others. A complete discussion of all functions is beyond the scope of this book. Appendix C, Further References, contains pointers to the complete reference of all supported XPath and XQuery functions in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. In this section we list only a subset of the available XQuery functions to highlight those that are most frequently used and have been found useful in DB2 pureXML production applications. We provide some examples and encourage you to try more functions and queries hands-on with the DB2 sample database. In general, all functions can be applied to elements as well as to attributes. We categorize the discussion of XQuery functions as follows:
• String functions (section 8.7.1)
• Number and aggregation functions (section 8.7.2)
• Sequence functions (section 8.7.3)
• Node and namespace functions (section 8.7.4)
• Date and time functions (section 8.7.5)
• Boolean functions (section 8.7.6)
All XQuery functions belong to a default namespace that is always implicitly bound to the namespace prefix fn. Since it is a default namespace, the prefix can be omitted. For example, concat and fn:concat refer to the same concatenation function.
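A quick way to confirm that the prefix is optional is to call a function both ways in the Command Line Processor. The two string literals below are arbitrary values chosen for illustration.

```xquery
-- Both statements call the same function and return the string "ab".
xquery concat("a", "b");
xquery fn:concat("a", "b");
```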
8.7.1
String Functions
Some of the most commonly used string functions are listed in Table 8.1.
Table 8.1
Commonly Used String Functions
String Functions
Description
concat
The function fn:concat returns a string that is the concatenation of two or more atomic values.
string-join
The function fn:string-join takes as input a sequence of string values and a separator character. It returns a single string in which the input strings are concatenated but separated by the separator character.
contains
The function fn:contains returns true if a string contains a given substring.
matches
The function fn:matches returns true if a string matches a given regular expression.
starts-with
The function fn:starts-with returns true if a string begins with a given substring.
ends-with
The function fn:ends-with returns true if a string ends with a given substring.
lower-case
The function fn:lower-case converts a string to lowercase.
upper-case
The function fn:upper-case converts a string to uppercase.
translate
The fn:translate function replaces selected characters in a string with replacement characters.
string
The function fn:string returns the string representation of a value.
string-length
The function fn:string-length returns the length of a string.
substring
The function fn:substring returns a substring of a string, based on a start position and a length. It is similar to the substr function in SQL.
substring-after
The function fn:substring-after returns the tail of the input string after the first occurrence of a given search string.
substring-before
The function fn:substring-before returns the beginning of the input string up to (but excluding) the first occurrence of a given search string.
tokenize
The function fn:tokenize breaks a string into a sequence of substrings.
normalize-space
The function fn:normalize-space strips leading and trailing whitespace characters from a string and replaces each internal sequence of whitespace characters with a single space character.
A simple example of the concat function is shown in Figure 8.40. Here, the concat function has four arguments. The first and third arguments are literal string values, while the second and fourth arguments are expressions based on the variable $i that is bound in the for clause.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
where $i/../@PoNum=5000
return concat("Order ",$i/../@PoNum," - Item ",$i/partid);

Order 5000 - Item 100-100-01
Order 5000 - Item 100-103-01

  2 record(s) selected.
Figure 8.40
Concatenation of string literals and expressions
Figure 8.41 demonstrates three string functions. The query uses the concat function to concatenate the values of the attributes PoNum and Status into a single string. In the second column it utilizes the string-join function to produce a list of partid values that are separated by semicolons. Note that the arguments of the concat function are single values while the first argument of the string-join function evaluates to a sequence of multiple elements. The contains function in the WHERE clause restricts the result set to purchase orders that have at least one item whose name contains the word “Super”.

SELECT XMLQUERY('$PORDER/PurchaseOrder/concat(@PoNum,@Status)')
         AS idstatus,
       XMLQUERY('string-join($PORDER/PurchaseOrder/item/partid,";")')
         AS items
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder/item[contains(name,"Super")]');

IDSTATUS           ITEMS
------------------ ------------------------------------------
5000Unshipped      100-100-01;100-103-01
5001Shipped        100-101-01;100-103-01;100-201-01
5004Shipped        100-100-01;100-103-01

  3 record(s) selected.

Figure 8.41
Query with three XQuery string functions
XQuery functions can be nested. The query in Figure 8.42 returns the name of an item from purchase order 5000, if the item name contains a comma and contains the word Basic after the comma. The function substring-after is the first argument of the contains function and produces the part of the name after the comma. Thus, the contains function is applied only to that second part of each item name.

SELECT XMLQUERY('$PORDER/PurchaseOrder/item[
                   contains(substring-after(name,","), "Basic")]/name')
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

<name>Snow Shovel, Basic 22 inch</name>

  1 record(s) selected.
Figure 8.42
Query with nested XQuery string functions
You can use the function tokenize to split a string into multiple smaller strings. For example, the query in Figure 8.43 splits the values of the partid elements based on the occurrences of the “-” character. The function returns the substrings as a sequence. Instead of using a single character to split the input string, you can also tokenize a string based on the occurrences of a substring or regular expression.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
  @PoNum=5000]/item/tokenize(partid,"-");

100
100
01
100
103
01

  6 record(s) selected.
Figure 8.43
Splitting a string into a sequence of separate items
Although the query in Figure 8.43 returns the tokenized substrings in separate rows, it can be more useful to return them in separate columns instead, which happens in Figure 8.44. The query in Figure 8.44 uses the XMLTABLE function to generate one row per order item. Each generated row has an INTEGER column called OrderNo and an XML column called partid. The INTEGER column contains the purchase order number (PoNum), and the XML column contains the sequence of substrings produced by the tokenize function. In the SELECT clause, this XML column is not returned as-is, but used as input to each of three XMLQUERY functions. They use positional predicates [1], [2], and [3], respectively, to obtain the first, second, and third token of the sequence separately.

SELECT T.orderno,
       XMLCAST(XMLQUERY('$PARTID[1]') as CHAR(3)) as id1,
       XMLCAST(XMLQUERY('$PARTID[2]') as CHAR(3)) as id2,
       XMLCAST(XMLQUERY('$PARTID[3]') as CHAR(3)) as id3
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         OrderNo INTEGER PATH '../@PoNum',
         partid  XML     PATH 'tokenize(partid,"-")') as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

ORDERNO     ID1 ID2 ID3
----------- --- --- ---
       5000 100 100 01
       5000 100 103 01

  2 record(s) selected.
Figure 8.44
Splitting a string into separate columns
We encourage you to try other string functions on your own. For example, use the translate function to change the delimiter in the partid values from 100-103-01 to 100/103/01. Or, use the starts-with function to find all items whose name begins with the word “Snow”.
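As a starting point for these two exercises, the queries below are one possible sketch against the same purchaseorder table. They follow the pattern of Figure 8.43; restricting the first query to purchase order 5000 is an arbitrary choice made here to keep the result small.

```xquery
-- Replace each "-" in the partid values of order 5000 with "/".
xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[@PoNum=5000]
  /item/translate(partid, "-", "/");

-- Return all item elements whose name begins with "Snow".
xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder
  /item[starts-with(name, "Snow")];
```

For purchase order 5000, whose partid values appear in Figure 8.40, the first query would be expected to return 100/100/01 and 100/103/01.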
8.7.2
Number and Aggregation Functions
Let’s turn to numeric XQuery functions, some of which are shown in Table 8.2.
Table 8.2
Commonly Used Number and Aggregation Functions
Numeric and Aggregation Functions
Description
sum
The function fn:sum returns the sum of the values in a sequence.
avg
The function fn:avg returns the average of the values in a sequence.
max
The function fn:max returns the maximum of the values in a sequence.
min
The function fn:min returns the minimum of the values in a sequence.
abs
The function fn:abs returns the absolute value of a numeric value.
round
The function fn:round returns the integer that is closest to the given numeric value.
Figure 8.45 shows two XQuery expressions with number and string functions. The first one returns the sum of item prices for each purchase order where the value of the Status attribute starts with “Ship”. For example, this includes orders where the status is Shipped or Shipping. A separate sum is computed for the items within each such purchase order. The second query computes the average item price across all orders that match the starts-with predicate. A single average value is computed for these orders, because the XPath expression that produces the sequence of price elements from all matching purchase orders is the argument of the avg function.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
  starts-with(@Status,"Ship")]/sum(item/price);

73.97
33.97
59.98
33.97

  4 record(s) selected.

xquery
avg(
  db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
    starts-with(@Status,"Ship")]/item/price
);

18.3536363636364

  1 record(s) selected.
Figure 8.45
The XQuery aggregation functions sum and avg
The same two queries can be coded in SQL/XML notation, as shown in Figure 8.46. They produce the same results as their counterparts in Figure 8.45. Note that the second SELECT statement in Figure 8.46 uses the SQL function AVG, not the XQuery function avg.
SELECT XMLQUERY('$PORDER/PurchaseOrder/sum(item/price)')
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[starts-with(@Status,"Ship")]');

SELECT AVG(T.itemprice)
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS itemprice DECIMAL(9,2) PATH 'price') AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[starts-with(@Status,"Ship")]');
Figure 8.46
The XQuery function sum and the SQL function AVG
In Figure 8.45 and Figure 8.46 you can replace the functions sum and avg with the function count to obtain the number of elements rather than the sum or average of their values. Try it out.
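For the two XQuery expressions in Figure 8.45, the substitution could look like the following sketch. Counting item elements in the first query and price elements in the second is one reasonable reading of the exercise, not the only one.

```xquery
-- Number of item elements in each purchase order whose Status starts with "Ship".
xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
  starts-with(@Status,"Ship")]/count(item);

-- Total number of price elements across all such purchase orders.
xquery
count(
  db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
    starts-with(@Status,"Ship")]/item/price
);
```

As with sum and avg, the position of count decides the cardinality: inside the path it produces one value per purchase order, and around the whole path it produces a single value.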
8.7.3
Sequence Functions
The count function is not a numeric function but a sequence function (see Table 8.3) because it counts the number of items in a sequence.
Table 8.3
Commonly Used Sequence Functions
Sequence Functions
Description
count
The function fn:count returns the number of items in a sequence.
data
The function fn:data returns the input sequence but replaces any nodes in the sequence with their values.
distinct-values
The function fn:distinct-values returns the distinct values in a sequence. It is similar to the SQL keyword DISTINCT.
deep-equal
The function fn:deep-equal compares two documents or sequences and returns true if they meet the requirements for deep equality. Roughly speaking, two documents or sequences are deep equal if every aspect of their structure, values, and data type is equal.
empty
The function fn:empty returns true if the argument is an empty sequence.
exactly-one
The function fn:exactly-one returns its argument if the argument contains exactly one item.
zero-or-one
The function fn:zero-or-one returns its argument if the argument contains one item or is an empty sequence.
one-or-more
The function fn:one-or-more returns its argument if the argument is a sequence of one or more items.
last
The function fn:last takes no parameters but returns the number of items in the sequence that is currently being processed. It is usually used in a positional predicate to return the last item in a sequence.
position
The function fn:position returns the position of the context item in the sequence that is currently being processed.
Figure 8.47 shows three examples that use sequence functions. The goal is to find all the different values that Status attributes in purchase orders can have. The first XQuery in Figure 8.47 returns the value of the Status attribute from all purchase orders in the purchaseorder table. It uses the function data to obtain the attribute values instead of the attribute nodes. The second XQuery uses the distinct-values function to retrieve unique Status values only. The result shows that the sample data contains two different spellings of the value Unshipped, one with lowercase s and one with uppercase S. To address this, the third XQuery uses the string function upper-case to convert all Status values to uppercase. The SQL/XML statement in Figure 8.48 produces the same result by using the SQL functions DISTINCT and UPPER.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
  /PurchaseOrder/data(@Status);

Unshipped
Shipped
Shipped
UnShipped
Shipped
Shipped

  6 record(s) selected.

xquery
distinct-values(db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
  /PurchaseOrder/@Status);

Unshipped
Shipped
UnShipped

  3 record(s) selected.

xquery
distinct-values(db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
  /PurchaseOrder/upper-case(@Status));

UNSHIPPED
SHIPPED

  2 record(s) selected.
Figure 8.47
Using the XQuery sequence functions data() and distinct-values()
SELECT DISTINCT(UPPER(T.stat))
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder'
       COLUMNS stat VARCHAR(15) PATH '@Status') AS T;
Figure 8.48
Using the SQL function DISTINCT
The SQL/XML statement in Figure 8.49 returns the first and the last item of purchase order 5000 in two separate columns of type XML. The function last(), with no argument, returns the number of items in the sequence and therefore points to the last item.

SELECT XMLQUERY('$PORDER/PurchaseOrder/item[1]'),
       XMLQUERY('$PORDER/PurchaseOrder/item[last()]')
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');
Figure 8.49
Positional predicates to obtain the first and last items
8.7.4
Namespace and Node Functions
Some commonly used namespace and node functions are listed in Table 8.4. The namespace functions are discussed in Chapter 15, Managing XML Data with Namespaces.
Table 8.4
Commonly Used Namespace and Node Functions
Namespace and Node Functions
Description
name
The function fn:name returns the name of a node, typically an element or attribute name. The returned name includes the namespace prefix of the node, if applicable.
local-name
The function fn:local-name returns the name of a node, but does not include a namespace prefix.
namespace-uri
The function fn:namespace-uri returns the namespace URI of the given node.
namespace-uri-for-prefix
The function fn:namespace-uri-for-prefix returns the namespace URI that is associated with a namespace prefix for an element.
in-scope-prefixes
The function fn:in-scope-prefixes returns a list of prefixes for all in-scope namespaces of an element.
The functions name and local-name are very powerful because they allow access to element and attribute names. In contrast, all previous queries in this chapter used element and attribute names only to get to their values. As an example, the XMLTABLE function in Figure 8.50 iterates over all the child elements of the item elements of purchase order 5000. For each child element it returns the element’s name and value together with the PoNum of the purchase order. Note that the row-generating expression ends with a wildcard that selects all child elements under item. The expressions 'local-name(.)' and '.' in the column definitions use the dot to refer to whatever the current child element is.

SELECT T.OrderNo, T.node, T.value
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item/*'
       COLUMNS
         OrderNo INTEGER     PATH '../../@PoNum',
         node    VARCHAR(10) PATH 'local-name(.)',
         value   VARCHAR(40) PATH '.'
     ) AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

ORDERNO    NODE       VALUE
---------- ---------- --------------------------------------
      5000 partid     100-100-01
      5000 name       Snow Shovel, Basic 22 inch
      5000 quantity   3
      5000 price      9.99
      5000 partid     100-103-01
      5000 name       Snow Shovel, Super Deluxe 26 inch
      5000 quantity   5
      5000 price      49.99

  8 record(s) selected.
Figure 8.50
Producing a list of element names and values
Similarly, you can use the function local-name to produce a list of all tags that occur in a given document. This is shown in Figure 8.51. The row-generating expression of the XMLTABLE function is //(*, @*). To understand what this means, remember that //* selects all elements at all levels of the document, and //@* selects all attributes at all levels of the document. In the expression //(*, @*) the parentheses and the comma construct a sequence that combines all elements and all attributes at all levels. In short, the row-generating expression produces all elements and attributes of the document. The column seq indicates the order in which the nodes appear in the document, and the column node produces their names. The column type indicates whether the node is an attribute, an element, or a leaf element. The if-then-else expression uses the node test self::attribute(), which evaluates to true if the node is an attribute. The else branch contains another if-then-else expression to check whether the current node has any element children. If yes, it must be an element itself. Otherwise it’s considered a leaf element.

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//(*, @*)'
       COLUMNS
         seq  FOR ORDINALITY,
         node VARCHAR(20) PATH 'local-name(.)',
         type VARCHAR(15) PATH 'if (self::attribute())
                                then "Attribute"
                                else (if (./*) then "Element"
                                      else "Leaf-Element")'
     ) AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

SEQ         NODE               TYPE
----------- ------------------ ------------------
          1 PurchaseOrder      Element
          2 PoNum              Attribute
          3 OrderDate          Attribute
          4 Status             Attribute
          5 item               Element
          6 partid             Leaf-Element
          7 name               Leaf-Element
          8 quantity           Leaf-Element
          9 price              Leaf-Element
         10 item               Element
         11 partid             Leaf-Element
         12 name               Leaf-Element
         13 quantity           Leaf-Element
         14 price              Leaf-Element

  14 record(s) selected.
Figure 8.51
Producing a list of all element and attribute names
8.7.5
Date and Time Functions
Some noteworthy date and time functions are listed in Table 8.5.
Table 8.5
Commonly Used Date and Time Functions
Date and Time Functions
Description
adjust-date-to-timezone
The function fn:adjust-date-to-timezone adjusts an xs:date value to a specific time zone, or removes the timezone component from the value. Similar functions exist for xs:time and xs:dateTime values.
current-date, current-time, current-dateTime
These functions return the current date, time, or date and time in the UTC timezone (UTC = Coordinated Universal Time, which is Greenwich Mean Time).
current-local-date, current-local-time, current-local-dateTime
These functions return the current date, time, or date and time in the local time zone of the operating system, without time zone indicator. (DB2 for Linux, UNIX, Windows, version 9.5 FP5, and 9.7 FP1.)
dateTime
The function fn:dateTime constructs an xs:dateTime value from an xs:date value and an xs:time value.
day-from-date
The function fn:day-from-date returns the day component of an xs:date value. Similar functions exist to extract the months or year from an xs:date value, or to extract the hours, minutes, seconds, or timezone from xs:time or xs:dataTime values.
An example of an SQL/XML query that manipulates dates is shown in Figure 8.52. The goal of the query is to list the identifier, order date, year, and age of all orders that are older than 90 days. Let’s look at the predicate in the WHERE clause first. The predicate selects all orders whose OrderDate attribute is less than the current date minus 90 days. The string literal P90D denotes a duration of 90 days. The P is the duration indicator, and 90D specifies the length of the duration. Similarly, the string P2DT5H45M could be used to denote a duration of 2 days, 5 hours, and 45 minutes. Any such duration string needs to be cast to the type xdt:dayTimeDuration to be interpreted as a duration and not as xs:string. This casting allows you to subtract the duration from the current date to produce a date in the past (90 days ago).

For each matching order, the XMLTABLE function in Figure 8.52 extracts the OrderDate, the year portion of the date, and the age of the order. The age is calculated by subtracting the current date from the order date. Subtraction of one date from another produces a duration. In this example, the returned durations are negative, because current-date() is always later than any existing OrderDate. The query result shows, for example, that purchase order 5000 was placed 1069 days before the date returned by current-date().
SELECT poid, CURRENT DATE as today, T.odate, T.year, T.age
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder'
       COLUMNS
         odate DATE     PATH '@OrderDate',
         year  CHAR(4)  PATH 'year-from-date(@OrderDate)',
         age   CHAR(15) PATH 'xs:date(@OrderDate) - current-date()'
     ) as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate <
                 current-date() - xdt:dayTimeDuration("P90D")]');

POID        TODAY      ODATE      YEAR AGE
----------- ---------- ---------- ---- ---------------
       5000 01/21/2009 02/18/2006 2006 -P1069D
       5001 01/21/2009 02/03/2005 2005 -P1449D
       5002 01/21/2009 02/29/2004 2004 -P1789D
       5003 01/21/2009 02/28/2005 2005 -P1424D
       5004 01/21/2009 11/18/2005 2005 -P1161D
       5006 01/21/2009 03/01/2006 2006 -P1058D

  6 record(s) selected.

Figure 8.52
Using date types and functions
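To make the duration notation concrete, here is a small Python sketch that parses day-time duration strings such as P90D or P2DT5H45M and subtracts them from a date. It is an illustration only and not how DB2 evaluates xdt:dayTimeDuration; the function name parse_day_time_duration is hypothetical, and the parser handles only whole days, hours, minutes, and seconds.

```python
import re
from datetime import date, timedelta

# Pattern for an optional sign, then P, days, and an optional T time part.
_DURATION = re.compile(
    r"^(?P<sign>-)?P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_day_time_duration(text):
    """Convert a duration string like 'P2DT5H45M' to a timedelta."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"not a day-time duration: {text!r}")
    keys = ("days", "hours", "minutes", "seconds")
    parts = {k: int(match.group(k)) for k in keys if match.group(k) is not None}
    if not parts:
        # 'P' or 'PT' without any component is not a valid duration.
        raise ValueError(f"duration has no components: {text!r}")
    delta = timedelta(**parts)
    return -delta if match.group("sign") else delta


if __name__ == "__main__":
    print(parse_day_time_duration("P2DT5H45M"))
    # A date 90 days before January 21, 2009.
    print(date(2009, 1, 21) - parse_day_time_duration("P90D"))
```

As in XQuery, a plain string does not behave like a duration: the subtraction works only after the string has been converted to a duration value.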
Note that current-date() produces the current date in UTC. If you live in California, where the local time is eight hours behind UTC, then from 4 p.m. onwards current-date() gives you tomorrow’s date. New functions to produce the local date and time are being added (refer to Table 8.5), but you can also use XQuery functions to adjust a date or a time to a given time zone, such as in the following query:

xquery adjust-date-to-timezone(current-date(), xdt:dayTimeDuration("-PT8H"));
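The effect of the UTC offset is easy to reproduce with Python's datetime module. The sketch below assumes a fixed offset of eight hours behind UTC and ignores daylight saving time; the names PACIFIC_STANDARD and utc_and_local_dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset of -08:00; an assumption that ignores daylight saving time.
PACIFIC_STANDARD = timezone(timedelta(hours=-8))


def utc_and_local_dates(local_moment):
    """Return the UTC date and the local date for a timezone-aware moment."""
    utc_moment = local_moment.astimezone(timezone.utc)
    return utc_moment.date(), local_moment.date()


if __name__ == "__main__":
    # 4 p.m. local time is already midnight of the next day in UTC.
    afternoon = datetime(2009, 1, 21, 16, 0, tzinfo=PACIFIC_STANDARD)
    utc_date, local_date = utc_and_local_dates(afternoon)
    print("UTC date:", utc_date, "local date:", local_date)
```

One minute earlier, at 3:59 p.m. local time, both dates are still the same.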
8.7.6
Boolean Functions
Finally, the XQuery Boolean functions are listed in Table 8.6. An example that uses the function fn:false appears in Figure 8.34 in section 8.5 of this chapter. The use of the function fn:not() was discussed in the context of XPath in section 6.9. Please refer to these sections for examples.

Table 8.6
Commonly Used Boolean Functions

Boolean Functions
    Description

not
    The function fn:not returns false if the effective Boolean value of a sequence is true, and true if the effective Boolean value of a sequence is false.

false
    The function fn:false returns the value false.

true
    The function fn:true returns the value true.
8.8
EMBEDDING SQL IN XQUERY
In section 6.5, How to Execute XPath in DB2, we explained how the function db2-fn:sqlquery lets you embed SQL in XPath queries. The same works in XQuery FLWOR expressions and it allows you to include relational predicates in your XQuery. You can even pass parameters from the outer XQuery to the embedded SQL statement. Remember that the embedded SQL statement has to return a single column of type XML. For the following examples, note that the table purchaseorder has several relational columns that contain values extracted from the XML document in the same row.

CREATE TABLE purchaseorder(poid BIGINT, status VARCHAR(10),
                           custid BIGINT, orderdate DATE,
                           porder XML);
An interesting pair of queries is shown in Figure 8.53. The first query is an SQL/XML statement that uses the XMLQUERY function in the SELECT clause to compute the sum of the item prices of any selected order. The WHERE clause restricts the result set to those orders in the table where the relational column status has the value Unshipped, the column orderdate has the value 2006-02-18, and the order information in the XML column contains at least one item with a price greater than 40. For each of these orders, the query computes the sum of all item prices.

The second query is a FLWOR expression that produces the same result from our sample data. Its input is defined by the function db2-fn:sqlquery, which produces the sequence of XML documents that are selected by the embedded SQL statement. This allows you to use relational predicates in an XQuery. The XQuery iterates with the for clause over the PurchaseOrder elements of these input documents. For each such element it evaluates the XML predicate on price and returns the sum of item prices for any matching order.

SELECT XMLQUERY('$PORDER/PurchaseOrder/sum(item/price)')
FROM purchaseorder
WHERE status = 'Unshipped'
  AND orderdate = '2006-02-18'
  AND XMLEXISTS('$PORDER/PurchaseOrder/item[price > 40]');

xquery
for $i in db2-fn:sqlquery("SELECT porder FROM purchaseorder
                           WHERE status = 'Unshipped'
                             AND orderdate = '2006-02-18'")/PurchaseOrder
where $i/item[price > 40]
return sum($i/item/price);
Figure 8.53
Two queries that produce the same result
There is typically no significant performance difference between the two queries in Figure 8.53. Both can use an XML index on /PurchaseOrder/item/price and relational indexes on status and orderdate at the same time.
Let’s extend the previous example slightly to illustrate parameter passing from XQuery to the enclosed SQL statement. Assume you want to return all orders that have the same shipping status and order date as the purchase order with number 5000. The XQuery in Figure 8.54 does that easily. It uses the for and where clauses to select purchase order 5000 and assign it to the variable $i. The return clause then produces the sequence of all orders where the relational columns status and orderdate have the same value as $i/@Status and $i/@OrderDate respectively. The functions parameter(1) and parameter(2) can only be used in SQL statements inside the db2-fn:sqlquery function. They refer to the XQuery expressions that are provided as additional arguments to the db2-fn:sqlquery function, according to the order in which they appear. That is, $i/@Status is bound to parameter(1) and $i/@OrderDate to parameter(2). Effectively, this is a self-join on the purchaseorder table.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder
where $i/@PoNum = 5000
return db2-fn:sqlquery("SELECT porder FROM purchaseorder
                        WHERE status = parameter(1)
                          AND orderdate = parameter(2)",
                       $i/@Status, $i/@OrderDate);
Figure 8.54
XQuery that contains an SQL statement with parameters
Figure 8.55 shows how you can code the same self-join in SQL/XML notation without any XQuery concepts beyond XPath. The FROM clause contains two references to the purchaseorder table, p1 and p2. The alias p1 is used in the XMLTABLE function to find purchase order 5000 and to extract Status and OrderDate from it. These generated relational columns are then joined with alias p2 in the WHERE clause to find all orders with the same status and date. The queries in Figure 8.54 and Figure 8.55 look very different from each other, but the DB2 query compiler generates the same execution plan for both.

SELECT p2.porder
FROM purchaseorder p1, purchaseorder p2,
     XMLTABLE('$po1/PurchaseOrder[@PoNum = 5000]'
       passing p1.porder as "po1"
       COLUMNS
         status    VARCHAR(10) PATH '@Status',
         orderdate DATE        PATH '@OrderDate'
     ) AS T
WHERE p2.status = T.status
  AND p2.orderdate = T.orderdate;
Figure 8.55
A different notation for the same self-join as in Figure 8.54
8.9
USING SQL FUNCTIONS AND USER-DEFINED FUNCTIONS IN XQUERY
There are many built-in SQL functions that are not part of the XQuery language. For example, functions such as sqrt (square root), rand (random number), or cos (cosine) are available as SQL functions in DB2 but they are not available as built-in XQuery functions. Additionally you might have developed your own user-defined functions (UDFs), either in the SQL Procedural Language (SQL PL) or in an external programming language such as Java or C. It is possible to use such functions from the SQL world within XQuery expressions. The trick is to use the db2-fn:sqlquery function to embed SQL functions in XQuery.

Assume that you have a legacy application that processes partid values, which are product identifiers, in a different format. For example, a partid such as 100-103-01 needs to be converted to 01(100)103. This is achieved by the UDF in Figure 8.56. It breaks a given partid into its three pieces and assembles them in a different way to meet the requirements of the legacy system.

CREATE FUNCTION convert(partid VARCHAR(15))
RETURNS VARCHAR(15)
BEGIN ATOMIC
  DECLARE p1, p2, p3, new VARCHAR(10) DEFAULT '';
  SET p1 = substr(partid,1,3);
  SET p2 = substr(partid,5,3);
  SET p3 = substr(partid,9,2);
  SET new = p3||'('||p1||')'||p2;
  RETURN new;
END#
Figure 8.56
User-defined function to convert product identifiers
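To check the substring positions used by the UDF, the same logic can be mirrored in a few lines of Python. This sketch is only an illustration of the string handling and is not part of the DB2 solution; note that SQL substr positions are 1-based while Python slices are 0-based.

```python
def convert(partid):
    """Rearrange a part ID such as '100-103-01' into '01(100)103'."""
    p1 = partid[0:3]   # substr(partid,1,3): characters 1 to 3
    p2 = partid[4:7]   # substr(partid,5,3): characters 5 to 7
    p3 = partid[8:10]  # substr(partid,9,2): characters 9 to 10
    return f"{p3}({p1}){p2}"


if __name__ == "__main__":
    for partid in ("100-100-01", "100-103-01"):
        print(partid, "->", convert(partid))
```

Both part IDs of purchase order 5000 are converted the same way as by the SQL function: 100-100-01 becomes 01(100)100 and 100-103-01 becomes 01(100)103.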
The FLWOR expression in Figure 8.57 uses this UDF in its let clause to convert every partid in purchase order 5000 to the different format. The db2-fn:sqlquery function contains an SQL statement, which in this case is simply a VALUES clause. Since the result of the embedded SQL statement must be of type XML, the XMLTEXT function is used to turn the VARCHAR result value of the function convert into an XML text node. The convert function takes a single parameter, which has to be cast to the input type of the function, that is, VARCHAR(15). The expression $i/partid provides the actual value that is passed into the convert function.
xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
let $new := db2-fn:sqlquery("
      VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
      $i/partid)
where $i/../@PoNum = 5000
return {$i/partid/text()}{$new};

100-100-0101(100)100
100-103-0101(100)103

  2 record(s) selected.
Figure 8.57
Using an SQL UDF within an XQuery
You can use the db2-fn:sqlquery function wherever built-in XQuery functions are allowed. Figure 8.58 gives you a couple of ideas. The first FLWOR expression uses the db2-fn:sqlquery function in the construction of the element new. Note that it has to be in curly brackets so that it gets properly evaluated and not treated as a literal string. The second XQuery uses db2-fn:sqlquery in a path expression. The XPath in the return clause is $i/PurchaseOrder/item/partid except that the db2-fn:sqlquery function is applied to the last step, partid.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
where $i/../@PoNum = 5000
return {$i/partid/text()} { db2-fn:sqlquery("
      VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
      $i/partid) };

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/@PoNum = 5000
return $i/PurchaseOrder/item/db2-fn:sqlquery("
      VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
      partid);
Figure 8.58
Further examples of using the db2-fn:sqlquery function

8.10
SUMMARY
XQuery is a powerful query language for XML data. XPath is a subset of the XQuery language and is used in every XQuery expression that accesses XML documents. Hence, XPath is a critical part of XQuery.

One of the most commonly used expressions in XQuery is the FLWOR expression, which is named after its keywords for, let, where, order by, and return. The for clause of a FLWOR expression lets you iterate over documents, elements, attributes, atomic values, or any sequence of items in the XQuery data model. In each iteration, a variable is assigned to the next item in the sequence for further manipulation. The let clause allows you to assign an entire sequence, such as an intermediate result, to a single variable. The where and order by clauses are used to filter and sort the result of the FLWOR expression. The result is then returned by the return clause, possibly with further manipulation. FLWOR expressions can express queries over sets of documents, perform joins across documents, and combine data from multiple XML documents or different parts of a single document into a query result.

Other important expressions in XQuery include constructor expressions, such as direct element and attribute constructors, which are used to create XML nodes and construct new XML documents within a query. Conditional expressions (if-then-else) allow for advanced logic. Additionally, XQuery supports cast expressions, arithmetic expressions, logical and comparison operators, and sequence and transform expressions. XQuery also offers a rich set of built-in functions, such as string functions, numeric functions, aggregation functions, and date and time functions.

Not every XML application requires XQuery. Many applications are well-served with the combined power of XPath and SQL. In fact, many queries in XQuery notation can also be expressed in SQL/XML with embedded XPath.
CHAPTER 9
Querying XML Data: Advanced Queries & Troubleshooting

In this chapter we discuss advanced XML query topics, common errors, and guidelines for avoiding performance pitfalls. The examples include both XQuery and SQL/XML queries. This chapter is organized along the following topics:

• Aggregation and grouping in XML queries (section 9.1)
• Joins between XML columns as well as joins between XML and relational data (section 9.2)
• XML queries with case-insensitive string predicates (section 9.3)
• Guidelines for avoiding common performance problems (section 9.4)
• Common errors in XML queries and how to resolve them (section 9.5)
9.1
AGGREGATION AND GROUPING OF XML DATA
The recommended and most efficient way to perform grouping and aggregation of XML data is to use the XMLTABLE function to extract XML values to relational columns, and then to apply the SQL GROUP BY clause and SQL aggregation functions to these columns. The XQuery 1.0 language by itself, specifically the FLWOR expression, does not have a GROUP BY clause. This shortcoming makes grouping more difficult in XQuery than SQL, although not entirely impossible. In the following we discuss grouping and aggregation queries that use the purchase order sample data as input. A sample document is shown in Figure 9.1.
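<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
  </item>
  <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>5</quantity>
    <price>49.99</price>
  </item>
</PurchaseOrder>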
Figure 9.1
Sample document in the purchaseorder table

9.1.1
Aggregation and Grouping Queries with XMLTABLE
As an example, let’s determine the number of purchase orders per year since 2004. This is done in Figure 9.2. The XMLTABLE function together with the year-from-date function produces a relational column year of type CHAR(4). This year column is then used in both the SELECT clause and in the GROUP BY clause, as you normally would with relational columns. The relational COUNT() function produces the desired aggregation. The XMLEXISTS predicate in the WHERE clause ensures that the query only looks at orders that were placed in 2004 or later.

SELECT year, COUNT(*) AS num_orders
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder'
       COLUMNS
         year CHAR(4) PATH 'year-from-date(@OrderDate)') AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate >= xs:date("2004-01-01")]')
GROUP BY year;

YEAR NUM_ORDERS
---- -----------
2004           1
2005           3
2006           2

  3 record(s) selected.
Figure 9.2
Using SQL group by and aggregation on extracted XML values
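The extract-then-group pattern can be pictured with a few lines of Python: first pull the year out of each order date, then count per year. This is only a conceptual sketch and not how DB2 executes the query. The order dates are taken from the result shown in Figure 8.52; the function name orders_per_year is hypothetical.

```python
from collections import Counter
from datetime import date

# Order dates as listed in the ODATE column of Figure 8.52.
ORDER_DATES = [
    date(2006, 2, 18), date(2005, 2, 3), date(2004, 2, 29),
    date(2005, 2, 28), date(2005, 11, 18), date(2006, 3, 1),
]


def orders_per_year(order_dates, since):
    """Filter first (WHERE), then extract the year and count (GROUP BY)."""
    counts = Counter(d.year for d in order_dates if d >= since)
    return sorted(counts.items())


if __name__ == "__main__":
    for year, num_orders in orders_per_year(ORDER_DATES, date(2004, 1, 1)):
        print(year, num_orders)
```

The filter corresponds to the XMLEXISTS predicate, the year extraction to the XMLTABLE column, and the counting to GROUP BY with COUNT(*).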
This pattern of writing XML queries has been found very useful. The XMLTABLE function raises selected values from the XML level to the SQL level, and then you can apply SQL functions and groupings to these values as you normally do in purely relational queries. Let’s apply this pattern to another business question. What is the total value of shipped and unshipped items that were ordered in 2006? The answer is computed by the query in Figure 9.3. To write this query, you might want to start with the WHERE clause to restrict the orders to 2006. The path expression in the XMLEXISTS predicate navigates to the OrderDate attribute and checks whether it is greater than or equal to the first day of 2006, and less than or equal to the last day of 2006. Note that both dots in the predicate refer to the OrderDate attribute, which is the current node in the navigation.

NOTE
In the XMLEXISTS predicate, don’t use the year-from-date function to restrict the orders to 2006 because that function would prevent the use of an XML index that might exist on the OrderDate attribute.

While the WHERE clause takes care of the filtering, the XMLTABLE function extracts the data items needed to aggregate the value of shipped and unshipped items. For each item in an order it produces one row with the item price, quantity, and shipping status. This allows you to use SQL concepts to group by the status and to sum the item values. The value of an item in an order is the item price multiplied by its quantity.

SELECT orderstatus, SUM(itemprice * itemqty) AS value
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         orderstatus VARCHAR(10)  PATH 'upper-case(../@Status)',
         itemprice   DECIMAL(9,2) PATH 'price',
         itemqty     INTEGER      PATH 'quantity') AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder/@OrderDate[ . >= xs:date("2006-01-01") and . = xs:date("2005-01-01") and . xs:date("2006-01-01") and item/price >= 20 and item/price < 30 ]');
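<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped"><item><partid>100-100-01</partid><name>Snow Shovel, Basic 22 inch</name><quantity>3</quantity><price>9.99</price></item><item><partid>100-103-01</partid><name>Snow Shovel, Super Deluxe 26 inch</name><quantity>5</quantity><price>49.99</price></item></PurchaseOrder>

  1 record(s) selected.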
Figure 9.30
Wrong way to write a between predicate
Both SQL/XML statements in Figure 9.31 write the “between” condition correctly and ensure that both range predicates are applied to the same item price. In the expression item/price[. >= 20 and . < 30], both dots refer to the same price element. Hence, this query selects orders that have at least one item with at least one price element whose value is indeed between 20 and 30. (No such order exists in the sample database.) Based on this notation, DB2 knows that both range predicates are always applied to the same XML node. This allows DB2 to evaluate both predicates with a single start-stop scan (start at 20, stop at 30) over an XML index defined on the price element.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate > xs:date("2006-01-01")
                 and item/price[. >= 20 and . < 30]]');

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate > xs:date("2006-01-01")]
                 /item/price[. >= 20 and . < 30]');

  0 record(s) selected.
Figure 9.31
Correct way to write a between predicate
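The difference between the two notations comes down to existential semantics: a general comparison on a sequence is true if any member of the sequence satisfies it. The following Python sketch imitates this with any(). It is a simplified model of the predicate logic only, with hypothetical function names, and uses the two item prices of the sample order.

```python
def wrong_between(prices, low, high):
    """Two independent existential tests, as in
    item/price >= low and item/price < high."""
    return any(p >= low for p in prices) and any(p < high for p in prices)


def correct_between(prices, low, high):
    """Both tests on the same price, as in
    item/price[. >= low and . < high]."""
    return any(low <= p < high for p in prices)


if __name__ == "__main__":
    prices = [9.99, 49.99]  # item prices of the sample purchase order
    print("wrong notation matches:  ", wrong_between(prices, 20, 30))
    print("correct notation matches:", correct_between(prices, 20, 30))
```

With the prices 9.99 and 49.99, the first form matches because 49.99 satisfies the lower bound and 9.99 satisfies the upper bound, although no single price lies between 20 and 30.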
If each item element has at most one price element, then the expression item[price >= 20 and price < 30] also selects the correct query result. However, DB2 does not know that each item has at most one price and therefore cannot apply a single start-stop index scan. Instead, DB2 has to use two separate index scans plus an index ANDing operator to combine the result (see Table 9.1). This is less efficient. Therefore it is always recommended to write “between” predicates with the “dot” (current context), as shown in Figure 9.31. Further details on XML index usage and execution plans are provided in Chapters 13 and 14.

Table 9.1
Optimal (left) and Suboptimal Execution Plan (right)
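Left: price[. >= 20 and . < 30]
Right: [price >= 20 and price < 30]

[Execution plan diagrams. Left plan: RETURN, NLJOIN, FETCH and XSCAN, RIDSCN, SORT, and a single XISCAN over the index with start key 20 and stop key 30, on TABLE: purchaseorder. Right plan: two separate index scans combined by index ANDing.]

9.4.3
Large Global Sequences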
Figure 9.32 provides another example of how you should not write queries. The idea of this query comes from a real XML application, but is changed here to fit the purchase order data. The query starts with a let clause and assigns the sequence of all purchase order items in the table to the variable $allitems. This is the first of multiple problems in this query. Unless the table is tiny, the sequence in $allitems is typically very large. Using let to combine items from all (or many) documents in the entire table often results in suboptimal performance. The next step of the query, for $pid…, iterates over the distinct partid values of all the item elements in the sequence $allitems. For each distinct partid it returns a constructed XML element prod_info that contains the partid (produced by $pid) as well as the name and the price of the item.

Note how the name and the price are obtained for each distinct partid; that is, for each value of $pid. The variable $pid is used to probe back into the sequence $allitems to find all items with a matching partid. This probe happens in the predicate $allitems[partid = $pid]. The same is done for price. This coding is not straightforward, needlessly complex, and bad for performance. In particular, the big sequence $allitems is a large temporary object and not indexed. Hence, the predicates in the return clause ([partid = $pid]) both require a sequential scan over all items in all purchase orders, for each $pid. An analogy in the relational world would be a query that copies all rows from a table to a temporary table, then performs a “select distinct” on that table to obtain a set of keys, and then a table scan on the temp table for each of these keys.

xquery
let $allitems := (
      for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
      return $i/PurchaseOrder/item )
for $pid in distinct-values($allitems/partid)
order by $pid
return {distinct-values($allitems[partid = $pid]/name)} {distinct-values($allitems[partid = $pid]/price)} ;
Figure 9.32
Expensive usage of large sequences
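The cost pattern of this query can be imitated in Python. The sketch below contrasts probing back into one big list for every distinct key with a single pass that removes duplicates as it goes. It is a conceptual model and not DB2's execution strategy; the function names are hypothetical, and the third item is an assumed duplicate added so that the distinct step has something to do.

```python
# Items from all orders; the third entry is an assumed duplicate.
ALL_ITEMS = [
    {"partid": "100-100-01", "name": "Snow Shovel, Basic 22 inch", "price": 9.99},
    {"partid": "100-103-01", "name": "Snow Shovel, Super Deluxe 26 inch", "price": 49.99},
    {"partid": "100-100-01", "name": "Snow Shovel, Basic 22 inch", "price": 9.99},
]


def distinct_by_probing(items):
    """For each distinct key, scan the whole list again (twice)."""
    result = []
    for pid in sorted({item["partid"] for item in items}):
        names = sorted({i["name"] for i in items if i["partid"] == pid})
        prices = sorted({i["price"] for i in items if i["partid"] == pid})
        result.append((pid, names, prices))
    return result


def distinct_in_one_pass(items):
    """Produce one tuple per item and remove duplicates in a single pass."""
    return sorted({(i["partid"], i["name"], i["price"]) for i in items})


if __name__ == "__main__":
    print(distinct_by_probing(ALL_ITEMS))
    print(distinct_in_one_pass(ALL_ITEMS))
```

The probing version touches every item twice for each distinct partid, so its work grows with the number of items multiplied by the number of distinct keys, while the single pass touches each item once.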
The result of the query in Figure 9.32 is simply the partid, name, and price for all distinct items that occur in the purchase orders. The same result can be computed in a much easier way, as shown in Figure 9.33. This query simply generates one tuple for each item element and uses the SQL keyword DISTINCT to remove duplicates. In the original application, the performance improved by two orders of magnitude. The rewritten query is also easier to understand.

SELECT distinct T.pid, T.name, T.price
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         pid   VARCHAR(10)  PATH 'partid',
         name  VARCHAR(50)  PATH 'name',
         price DECIMAL(9,2) PATH 'price') as T;
Figure 9.33
Rewritten query avoids large intermediate sequences

9.4.4
Multilevel Nesting SQL and XQuery
A general guideline is to introduce only as much complexity in your queries as you really need. For example, it is certainly possible to have an XQuery with an embedded SQL statement that has an embedded XQuery, and so on. But experience shows that nesting the two languages more than one level deep is usually not needed to express the desired query logic. Therefore, we recommend using only one level of embedding XQuery into SQL or vice versa. As a result, queries are easier to understand and to maintain, and often also easier to optimize and execute for DB2.

Figure 9.34 shows an example of an XQuery with an embedded SQL statement, which in turn has embedded XQuery expressions in the XMLQUERY function and XMLEXISTS predicate. The embedded SQL statement produces the purchase order elements from all orders that belong to customer 1001 and whose PoNum attribute has the value 5002. For those orders, the XQuery checks whether the Status is Shipped and returns all order items in a newly constructed element POitems. Using XQuery within the SQL statement and around the SQL statement is needlessly complex.

xquery
for $i in db2-fn:sqlquery("
      SELECT XMLQUERY('$PORDER/PurchaseOrder')
      FROM purchaseorder
      WHERE custid = 1001
        AND XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5002]') ")
where $i[@Status="Shipped"]
return <POitems>{$i/item}</POitems>;
Figure 9.34
Unnecessary double-nesting of XQuery and SQL
To simplify the query in Figure 9.34, you can choose to either have all XML manipulation outside of the SQL query or all XML manipulation embedded within the SQL query. Both options are demonstrated in Figure 9.35. In the first query in Figure 9.35, all XML operations are pulled out of the SQL statement and into the surrounding XQuery. In the second query, all XML operations are pushed from the surrounding XQuery into the SQL statement.

xquery
for $i in db2-fn:sqlquery("SELECT porder FROM purchaseorder
                           WHERE custid = 1001")
where $i/PurchaseOrder[@PoNum = 5002 and @Status="Shipped"]
return <POitems>{$i/PurchaseOrder/item}</POitems>;

SELECT XMLQUERY('<POitems>{$PORDER/PurchaseOrder/item}</POitems>')
FROM purchaseorder
WHERE custid = 1001
  AND XMLEXISTS('$PORDER/PurchaseOrder[@PoNum = 5002 and @Status="Shipped"]');
Figure 9.35
Two simpler versions of the query in Figure 9.34

9.5
COMMON ERRORS AND HOW TO AVOID THEM
This section lists some common error messages that you might encounter when you run XML queries. We discuss probable causes and ways to resolve the problems. DB2 has more than 250 XML-related error messages and we cannot discuss all of them here. Additionally, a specific error message might have multiple different causes and we cannot describe all of them in this section. Therefore we look at a few select queries, their errors, and how to fix them.
Error messages related to XML processing have numbers in the 16000-range of messages and SQL Codes. That is, the SQL Codes related to XML processing errors are -16000, -16001, -16002, and so on. This is the same in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Additionally, in DB2 for Linux, UNIX, and Windows the error messages for these SQL Codes are numbered SQL16000N, SQL16001N, SQL16002N, and so on. Each error message raised by a faulty XML query also contains an error code, such as err:XPDY0002, which is the error code defined by the W3C. These error codes are listed at http://www.w3.org/2005/xqt-errors/, and you can also search for them in the DB2 information center.
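When you collect such messages in scripts or logs, the numbering scheme described above makes them easy to take apart. The following Python sketch extracts the SQL Code, the W3C error QName, and the SQLSTATE from a message text. The function name parse_xml_error is hypothetical, and the regular expressions assume the message layout shown in the figures of this section.

```python
import re


def parse_xml_error(message):
    """Split a message such as 'SQL16001N ... QName=err:XPDY0002. SQLSTATE=10501'."""
    ident = re.match(r"SQL(\d{5})N", message)
    if ident is None:
        raise ValueError("message does not start with an SQLnnnnnN identifier")
    qname = re.search(r"QName=(err:\w+)", message)
    state = re.search(r"SQLSTATE=(\w+)", message)
    return {
        # SQL16001N corresponds to SQL Code -16001.
        "sqlcode": -int(ident.group(1)),
        "qname": qname.group(1) if qname else None,
        "sqlstate": state.group(1) if state else None,
    }


if __name__ == "__main__":
    text = ('SQL16001N An XQuery expression starting with token "INFO" cannot '
            'be processed because the focus component of the dynamic context '
            'has not been assigned. Error QName=err:XPDY0002. SQLSTATE=10501')
    print(parse_xml_error(text))
```

The QName returned here is the value to look up in the W3C error list or in the DB2 information center.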
9.5.1
SQL16001N

Figure 9.36 and Figure 9.37 show queries that fail at compile time with error SQL16001N, which indicates that an XPath or XQuery expression does not have a context; that is, the path does not have a proper starting point. In Figure 9.36, INFO is not a valid context, because the XML column name is only recognized if coded as a variable that starts with a $ sign ($INFO).

SELECT info
FROM customer
WHERE XMLEXISTS('INFO/customerinfo[name="Matt Foreman"]');

SQL16001N An XQuery expression starting with token "INFO" cannot be processed because the focus component of the dynamic context has not been assigned. Error QName=err:XPDY0002. SQLSTATE=10501
Figure 9.36
Use $INFO instead of INFO to avoid this error
In Figure 9.37, the path in the return clause starts with /addr, but no context is provided to indicate from where this expression should navigate to the addr element. The correct coding in this query is $c/addr instead of /addr.

xquery
for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return /addr[@country];

SQL16001N An XQuery expression starting with token "/" cannot be processed because the focus component of the dynamic context has not been assigned. Error QName=err:XPDY0002. SQLSTATE=10501
Figure 9.37
The path in the return clause should start with $c
9.5.2
SQL16002N

The error SQL16002N happens at compile time whenever the query parser encounters a keyword or symbol that is unexpected or not recognized. This can happen in many different cases. The query in Figure 9.38 fails because the uppercase keyword FOR is not valid. It has to be lowercase.

xquery
FOR $d IN db2-fn:xmlcolumn ("customer.info")/customerinfo
RETURN $d;

SQL16002N An XQuery expression has an unexpected token "d" following "FOR $". Expected tokens may include: "". Error QName=err:XPST0003. SQLSTATE=10505
Figure 9.38
The keywords for, in, and return must be lowercase
In Figure 9.39, the expression $INFO/customerinfo/ must not end with a slash (/). The slash starts another step in the XPath expression and must be followed by an element name, attribute name, wildcard (*), function name, and so on. Hence the empty string "" after the / is not expected.

SELECT T.*
FROM customer,
     XMLTABLE('$INFO/customerinfo/'
       COLUMNS
         name VARCHAR(20) PATH 'name',
         city VARCHAR(20) PATH 'addr/city' ) as T;

SQL16002N An XQuery expression has an unexpected token "" following "$INFO/customerinfo". Expected tokens may include: "".
Figure 9.39
To avoid this error remove the / after customerinfo
Furthermore, a slash cannot be followed by the square bracket that begins a predicate. Therefore the square bracket in Figure 9.40 causes error SQL16002N.

SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/[addr/city = "Aurora"]');

SQL16002N An XQuery expression has an unexpected token "[" following "tomerinfo/". Expected tokens may include: "".
Figure 9.40
A predicate must not be preceded by a slash (/)
9.5.3
SQL16003N

Error SQL16003N happens during query execution; that is, at runtime and not at compile time. It indicates that DB2 has encountered a value of a certain data type that is not valid in this situation. The query in Figure 9.41 fails because a sequence of multiple phone elements cannot be cast to a single SQL value. In this error message, the notation ( item(), item()+ ) is a regular expression that represents a sequence of one item followed by one or more items. In total that’s two or more items, but only a single item is allowed here.

SELECT T.*
FROM customer,
     XMLTABLE('$INFO/customerinfo'
       COLUMNS
         custname VARCHAR(20) PATH 'name',
         phone    VARCHAR(15) PATH 'phone') AS T;

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "VARCHAR_15" is expected in the context. Error QName=err:XPTY0004. SQLSTATE=10507
Figure 9.41
Cannot cast multiple phone numbers to a single VARCHAR value
Figure 9.42 shows a query that fails because it tries to compare a value of type xs:date with the value "2006-02-18Z" of type xs:string, which is not allowed.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/xs:date(@OrderDate) = "2006-02-18Z"
return $i;

SQL16003N An expression of data type "xs:string" cannot be used when the data type "xs:date" is expected in the context. Error QName=err:XPTY0004. SQLSTATE=10507
Figure 9.42
The string literal “2006-02-18Z” must be cast to xs:date
9.5.4
SQL16005N

The query in Figure 9.43 references a variable $c that has not been properly introduced. Normally, variables are introduced by assignment in a for or a let clause. Here, the for clause defines the variable $b, which should be used instead of $c in the return clause.

xquery
for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return $c/name;

SQL16005N An XQuery expression references an element name, attribute name, type name, function name, namespace prefix, or variable name "c" that is not defined within the static context. Error QName=err:XPST0008. SQLSTATE=10506
Figure 9.43
The variable $c has not been introduced
Figure 9.44 demonstrates a trickier case. The query tries to return a sequence of name and addr elements, but it lacks parentheses. The expression return ($b/name, $b/addr) is correct and avoids the error. The error message claims that the variable $b is not known. Clearly, $b has been defined in the for clause, so the error is seemingly misleading or even wrong.
Chapter 9
Querying XML Data: Advanced Queries & Troubleshooting
xquery
for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return $b/name, $b/addr;

SQL16005N An XQuery expression references an element name, attribute name, type name, function name, namespace prefix, or variable name "b" that is not defined within the static context. Error QName=err:XPST0008. SQLSTATE=10506
Figure 9.44
Missing parentheses in the return clause
However, the error message in Figure 9.44 is correct. The comma in the return clause is the XQuery comma operator, which constructs sequences. It has the lowest precedence of all operators. Hence, the XQuery expression in Figure 9.44 defines a sequence of two expressions, which are
• for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo return $b/name
• $b/addr
In the first expression, $b is properly introduced in the for clause. In the second expression, $b is not defined, which causes the error message. If you change the return clause to return ($b/name, $b/addr), the parentheses ensure that the comma operator only applies to $b/name and $b/addr, and both of these expressions refer to $b defined in the for clause. The use of parentheses here is similar to parentheses in arithmetic, such as 3 * (2 + 3), where they cause the + operator to be evaluated before the multiplication operator.
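The same scoping pitfall can be reproduced outside XQuery. The following sketch is only a loose Python analogy, not DB2 code: a list comprehension stands in for the FLWOR expression, and the customers list and both function names are made up for illustration. As in Figure 9.44, the comma binds more loosely than the iteration, so the second expression falls outside the scope of the loop variable.

```python
# Hypothetical sample data standing in for the customerinfo documents.
customers = [
    {"name": "Jim Noodle", "addr": "25 EastCreek"},
    {"name": "Robert Shoemaker", "addr": "845 Kean Street"},
]


def without_parentheses():
    # Parsed as ([... for b in customers], b["addr"]): the second item of the
    # tuple lies outside the comprehension, so b is not defined there.
    return [b["name"] for b in customers], b["addr"]


def with_parentheses():
    # The parentheses keep both expressions inside the scope of b.
    return [(b["name"], b["addr"]) for b in customers]


try:
    without_parentheses()
    outcome = "no error"
except NameError as exc:
    outcome = f"NameError: {exc}"

pairs = with_parentheses()
```

In this sketch the first function fails with a NameError for b, which mirrors the "variable name b is not defined" message, while the parenthesized form yields one name and address pair per customer.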
9.5.5
SQL16015N
When you construct elements with a direct element constructor, and you include a sequence of expressions that provide the child nodes, attributes (if any) must come before elements in this sequence.
xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return <info>{$i/name}{$i/@Cid}</info>;

SQL16015N An element constructor contains an attribute node named "Cid" that follows an XQuery node that is not an attribute node. QName=err:XQTY0024. SQLSTATE=10507
Figure 9.45
Within a constructed element, attributes must be first
The error in Figure 9.45 is avoided if you construct the info element as return <info>{$i/@Cid}{$i/name}</info>;
or as return
>-+------------+-------------------------------------------------->
>--copy----$VariableName--:=--CopySourceExpression-+--------------->
>--modify--ModifyExpression---------------------------------------->
>--return--ReturnExpression----------------------------------------|
Figure 12.5
High-level syntax of the transform expression
Such XML modifications can be performed in an SQL UPDATE statement, in a query, or as part of an INSERT statement (Figure 12.6). If you modify a document in a query, the query reads the document from an XML column, changes it on the fly, and returns the modified document to the application. This leaves the original version of the document in the DB2 table unchanged. If you modify a document in an UPDATE statement, you make a permanent change to the data that is stored in DB2. Such an UPDATE is logged in the DB2 transaction log and subject to all the transaction management concepts that also apply to relational updates, such as commit, rollback, and recovery, when applicable. Concurrency control (locking) and logging happen at the full document level. You can also modify a new document at insert time if you include an XQuery transform expression in an SQL INSERT statement.
Chapter 12
Updating and Transforming XML Documents
[Figure 12.6 diagram, three panels: (1) Updating a returned document upon retrieval: modify a document as part of a query; the original document in the database is not changed. (2) Updating a stored document: make a permanent change to a document in the database; this UPDATE is logged. (3) Updating a new document upon insert: modify a new document during INSERT; the modified document is inserted and logged.]
Figure 12.6
Three ways of modifying XML documents
The concepts of changing XML element or attribute values, inserting new elements, renaming elements, and so on are independent from whether you do this in an UPDATE statement, in a query, or in an INSERT statement. The following sections describe the capabilities of the XQuery transform expressions and their usage in SQL UPDATE statements. Sections 12.10 and 12.11 then show how the same document modifications can be performed in queries and INSERT statements.
12.3
UPDATING THE VALUE OF AN XML NODE IN A DOCUMENT
A simple and common kind of XML update is to change the value of a specific element or attribute node in an XML document.
12.3.1
Replacing an Element Value
As an example, assume you have to update the address of a customer to change the value of the street element to “43 WestCreek”. Figure 12.7 shows the original document and the desired updated document.
Original document:
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258
Updated document:
Jim Noodle 43 WestCreek Markham Ontario N9C 3T6 905-555-7258
Figure 12.7
Changing the value of an element
The UPDATE statement that performs the desired modification of the document is shown in Figure 12.8. It assumes that the document to be updated resides in the info column of the customer table in a row with the relational cid value 1002. The SET clause of the UPDATE statement assigns a new value to the XML column info. This new value is produced by the XMLQUERY function, which contains an XQuery transform expression. The copy clause refers to the original XML column value ($INFO), and assigns the original document to the variable $mycust. Subsequently, the modify clause manipulates this variable. The modify clause contains the update operation replace value of to replace the value of the element street with the new string literal “43 WestCreek”. Finally, the variable $mycust, which contains the modified document, is returned in the return clause of the transform expression.
UPDATE customer
SET info = XMLQUERY('
      transform
      copy $mycust := $INFO
      modify do replace value of $mycust/customerinfo/addr/street with "43 WestCreek"
      return $mycust ')
WHERE cid = 1002
Figure 12.8
Update statement to replace the value of an element
In Figure 12.8 and many other typical update cases, the right side of the copy clause is just the variable that refers to the original document, in this case $INFO. The right side of the copy clause could be a more complex expression, but it must always evaluate to a single node. It cannot be an empty sequence or a sequence of more than one item. This single node can have descendants, which means it can be (and often is) the root of a full XML document. In many update examples you will also see that the return clause simply returns the variable that holds the modified document. However, the return clause could contain a more complex expression, including element construction or a FLWOR expression. Updates with more complex expressions in the copy and the return clauses are discussed in section 12.10. Since the transform keyword is optional, it is omitted from here on.
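The copy, modify, and return steps can be pictured with a small sketch. This is a Python analogy built on the standard xml.etree.ElementTree module, not DB2 functionality; the function names transform and set_street and the shortened sample document are assumptions made for illustration. It shows two points from the text: the copy source must be exactly one node, and the modification is applied to the copy rather than to the source.

```python
import copy
import xml.etree.ElementTree as ET


def transform(source_nodes, modify):
    """Mimic copy/modify/return: copy one node, change the copy, return it."""
    # The copy source must be exactly one node, not zero and not several.
    if len(source_nodes) != 1:
        raise ValueError("copy source must evaluate to a single node")
    working_copy = copy.deepcopy(source_nodes[0])  # plays the role of copy
    modify(working_copy)                           # plays the role of modify
    return working_copy                            # plays the role of return


def set_street(doc):
    # Stands in for: replace value of .../addr/street with "43 WestCreek"
    doc.find("addr/street").text = "43 WestCreek"


original = ET.fromstring(
    "<customerinfo><name>Jim Noodle</name>"
    "<addr><street>25 EastCreek</street></addr></customerinfo>"
)
updated = transform([original], set_street)
```

After the call, original still holds the old street value and only updated carries the new one. In DB2 the change becomes permanent only because the UPDATE statement assigns the returned document back to the info column.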
12.3.2
Replacing an Attribute Value
Replacing an attribute value is just as easy as replacing an element value. The UPDATE statement in Figure 12.9 changes the Cid attribute to the new value 1099. The entire UPDATE statement is the same as in Figure 12.8 except that the path to the target node and the new value are different. The literal value 1099 could be in double quotes but does not have to be because it can be interpreted as a number.
UPDATE customer
SET info = XMLQUERY('
      copy $mycust := $INFO
      modify do replace value of $mycust/customerinfo/@Cid with 1099
      return $mycust ')
WHERE cid = 1002
Figure 12.9
Replacing the value of an attribute
12.3.3
Replacing a Value Using a Parameter Marker
Often you will want to prepare and compile an UPDATE statement only once, and then pass in a new value every time you execute it. This avoids recompiling the statement in the database server for each execution. The mechanism to use parameters is the same as for SQL/XML queries. The PASSING clause of the XMLQUERY function allows you to pass a SQL-style parameter marker (“?”) as a variable ($z) into the XQuery expression (Figure 12.10). Note that XQuery variables are case-sensitive. For example, $z and $Z are not the same. The statement in Figure 12.10 also uses a parameter marker in the WHERE clause to select the row to be updated.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace value of $newinfo/customerinfo/phone with $z
      return $newinfo'
    PASSING CAST(? AS VARCHAR(15)) AS "z")
WHERE cid = ?
Figure 12.10
Updating XML values with parameter markers
You can run the UPDATE statement in Figure 12.10 from an application, such as a Java program. You would use JDBC statements to prepare and compile the statement, bind a value from an application variable to the parameter marker, and then execute the statement.
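The text names Java and JDBC; the same binding pattern can be sketched in Python. The sketch assumes a DB-API 2.0 driver that accepts question-mark parameter markers, which is an assumption about your environment rather than something this chapter specifies. The names update_phone and RecordingCursor and the sample phone number are hypothetical; RecordingCursor only records the call so that the sketch runs without a database.

```python
UPDATE_PHONE_SQL = """
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace value of $newinfo/customerinfo/phone with $z
      return $newinfo'
    PASSING CAST(? AS VARCHAR(15)) AS "z")
WHERE cid = ?
"""


def update_phone(cursor, cid, new_phone):
    """Bind values to the two parameter markers, in order of appearance."""
    # First marker: the new phone value; second marker: the row to update.
    cursor.execute(UPDATE_PHONE_SQL, (new_phone, cid))


class RecordingCursor:
    """Stand-in cursor so the sketch runs without a database connection."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params):
        self.calls.append((sql, params))


cursor = RecordingCursor()
update_phone(cursor, 1002, "905-555-1234")  # made-up phone number
```

The point of the sketch is the order of the bound values: they follow the order of the markers in the statement text, so the phone value comes before the cid value even though the WHERE clause is what selects the row.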
12.3.4
Replacing Multiple Values in a Document
You can update multiple values in the same document in a single UPDATE statement. Figure 12.11 illustrates that the modify clause allows for a comma-separated list of update operations. The entire list is enclosed in parentheses. This enables you to easily combine two or more update operations in a single statement.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify (do replace value of $newinfo/customerinfo/addr/street with "85 Leicester Rd",
              do replace value of $newinfo/customerinfo/addr/pcode-zip with "W7B 8X1")
      return $newinfo ')
WHERE cid = 1002
Original document:
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258
Updated document:
Jim Noodle 85 Leicester Rd Markham Ontario W7B 8X1 905-555-7258
Figure 12.11
Updating multiple values in a single UPDATE statement
If you want to update multiple values in a single UPDATE statement and use parameter markers for all values, the PASSING clause of the XMLQUERY function needs to contain a list of typed parameter markers together with the variable names that refer to them (see Figure 12.12).
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify (do replace value of $newinfo/customerinfo/addr/street with $str,
              do replace value of $newinfo/customerinfo/addr/pcode-zip with $zip)
      return $newinfo'
    PASSING CAST(? AS VARCHAR(30)) AS "str",
            CAST(? AS VARCHAR(10)) AS "zip")
WHERE cid = 1002
Figure 12.12
Updating multiple values with parameter markers
12.3.5
Replacing an Existing Value with a Computed Value
The value that you use to update an existing element or attribute does not necessarily have to be a fixed value but can be computed based on the existing values in the document. For example, assume that the customer documents can contain an element numorders that tracks the total number of orders that a customer has placed. The UPDATE statement in Figure 12.13 increments the value of the element numorders by 1.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace value of $newinfo/customerinfo/numorders
                with $newinfo/customerinfo/numorders + 1
      return $newinfo ')
WHERE cid = 1002
Original document:
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 16
Updated document:
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 17
Figure 12.13
Incrementing the numeric value of an element
Similarly, the UPDATE statement in Figure 12.14 modifies the value of the element street by appending an apartment number. It uses the XQuery function concat.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace value of $newinfo/customerinfo/addr/street
                with concat($newinfo/customerinfo/addr/street, " Apt #4")
      return $newinfo ')
WHERE cid = 1002
Figure 12.14
Appending an apartment number to the street
If you write more elaborate updates, you might find it tedious to repeat a long path such as $newinfo/customerinfo/addr/street whenever you reference an existing node in the document. Figure 12.15 uses a let clause to assign this long path to the variable $s. Subsequently, the do replace value clause uses $s multiple times instead of repeating the long path. Note that the modify clause contains a FLWOR expression that only consists of the let and the return clause while the for, where, and order by clauses are omitted. Hence, the XQuery expression in Figure 12.15 also contains two return clauses. The first one belongs to the let and its FLWOR expression (bold font), and the second one is the return of the transform expression.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify let $s := $newinfo/customerinfo/addr/street
             return do replace value of $s with concat($s, " Apt #4")
      return $newinfo ')
WHERE cid = 1002
Figure 12.15
Using let to assign a long path to a short variable
12.4
REPLACING XML NODES IN A DOCUMENT
Suppose a customer has moved to a different city and you need to update the address in the XML document that holds the customer’s information. You could write an UPDATE statement with replace value of expressions to individually change the values of all elements and attributes that make up the address of the customer (country, street, city, prov-state, and pcode-zip). However, such an update can be lengthy and tedious to write. It can be a lot easier to simply replace the existing addr element and all of its children with a new addr element. Such a replacement of a node is done with a replace expression. The replace expression works differently from the replace value of expression. The former replaces the whole node (the old node is deleted), whereas the latter replaces only the value of the target node. Figure 12.16 shows an UPDATE statement that replaces the existing addr element and all of its child nodes with a new addr fragment. The structure of the new XML fragment does not have to be identical to the original one. Indeed, the new address in Figure 12.16 contains the elements state and zipcode, which are different from the original address. Similarly, you could decide to replace the original addr element and all of its children with a single email element, if you wanted to. If you choose to validate updated documents with an XML Schema, the new structure of the document has to conform to the XML Schema.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace $newinfo/customerinfo/addr
                with <addr>
                       <street>555 Bailey Avenue</street>
                       <city>San Jose</city>
                       <state>California</state>
                       <zipcode>95141</zipcode>
                     </addr>
      return $newinfo ')
WHERE cid = 1002
Original document:
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258
Updated document:
Jim Noodle 555 Bailey Avenue San Jose California 95141 905-555-7258
Figure 12.16
Replacing an element node
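The difference between replacing a value and replacing a node can be seen in a short sketch. It is a Python analogy using the standard xml.etree.ElementTree module, not DB2 code, and the abbreviated sample document is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<customerinfo><name>Jim Noodle</name>"
    "<addr><street>25 EastCreek</street><city>Markham</city></addr>"
    "</customerinfo>"
)

# "replace value of": the street node stays, only its value changes.
street = doc.find("addr/street")
street.text = "43 WestCreek"
same_node_kept = doc.find("addr/street") is street

# "replace": the old addr node is removed and a new node takes its position.
old_addr = doc.find("addr")
new_addr = ET.fromstring(
    "<addr><street>555 Bailey Avenue</street><zipcode>95141</zipcode></addr>"
)
position = list(doc).index(old_addr)
doc.remove(old_addr)
doc.insert(position, new_addr)

child_names = [child.tag for child in doc.find("addr")]
```

After the node replacement, the city child of the old addr element is gone because the whole subtree was swapped out, whereas the value replacement left the structure untouched.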
Note that the new addr fragment in the modify clause of the UPDATE statement in Figure 12.16 is not enclosed in single quotes because it is not a string value. Instead, the new addr element and its children are constructed with direct element and attribute constructors (see section 8.4, Constructing XML Data). The XML value that provides the new address can also be computed with an expression. For example, Figure 12.17 uses an XPath expression to obtain the addr element from the customer whose Cid attribute has the value 1004. This address element replaces the address of customer 1002.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do replace $newinfo/customerinfo/addr
                with db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid=1004]/addr
      return $newinfo ')
WHERE cid = 1002
Figure 12.17
Replacing an element with a node computed by an expression
12.5
DELETING XML NODES FROM A DOCUMENT
This section describes how to delete elements or attributes from a document. As an example, suppose that a phone number of a customer is invalid and you want to remove the entire phone element from the corresponding XML document. Figure 12.18 shows a first attempt at writing an appropriate UPDATE statement. It looks much like the previous UPDATE statements except that the updating expression is delete instead of replace value of. In the delete expression, simply specify the path to the elements or attributes that you want to remove from the document.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do delete $newinfo/customerinfo/phone
      return $newinfo')
WHERE cid = 1003
Original document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743
Updated document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8
Figure 12.18
Deleting an element
The document that is being updated in Figure 12.18 contains multiple phone elements, and the delete expression removes all of them. If you don’t want to delete all occurrences of a repeating element, add a predicate to the target path to delete only selected occurrences. For example, the following delete expression removes a phone element only if its type attribute has the value home:
do delete $newinfo/customerinfo/phone[@type="home"]
This delete expression removes exactly one phone element from the original document in Figure 12.18, and leaves the other two phone elements untouched. In general, this expression can delete zero, one, or multiple phone elements from a document, depending on how many phone elements with type equal to home occur in a given document. Modifying repeating elements is further discussed in section 12.8.
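The "zero, one, or multiple" behavior of a delete with a predicate can be sketched as follows. This is a Python analogy using xml.etree.ElementTree rather than DB2 itself; the function name delete_phones, the phone numbers, and the assignment of types to them are made up for illustration.

```python
import xml.etree.ElementTree as ET


def delete_phones(doc, phone_type):
    """Remove every phone child whose type attribute matches; return the count."""
    matches = doc.findall(f"phone[@type='{phone_type}']")
    for phone in matches:
        doc.remove(phone)
    return len(matches)


doc = ET.fromstring(
    "<customerinfo>"
    "<phone type='work'>555-0100</phone>"
    "<phone type='home'>555-0101</phone>"
    "<phone type='cell'>555-0102</phone>"
    "</customerinfo>"
)

removed_home = delete_phones(doc, "home")   # one match
removed_fax = delete_phones(doc, "fax")     # no match, still succeeds
remaining = [p.get("type") for p in doc.findall("phone")]
```

Note the @ in the predicate: without it the path step would test a child element named type instead of the type attribute.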
NOTE: Predicates in the update expression only serve to select nodes within any given document. They do not help you to efficiently find the documents that should be updated. Predicates that select documents for update must be placed in the WHERE clause of the SQL UPDATE statement. They can include XMLEXISTS predicates.
If you want to delete an attribute, such as country, simply use a delete expression with an XPath that points to the attribute:
do delete $newinfo/customerinfo/addr/@country
You can also remove an entire XML fragment from an XML document. For example, the statement in Figure 12.19 deletes the entire addr element including all the child elements and attributes it contains.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do delete $newinfo/customerinfo/addr
      return $newinfo')
WHERE cid = 1002
Original document:
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258
Updated document:
Jim Noodle 905-555-7258
Figure 12.19
Deleting an XML fragment
12.6
RENAMING ELEMENTS OR ATTRIBUTES IN A DOCUMENT
The rename expression enables you to change the name of an element or attribute. For example, the statement in Figure 12.20 renames the addr element to address. The new element name address is a string literal and must be enclosed in double quotes.
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do rename $new/customerinfo/addr as "address"
      return $new ')
WHERE cid = 1002
Original document:
Jim Noodle <addr country="Canada"> 25 EastCreek Markham Ontario N9C 3T6 905-555-7258
Updated document:
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258
Figure 12.20
Changing an element name
DB2 never allows you to update a document in a manner that violates the rules for well-formed XML documents. For example, in an element that already has both an xid attribute and a yid attribute, you cannot rename the attribute xid to yid. This update operation is rejected because it would produce an element with two attributes that have the same name (yid), which is not permitted in any XML document.
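Both the rename operation and the duplicate-attribute check can be illustrated with a sketch. It is a Python analogy using xml.etree.ElementTree, not a description of how DB2 implements rename; the helper names and the item element with its attribute values are hypothetical.

```python
import xml.etree.ElementTree as ET


def rename_element(element, new_name):
    # Changing the tag keeps attributes and children in place.
    element.tag = new_name


def rename_attribute(element, old_name, new_name):
    # Refuse a rename that would yield two attributes with the same name.
    if new_name in element.attrib:
        raise ValueError(f"attribute {new_name!r} already exists")
    element.set(new_name, element.attrib.pop(old_name))


doc = ET.fromstring(
    "<customerinfo><addr country='Canada'><city>Markham</city></addr>"
    "</customerinfo>"
)
rename_element(doc.find("addr"), "address")

item = ET.fromstring("<item xid='1' yid='2'/>")  # hypothetical element
try:
    rename_attribute(item, "xid", "yid")
    rejected = False
except ValueError:
    rejected = True
```

The renamed address element keeps its country attribute and its children, and the attempted attribute rename is refused before the element is changed.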
12.7
INSERTING XML NODES INTO A DOCUMENT
This section describes how to add element or attribute nodes to a document. When you insert a new element or attribute into a document, you must specify the target position of the new node in the document. We first discuss the positioning of inserted elements, then the positioning of inserted attributes, and then look at several examples.
12.7.1
Defining the Position of Inserted Elements
Suppose you want to insert the new element <email>[email protected]</email> into the XML document for customer Jim Noodle. You have to decide which existing element is going to be the parent for the new email element. For example, you might decide that email is going to be a child element of the root element customerinfo. This makes email a sibling of the elements name, addr, and phone. Then you can further choose the position of the email element among its siblings. For example, should email appear before or after the addr element? Alternatively, you could decide that email is going to be a child element of addr and therefore becomes a sibling of street, city, prov-state, and pcode-zip. The insert operation in the modify clause allows you to add new nodes to an XML document. It offers five ways to specify the position of the new node: into, as last into, as first into, after, and before. Examples of using these five options for a new element are listed in Table 12.1.
Table 12.1
Five Options for Inserting an Element into a Document
Insert operation: insert <email>[email protected]</email> into $new/customerinfo
Position of the inserted node: email becomes a child element of customerinfo. The position of email among the existing children of customerinfo is nondeterministic.
Insert operation: insert <email>[email protected]</email> as last into $new/customerinfo
Position of the inserted node: email becomes the last child element of customerinfo.
Insert operation: insert <email>[email protected]</email> as first into $new/customerinfo
Position of the inserted node: email becomes the first child element of customerinfo.
Insert operation: insert <email>[email protected]</email> after $new/customerinfo/addr
Position of the inserted node: email becomes a sibling of addr and therefore a child of customerinfo. email appears immediately after addr.
Insert operation: insert <email>[email protected]</email> before $new/customerinfo/addr
Position of the inserted node: email becomes a sibling of addr and a child of customerinfo. email appears immediately before addr.
The path that defines the target location of the insert, such as $new/customerinfo or $new/customerinfo/addr, has to produce exactly one node. If the path does not exist in the document or if it exists more than once, the operation fails with error SQL16085N. If you look up the explanation for SQL16085N you find that a common reason is described as “the target node of an insert expression is not a single element node or document node.” Beware that the words “not a single element node” do not necessarily imply that more than one target node was found. It’s equally possible that no target node was found. “Not a single element” means that either zero or more than one node was found, so you should check for both cases when you encounter error SQL16085N. For example, if you misspell a tag name in the target path, error SQL16085N is raised because no target node was found.
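The positioning options and the "exactly one target node" rule can be sketched together. The code below is a Python analogy using xml.etree.ElementTree, not the DB2 implementation; single_node, insert_child, and insert_sibling are hypothetical helper names, and the sample document is abbreviated.

```python
import xml.etree.ElementTree as ET


def single_node(root, path):
    """Return the one node matching path; fail for zero or several matches."""
    matches = root.findall(path)
    if len(matches) != 1:
        raise LookupError(f"{path!r} matched {len(matches)} nodes, expected 1")
    return matches[0]


def insert_child(root, new_node, parent_path, first=False):
    """Rough analogue of 'as first into' and 'as last into'."""
    parent = single_node(root, parent_path)
    if first:
        parent.insert(0, new_node)
    else:
        parent.append(new_node)


def insert_sibling(root, new_node, parent_path, target_name, after=False):
    """Rough analogue of 'before' and 'after' a child of the given parent."""
    parent = single_node(root, parent_path)
    target = single_node(parent, target_name)
    index = list(parent).index(target)
    parent.insert(index + 1 if after else index, new_node)


doc = ET.fromstring(
    "<customerinfo><name>Jim Noodle</name><addr/><phone/></customerinfo>"
)
insert_sibling(doc, ET.Element("email"), ".", "addr", after=True)
order = [child.tag for child in doc]

try:
    insert_sibling(doc, ET.Element("email"), ".", "adr")  # misspelled tag
    failure = None
except LookupError as exc:
    failure = str(exc)
```

The misspelled target name fails because zero nodes were found, which is the case that is easy to overlook when reading "not a single element node" in the explanation of SQL16085N.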
12.7.2
Defining the Position of Inserted Attributes
To insert a new attribute instead of an element, you have to use a computed attribute constructor. It consists of the keyword attribute followed by the attribute name and an expression or constant that provides the attribute value. The same five insert options are available as for elements and are shown in Table 12.2. The difference for attributes is that the operations into $new/customerinfo, as last into $new/customerinfo, and as first into $new/customerinfo all have the same effect. Their effect is that the new attribute becomes an attribute of the element customerinfo. Since the XML data model does not define a positional order among the attributes of an element, attributes are always unordered. Therefore the keywords last, first, before, and after do not affect the position of attributes. If you insert an attribute before or after $new/customerinfo/addr, the attribute becomes a sibling of addr and is therefore added to the parent of addr, which is customerinfo.
Table 12.2
Five Options for Inserting an Attribute into a Document
Insert operations:
insert attribute email {"[email protected]"} into $new/customerinfo
insert attribute email {"[email protected]"} as last into $new/customerinfo
insert attribute email {"[email protected]"} as first into $new/customerinfo
Position of the inserted node: In all three cases, email becomes an attribute of customerinfo. The position of email among the existing attributes is undefined because attributes are not ordered.
Insert operations:
insert attribute email {"[email protected]"} after $new/customerinfo/addr
insert attribute email {"[email protected]"} before $new/customerinfo/addr
Position of the inserted node: In both cases, email becomes an attribute of the parent of addr, which is customerinfo.
12.7.3
Insert Examples
For the following examples, assume that an email element has to be inserted into the XML document for Robert Shoemaker. This document is identified by the relational cid value 1003. Figure 12.21 shows a first attempt at performing this update. The UPDATE statement fails with error message SQL20345N because the target path is specified as $new instead of $new/customerinfo. When the target path is $new, the email element is inserted as a sibling and not as a child of the customerinfo element. The result is a sequence of two elements (customerinfo, email), which is not a well-formed XML document. Since XML columns can only contain well-formed documents, the update fails. It fails for the same reason if you specify before $new/customerinfo or after $new/customerinfo as the target position.
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>[email protected]</email> as last into $new
      return $new')
WHERE cid = 1003

SQL20345N The XML value is not a well-formed document with a single root element. SQLSTATE=2200L
Original document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743
Rejected XML value:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743 [email protected]
Figure 12.21
Cannot insert an element as a sibling of the root element
Figure 12.22 shows the corrected UPDATE statement and the correctly modified XML document. You could similarly insert the email element as first into $new/customerinfo.
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>[email protected]</email> as last into $new/customerinfo
      return $new')
WHERE cid = 1003
Original document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743
Updated document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743 [email protected]
Figure 12.22
Inserting a new element as the last element
If you want the email element to appear in the document before the phone elements, you can explicitly request it to be inserted before the first occurrence of any existing phone elements using the positional predicate [1]. This is shown in Figure 12.23 where the positional predicate selects exactly one phone element as the target location. If you omit the positional predicate, the UPDATE statement fails with error SQL16085N. The statement in Figure 12.23 would also fail if the document contained no phone elements.
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>[email protected]</email> before $new/customerinfo/phone[1]
      return $new')
WHERE cid = 1003
Original document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743
Updated document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 [email protected] 905-555-7258 416-555-2937 905-555-8743
Figure 12.23
Inserting a new element before an existing element
If you want to insert the email element after the last phone element but before any other elements that might appear at the end of the document, specify the insert position to be after $new/customerinfo/phone[last()]. As another example, Figure 12.24 shows an UPDATE statement that inserts the new email element as the first child of the addr element. Alternatively, the UPDATE statement in Figure 12.25 inserts the email address as an attribute of the addr element. In the updated document, the attribute email happens to appear before the attribute country. But this order is not relevant and not guaranteed because XML attributes have no defined order. If you change the target position of the inserted attribute to after $new/customerinfo/addr/city or before $new/customerinfo/addr/@country, the updated document is still the same as shown in Figure 12.25.
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>[email protected]</email> as first into $new/customerinfo/addr
      return $new')
WHERE cid = 1003
Original document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743
Updated document:
Robert Shoemaker [email protected] 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743
Figure 12.24
Inserting a new element as the first child element of a target node
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert attribute email {"[email protected]"} into $new/customerinfo/addr
      return $new')
WHERE cid = 1003
Original document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743
Updated document:
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743
Figure 12.25
Inserting an attribute
12.8
HANDLING REPEATING AND MISSING NODES
If a single XPath expression identifies multiple nodes in a single document, they are called repeating nodes. In previous sections you saw that the XML document for Robert Shoemaker contains multiple phone elements. Hence, the element phone is a repeating element and the path /customerinfo/phone produces a sequence of more than one element node.
As defined by the XQuery Update standard, the delete expression is the only update operation that can directly process multiple occurrences of a node. It simply deletes all of them, as you saw in section 12.5. All other update expressions (replace, replace value of, rename, and insert) require special attention when dealing with repeating nodes. The same applies to missing nodes. If you try to delete an element or attribute that does not exist, the delete expression performs no action and returns successfully. However, all other update expressions fail when they try to modify an element or attribute that does not exist in the target document. The UPDATE statement in Figure 12.26 tries to change the value of a phone element but fails. At runtime, DB2 detects that there is more than one phone element in the target document and returns error SQL16085N. You can type “? SQL16085N” at the DB2 command prompt to find that the explanation for reason code XUTY0008 is that “the target node of a replace expression is not a single node”. This reason code indicates that the target path $new/customerinfo/phone has either produced multiple phone elements or none. However, it must produce exactly one node for the update to be successful. The error prevents you from updating multiple phone elements with the same number, which would not make sense. If no phone element exists, the error ensures that you are not led to believe that the new phone number was successfully written to the document.
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do replace value of $new/customerinfo/phone with "123-456-7890"
      return $new ')
WHERE cid = 1003

SQL16085N The target node of an XQuery "replace value of" expression is not valid. Error QName=err:XUTY0008. SQLSTATE=10703.
Figure 12.26
Trying to replace the value of a repeating element
If you know that there are multiple phone elements, a common way to avoid error SQL16085N is to add a predicate to the target path to select exactly one phone element for update. As an example, Figure 12.27 uses the predicate [@type="cell"] to update only the cell phone number.

UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify do replace value of $new/customerinfo/phone[@type="cell"]
            with "123-456-7890"
  return $new ')
WHERE cid = 1003
Figure 12.27
Replacing one of multiple occurrences of an element
Chapter 12
Updating and Transforming XML Documents
Using the predicate in Figure 12.27 works well if every possible target document contains exactly one phone element with a type attribute equal to cell. However, if a document does not contain a cell phone element, the UPDATE statement in Figure 12.27 still fails with error SQL16085N. In that case, another option is to use the XQuery if-then-else expression, as shown in Figure 12.28. If a cell phone element exists, its value is replaced with the new value; otherwise a new cell phone element with the new number is inserted. This implements an “upsert” operation.

UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify
    if ($new/customerinfo/phone[@type="cell"])
    then do replace value of $new/customerinfo/phone[@type="cell"]
            with "123-456-7890"
    else do insert <phone type="cell">123-456-7890</phone>
            as last into $new/customerinfo
  return $new ')
WHERE cid = 1001
Figure 12.28
Conditional update and insert of an element
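The replace-or-insert control flow of Figure 12.28 can also be traced outside the database. The following Python sketch mirrors it with the standard xml.etree.ElementTree module. It only illustrates the logic and is not DB2 code; the function name upsert_cell_phone and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def upsert_cell_phone(doc_text, number):
    """Replace the value of the cell phone element, or insert one if it is missing."""
    root = ET.fromstring(doc_text)                # root is <customerinfo>
    cells = root.findall("phone[@type='cell']")   # same test as phone[@type="cell"]
    if len(cells) > 1:
        # A plain "replace value of" would be rejected here too (XUTY0008).
        raise ValueError("more than one cell phone element")
    if cells:
        cells[0].text = number                    # then: replace the value
    else:
        new_phone = ET.SubElement(root, "phone", {"type": "cell"})
        new_phone.text = number                   # else: insert as last child
    return ET.tostring(root, encoding="unicode")

# Hypothetical sample document without a cell phone element.
doc = ("<customerinfo><name>Jim Noodle</name>"
       "<phone type='work'>905-555-7258</phone></customerinfo>")
print(upsert_cell_phone(doc, "123-456-7890"))
```

Calling the function a second time on its own output takes the other branch and replaces the value of the cell phone element that the first call inserted.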
The most resilient solution for handling both repeating and missing elements is a FLWOR expression in the modify clause (see Figure 12.29). The for clause iterates over the target elements one at a time, so that the replace value of expression in the return clause is always applied to exactly one element. If you remove the condition where $j/@type = "cell", all phone elements are updated with the same number "123-456-7890", regardless of their type. If a document does not contain a cell phone or no phone elements at all, the return clause of the FLWOR expression is never invoked so that the replace value of expression never fails due to a missing node. In summary, the FLWOR expression in the modify clause enables an UPDATE statement to
• Modify multiple or all occurrences of a repeating node (without warning)
• Add predicates to select which occurrences of a repeating node to modify
• Silently proceed and return successfully even if a target node is not found
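The contrast between a plain replace value of expression and the FLWOR form can be modeled in a few lines of Python. This is a sketch of the behavior described above, not DB2 code: the function names replace_exactly_one and replace_each are hypothetical, and the type attributes in the sample document are assumed for illustration.

```python
import xml.etree.ElementTree as ET

def replace_exactly_one(root, path, value):
    """Model of a plain 'replace value of': the path must select exactly one node."""
    targets = root.findall(path)
    if len(targets) != 1:
        raise ValueError(f"target is not a single node ({len(targets)} found)")
    targets[0].text = value

def replace_each(root, path, value):
    """Model of the FLWOR form: visit matches one at a time; zero matches is fine."""
    count = 0
    for node in root.findall(path):
        node.text = value
        count += 1
    return count

root = ET.fromstring(
    "<customerinfo>"
    "<phone type='work'>905-555-7258</phone>"
    "<phone type='home'>416-555-2937</phone>"
    "<phone type='cell'>905-555-8743</phone>"
    "</customerinfo>"
)
print(replace_each(root, "phone[@type='cell']", "123-456-7890"))  # one match updated
print(replace_each(root, "phone[@type='fax']", "123-456-7890"))   # no match, no error
try:
    replace_exactly_one(root, "phone", "123-456-7890")             # three matches
except ValueError as err:
    print("rejected:", err)
```

The loop version never needs to know how many nodes the path selects, which is why it tolerates both repeating and missing nodes.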
UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify
    for $j in $new/customerinfo/phone
    where $j/@type = "cell"
    return do replace value of $j with "123-456-7890"
  return $new')
WHERE cid = 1000

Original document
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Updated document
Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 123-456-7890

Figure 12.29
Iterating over the occurrences of a repeating element
12.9
MODIFYING MULTIPLE XML NODES IN THE SAME DOCUMENT
You can have multiple update operations for the same document in the modify clause of a single UPDATE statement. However, you cannot rename, replace, or update the value of the same node more than once. In this section we discuss examples where multiple combined update operations are or are not in conflict with each other.
12.9.1
Snapshot Semantics and Conflict Situations
The XQuery Update standard defines that all update operations in the modify clause are applied independently from each other to the original document. They don’t see each others’ effects. This is called snapshot semantics, which means that each update operation is logically applied to a separate snapshot of the original document. As an example, let’s look at the UPDATE statement in Figure 12.30, which contains two updating expressions in the modify clause, separated by a comma. The first expression inserts an additional phone element. The second expression deletes all phone elements. The obvious question is whether the newly inserted phone element is instantly removed by the delete expression, and whether that depends on the order in which the insert and the delete operations appear in the modify clause. As it turns out, the new phone element is not affected by the delete expression, irrespective of the order in which the operations appear in the modify clause. Due to snapshot
semantics, both the insert and the delete expressions in Figure 12.30 are independently applied to a snapshot of the original document. Therefore the delete expression does not see the newly inserted phone element and only removes the old phone elements that existed in the document prior to this update. Hence, there is no conflict between the insert and the delete expression in Figure 12.30.

UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify(
    do insert <phone>777-555-3333</phone>
       after $new/customerinfo/addr ,
    do delete $new/customerinfo/phone )
  return $new ')
WHERE cid = 1002
Original document
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 777-555-3333

Figure 12.30
Combining an insert and a delete operation
For comparison, let’s look at a different combination of an insert and a delete expression in Figure 12.31. One of the expressions deletes the addr element, and the other expression inserts a new POBox element into the addr element. Again, the order of the two operations in the modify clause is irrelevant. Nevertheless, the two operations conflict with each other because the delete expression removes the parent element (addr) of the newly inserted POBox element. For this case, the language standard defines that delete “wins” over insert and the updated document has no addr or POBox elements. Be aware of these effects when you code complex updates.
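Both outcomes can be reproduced with a small Python model of snapshot semantics. The sketch below resolves every target against the original document first and only then applies the changes, with deletes last. It is an illustration under simplifying assumptions, not DB2 code: the function apply_updates is hypothetical, the sample document is abbreviated, and new elements are appended as the last child instead of being placed after a sibling.

```python
import xml.etree.ElementTree as ET

def apply_updates(doc_text, inserts, delete_paths):
    """Apply inserts and deletes whose targets are all resolved on the original.

    inserts:      list of (parent_path, new_element_text) pairs
    delete_paths: list of paths, relative to the root, of elements to delete
    """
    root = ET.fromstring(doc_text)
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Step 1: resolve every target before any change is made (the snapshot).
    pending_inserts = []
    for parent_path, new_text in inserts:
        parent = root.find(parent_path)
        if parent is None:
            raise ValueError(f"insert target not found: {parent_path}")
        pending_inserts.append((parent, ET.fromstring(new_text)))
    pending_deletes = [n for path in delete_paths for n in root.findall(path)]

    # Step 2: apply the pending changes; deletes come last and remove subtrees.
    for parent, new_node in pending_inserts:
        parent.append(new_node)
    for node in pending_deletes:
        parent_of[node].remove(node)
    return ET.tostring(root, encoding="unicode")

doc = ("<customerinfo><name>Jim Noodle</name>"
       "<addr><city>Markham</city></addr>"
       "<phone>905-555-7258</phone></customerinfo>")

# Like Figure 12.30: the new phone survives, only the old phone is deleted.
print(apply_updates(doc, [(".", "<phone>777-555-3333</phone>")], ["phone"]))
# Like Figure 12.31: POBox goes into addr, then addr is deleted together with it.
print(apply_updates(doc, [("addr", "<POBox>15</POBox>")], ["addr"]))
```

Because the delete list is computed before the insert is applied, the order of the two operations in the argument lists cannot change the result, which matches the behavior described for the modify clause.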
UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify(
    do delete $new/customerinfo/addr ,
    do insert <POBox>15</POBox> into $new/customerinfo/addr )
  return $new ')
WHERE cid = 1002
Original document
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document
Jim Noodle 777-555-3333

Figure 12.31
A different combination of an insert and a delete operation
12.9.2
Converting Elements to Attributes and Vice Versa
The UPDATE statement in Figure 12.32 is another interesting example. It combines two insert expressions and two delete expressions in a single statement. The objective is to turn the existing Cid attribute into an element called customerid, and the existing element name into an attribute called custname. Four update operations are required to make this happen:
• Insert a customerid element and compute its value from the existing Cid attribute
• Insert a custname attribute and take its value from the existing name element
• Delete the existing Cid attribute
• Delete the existing name element

Again, the order of these four expressions in the modify clause does not matter. Snapshot semantics ensures that the four expressions are applied in isolation and produce the intended result. In particular, the insert expressions see their own logical snapshots of the original document, which enables them to read the Cid attribute and the name element even though these nodes are being deleted at the same time.
UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify(
    do insert <customerid>{$new/customerinfo/data(@Cid)}</customerid>
       as first into $new/customerinfo ,
    do insert attribute custname {$new/customerinfo/name}
       into $new/customerinfo,
    do delete $new/customerinfo/@Cid,
    do delete $new/customerinfo/name )
  return $new')
WHERE cid = 1002
Document before the update
Document after the update
Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258
" 2001,"" 2002,""
Figure 17.23
Schema identifiers in the delimited format input file
The input file in Figure 17.23 tells DB2 to use the XML Schema CUSTXSD1 to validate the XML documents contained in files data2.xml and data4.xml, and the schema CUSTXSD2 to validate the XML document data3.xml. Additionally you need to include the XMLVALIDATE USING XDS clause in the IMPORT or LOAD command (see Figure 17.24). Otherwise the SCH attributes in the input are ignored and no validation is performed.

IMPORT FROM c:\xml\load_customer.txt OF DEL
  XML FROM c:\xml
  XMLVALIDATE USING XDS
  INSERT INTO customer
Figure 17.24
Performing XML Schema validation during IMPORT with multiple schemas
Chapter 17
Validating XML Documents against XML Schemas
What happens if the delimited format input file contains schema references (SCH attributes) but you use the XMLVALIDATE USING SCHEMA clause in the LOAD or IMPORT command? In this case the XML Schema specified in the XMLVALIDATE USING SCHEMA clause takes precedence, all documents are validated against that one schema, and the SCH attributes in the input file are ignored. For a large number of documents you normally don’t create the delimited format input file manually—you may have an application or script that creates it for you. Also, note that DB2’s EXPORT utility can export tables (or subsets of a table defined by a query) to the file system. When you export XML data, the EXPORT utility automatically generates a delimited format file and optionally includes SCH attributes with schema identifiers for all documents that have been validated. Samples of the output produced by EXPORT are shown in Figure 17.23, Figure 17.25, and Figure 17.27.
17.7.3 Using a Default XML Schema
When schema references are included in the delimited format input file, it is possible that not every XDS has a SCH attribute (see Figure 17.25). In this case, the LOAD and IMPORT commands allow you to specify a default schema for those records that do not have a SCH attribute in the input file.

2000,""
2001,""
2002,""
Figure 17.25
Schema identifiers in the delimited format input file
The IMPORT command in Figure 17.26 contains the DEFAULT option in the XMLVALIDATE USING XDS clause to indicate that any input documents that don’t have a schema reference in the XDS must be validated against the schema custxsd1.

IMPORT FROM c:\xml\load_customer.txt OF DEL
  XML FROM c:\xmldata
  XMLVALIDATE USING XDS DEFAULT db2admin.custxsd1
  INSERT INTO customer
Figure 17.26
Specifying a default schema for validation
Note that the DEFAULT clause takes precedence over the IGNORE and MAP clauses (discussed in the next sections).
17.7.4 Overriding XML Schema References
Assume you need to import XML data using the delimited format input file in Figure 17.27. This input file contains references to XML Schemas custxsd1, custxsd2, and custxsd3.
17.7
Validation during Load and Import Operations
2000,"" />" />"
Figure 17.27
Schema identifiers in the delimited format input file
Let’s say you only want to validate the documents that reference schema custxsd1, but not the documents that reference custxsd2 or custxsd3. One reason could be that you received the input data but you only have schema custxsd1 and not the other two. Another reason could be that the documents for schemas custxsd2 and custxsd3 are already known to be valid and you want to save the CPU cycles of validating them again. In such cases you can add the IGNORE keyword with a list of schema identifiers to the XMLVALIDATE USING XDS clause. An example is shown in Figure 17.28. It tells DB2 to perform validation based on the schemas specified in the SCH attributes, but not to validate any documents that reference any of the schemas listed in the IGNORE clause.

IMPORT FROM c:\xml\tab.txt OF DEL
  XML FROM c:\xmldata
  XMLVALIDATE USING XDS
    IGNORE (db2admin.custxsd2, db2admin.custxsd3)
  INSERT INTO customer
Figure 17.28
Disabling validation for selected XML Schemas
Instead of ignoring certain XML Schemas you can also override them with a different schema. The MAP clause allows you to specify alternate XML Schemas to use in place of those specified by the SCH attributes in the delimited format input file. The MAP clause specifies a list of one or more XML Schema pairs, where each pair represents a mapping from one XML Schema to another. The first XML Schema in the pair represents a schema that is referenced by an SCH attribute in an XDS. The second XML Schema in the pair represents the schema that should be used to perform validation. An example is shown in Figure 17.29, where the IMPORT command uses the schema custxsd1 whenever it sees schema custxsd2 or custxsd3 in an SCH attribute in the input file.

IMPORT FROM c:\xml\tab.txt OF DEL
  XML FROM c:\xmldata
  XMLVALIDATE USING XDS
    MAP ((db2admin.custxsd2, db2admin.custxsd1),
         (db2admin.custxsd3, db2admin.custxsd1))
  INSERT INTO customer
Figure 17.29
Import with validation against “mapped” XML Schemas
The following usage rules apply:
• If an XML Schema is present in the left side of a schema pair in the MAP clause, it cannot also be specified in the IGNORE clause.
• If an XML Schema is present in the right side of a schema pair in the MAP clause, it will not be subsequently ignored if listed in the IGNORE clause.
• An XML Schema cannot be mapped more than once. It cannot appear on the left side of more than one schema pair.
• Schema mappings in the MAP clause are non-transitive. For example, assume schema custxsd3 is mapped to schema custxsd2, and assume a second pair maps schema custxsd2 to schema custxsd1; then schema custxsd1 will not be used instead of schema custxsd3.
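The interplay of the DEFAULT, IGNORE, and MAP clauses can be summarized as a small decision function. The Python sketch below is a model of the rules as stated in this section, not DB2 code; the names check_clauses and schema_for are hypothetical, and schema identifiers are written without their SQL schema qualifier for brevity.

```python
def check_clauses(ignore, map_pairs):
    """Reject clause combinations that the usage rules disallow."""
    left_sides = [left for left, _ in map_pairs]
    if len(left_sides) != len(set(left_sides)):
        raise ValueError("a schema may appear on the left side of only one pair")
    both = set(left_sides) & set(ignore)
    if both:
        raise ValueError(f"mapped and ignored at the same time: {sorted(both)}")

def schema_for(sch, default=None, ignore=(), map_pairs=()):
    """Return the schema used to validate one record, or None for no validation.

    sch is the value of the record's SCH attribute, or None if it has none.
    """
    check_clauses(ignore, map_pairs)
    if sch is None:
        return default           # no SCH attribute: DEFAULT applies, if given
    mapping = dict(map_pairs)
    if sch in mapping:
        return mapping[sch]      # a single lookup: mappings are not transitive
    if sch in ignore:
        return None              # referenced schema is ignored: skip validation
    return sch

pairs = [("custxsd3", "custxsd2"), ("custxsd2", "custxsd1")]
print(schema_for("custxsd3", map_pairs=pairs))   # custxsd2, not custxsd1
print(schema_for("custxsd2", ignore=["custxsd2", "custxsd3"]))   # None
```

The default schema is returned without consulting the other two clauses, which reflects the statement that DEFAULT takes precedence over IGNORE and MAP.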
17.7.5 Validation Based on schemaLocation Attributes
The IMPORT command in Figure 17.30 contains the clause XMLVALIDATE USING SCHEMALOCATION HINTS. This clause indicates that each XML document in the input file is to be validated against the XML Schema that is referenced by the optional xsi:schemaLocation attribute within the document. An xsi:schemaLocation attribute, which is also called a schema location hint, contains a pair of target namespace and schema location. This pair can identify an XML Schema that you have previously registered in the XML Schema Repository. Earlier in this chapter, Figure 17.2 showed an XML document with an xsi:schemaLocation attribute.

IMPORT FROM c:\xml\load_customer.txt OF DEL
  XML FROM c:\xmldata
  XMLVALIDATE USING SCHEMALOCATION HINTS
  INSERT INTO customer
Figure 17.30
Validation with schema location hints
17.8 CHECKING WHETHER AN EXISTING DOCUMENT HAS BEEN VALIDATED
DB2 allows you to check whether an XML document that is stored in a table has previously been validated. This can be done in a couple of ways. In DB2 for Linux, UNIX, and Windows you can use the IS VALIDATED predicate, which works similarly to the IS NULL predicate that you might already be familiar with. The query in Figure 17.31 checks every XML document in the info column of the customer table and returns YES if the document has been validated, and NO otherwise.
SELECT id,
       CASE WHEN info IS VALIDATED THEN 'YES' ELSE 'NO' END AS isvalid
FROM customer
Figure 17.31
Checking which documents in a table have been validated
The query in Figure 17.32 is very similar but uses a WHERE clause with an XMLEXISTS predicate to check the validation status only of the document(s) where the customer name is Matt Foreman.

SELECT CASE WHEN info IS VALIDATED THEN 'YES' ELSE 'NO' END AS isvalid
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[name = "Matt Foreman"]')
Figure 17.32
Checking whether a specific document has been validated
To perform similar checks in DB2 for z/OS you need to maintain an additional column in your user table. The column can contain 0 or 1 to indicate whether the document has been validated. Alternatively you can store the OBJECTID of the XML Schema in a BIGINT column. Then you can easily query this column to determine which schema a given XML document belongs to.
17.9
VALIDATING EXISTING DOCUMENTS IN A TABLE
You might encounter a situation where you already have XML documents stored in an XML column and want to validate them against an XML Schema. Maybe they were never validated and you want to validate them now. Or, maybe they had been validated when they were inserted, but now you want to validate them against a new schema. Either way, the validation of existing documents can be achieved with SELECT or UPDATE statements. Let’s look at the update process first. Figure 17.33 shows an UPDATE statement that replaces a document with a validated copy of itself. The WHERE clause uses a relational predicate to identify a single row in the customer table. In this row, the XML document in the info column is replaced with the result of the XMLVALIDATE function. The XMLVALIDATE function itself also takes the info column as input. If the document is not valid against the specified XML Schema, the update fails. Otherwise the document is replaced with itself and the OBJECTID of the XML Schema gets attached to the document. This links the document to its schema. The function XMLXSROBJECTID can take the document or any part of it as input, and returns the OBJECTID of the schema that the document was validated against (see section 17.10).
UPDATE customer
SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
WHERE id = 1000
Figure 17.33
Validating an existing document
The UPDATE statement in Figure 17.34 is similar to that in Figure 17.33, but has a different predicate in the WHERE clause. It tries to validate all documents in the XML column that have not been validated before. This update works as expected if all those documents are valid against the specified XML Schema. However, the problem with this UPDATE statement is that it fails and rolls back as soon as the first invalid document is encountered. The reason for this behavior is that the SQL/XML standard requires the XMLVALIDATE function to raise an error if validation fails. You will see later how error handling in a stored procedure can circumvent this problem (see Figure 17.38).

UPDATE customer
SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
WHERE info IS NOT VALIDATED
Figure 17.34
Validating multiple existing documents
Beware that a bulk update with validation of a large number of documents can take a significant amount of time. All affected documents are rewritten in the table space and logged.

If you are only interested in a Yes/No answer whether certain documents are valid for a given schema, and if you don’t require the relationship between documents and schema to be permanently recorded in the database, then a SELECT statement can be used instead of an UPDATE statement. The query in Figure 17.35 reads XML documents from the info column for all customers whose city is Toronto. At the same time it uses the XMLVALIDATE function in the SELECT clause to validate the documents upon retrieval. The query fails at runtime as soon as one document is retrieved that is not valid for the specified schema.

SELECT XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[addr/city = "Toronto"]')
Figure 17.35
Retrieving and validating documents at the same time
If the validation is performed in a stored procedure, an exception handler can catch and handle the validation failure. Figure 17.36 shows a simple stored procedure that takes a single XML document as input and returns 1 if the document is valid and 0 if it is not valid. If the input document is not valid for the specified schema, the exit handler catches the error that is raised by XMLVALIDATE and sets the output parameter isvalid to 0.

CREATE PROCEDURE validate(IN doc XML, OUT isvalid INTEGER)
LANGUAGE SQL
BEGIN
  DECLARE INVALID_DOCUMENT CONDITION FOR '2200M';
  DECLARE EXIT HANDLER FOR INVALID_DOCUMENT
    SET isvalid = 0;
  IF (XMLVALIDATE(doc ACCORDING TO XMLSCHEMA ID db2admin.custxsd) IS VALIDATED)
  THEN
    SET isvalid = 1;
  END IF;
END #
Figure 17.36
Stored procedure to validate an existing document
The stored procedure in Figure 17.36 can be called from an application or from other stored procedures that manipulate XML documents. You can also call it in the DB2 Command Line Processor, if the first parameter of the stored procedure call is a query that produces a single XML document. This is illustrated in Figure 17.37, where the XML document with id = 1003 from the customer table is passed to the stored procedure for validation. The output shows that the output parameter isvalid has the value 1, which means that the document is valid.

db2 => call validate((SELECT info FROM customer WHERE id = 1003),?)

  Value of output parameters
  --------------------------
  Parameter Name  : ISVALID
  Parameter Value : 1

  Return Status = 0

db2 =>
Figure 17.37
Testing the validation stored procedure in the CLP
The stored procedure in Figure 17.38 is designed to perform the same task as the UPDATE statement in Figure 17.34. That is, it validates all documents in the XML column that have not been validated before. The major difference is that this stored procedure does not fail and abort when the first invalid document is encountered. Instead, it loops over the XML documents and uses a CONTINUE handler to count invalid documents instead of raising an error. Alternatively, you could change the CONTINUE handler to write the id values of the invalid documents to a separate table, or take any other appropriate action.
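The difference between the set-based UPDATE and the cursor loop with a CONTINUE handler is the same as the difference between letting an exception escape a batch and catching it per item. The following Python sketch models that contrast. It is an analogy rather than DB2 code: the InvalidDocument class stands in for the SQLSTATE 2200M condition, and the validate function is a hypothetical placeholder for real schema validation.

```python
class InvalidDocument(Exception):
    """Stand-in for the SQLSTATE 2200M condition raised by XMLVALIDATE."""

def validate(doc):
    # Hypothetical stand-in for schema validation: require a "name" entry.
    if "name" not in doc:
        raise InvalidDocument(doc)
    return doc

def update_all_or_nothing(docs):
    """Like the single UPDATE statement: the first failure aborts everything."""
    return [validate(d) for d in docs]

def update_with_continue(docs):
    """Like the cursor loop with a CONTINUE handler: count failures, keep going."""
    validated, num_invalid = [], 0
    for d in docs:
        try:
            validated.append(validate(d))
        except InvalidDocument:
            num_invalid += 1      # same role as SET count = count + 1
    return validated, num_invalid

docs = [{"id": 1000, "name": "A"}, {"id": 1001}, {"id": 1002, "name": "C"}]
valid_docs, num_invalid_docs = update_with_continue(docs)
print(len(valid_docs), num_invalid_docs)
```

Replacing the counter with a list of the failing id values corresponds to the alternative mentioned above, where the handler records the invalid documents in a separate table.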
CREATE PROCEDURE bulkvalidate(OUT num_invalid_docs INTEGER)
LANGUAGE SQL
BEGIN
  DECLARE count INTEGER DEFAULT 0;
  DECLARE INVALID_DOCUMENT CONDITION FOR '2200M';
  DECLARE CONTINUE HANDLER FOR INVALID_DOCUMENT
    SET count = count + 1;
  FOR doc AS cur1 CURSOR FOR
      SELECT id, info FROM customer
      WHERE info IS NOT VALIDATED
      FOR UPDATE OF INFO
  DO
    UPDATE customer
    SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
    WHERE CURRENT OF cur1;
  END FOR;
  SET num_invalid_docs = count;
END#
Figure 17.38
Stored procedure to validate multiple existing documents
17.10
FINDING THE XML SCHEMA FOR A VALIDATED DOCUMENT
DB2 for Linux, UNIX, and Windows also allows you to determine which XML Schema was used to validate a particular XML document. Every XML Schema that is registered in DB2 is assigned an internal identification number of type BIGINT. You can see this number in the column OBJECTID of the catalog view SYSCAT.XSROBJECTS. Whenever an XML document is validated against an XML Schema, the unique identifier (OBJECTID) is stored with the XML document. The scalar function XMLXSROBJECTID takes an XML document as input and returns the OBJECTID of the XML Schema that was used to validate the XML document. If the input document hasn’t been validated, the value 0 is returned. There are several interesting uses of the function XMLXSROBJECTID. One is to find the XML Schema that was used to validate a specific document. Another is finding all documents that have been validated against a particular XML Schema. Figure 17.39 shows how to use the function XMLXSROBJECTID in the WHERE clause of an SQL statement to join with the OBJECTID column in the catalog view syscat.xsrobjects. Together with the predicate on the relational id column, this retrieves information about the schema that was used to validate the document with id 1003. Instead of the relational predicate you can certainly also use an XMLEXISTS predicate to qualify one or multiple XML documents based on the contents of the XML document itself.
SELECT c.id,
       SUBSTR(x.objectschema,1,10) AS xmlschema_schema,
       SUBSTR(x.objectname,1,10) AS xmlschema_name
FROM customer c, syscat.xsrobjects x
WHERE XMLXSROBJECTID(c.info) = x.OBJECTID
  AND c.id = 1003;

ID              XMLSCHEMA_SCHEMA XMLSCHEMA_NAME
--------------- ---------------- --------------
           1003 DB2ADMIN         CUSTXSD
Figure 17.39
Finding schema information for a given XML document
NOTE
There is no hard dependency between a document and the XML Schema it was validated against. This means that an XML Schema can be dropped from the XML Schema Repository even if the database contains documents that were validated against this schema. Those documents continue to carry the OBJECTID of the XML Schema even after the schema is dropped. The OBJECTID now points to a non-existing XML Schema, which has no impact other than the obvious; that is, you won’t find the schema that belongs to these documents.
While the query in Figure 17.39 finds the XML Schema for a given document, the query in Figure 17.40 finds the documents that were validated with a given XML Schema. Again, the function XMLXSROBJECTID facilitates the join between the customer table and the XML Schema Repository. The second and the third predicates select the particular XML Schema db2admin.custxsd for which the query finds all corresponding XML documents.

SELECT c.id
FROM customer c, syscat.xsrobjects x
WHERE XMLXSROBJECTID(c.info) = x.OBJECTID
  AND x.objectschema = 'DB2ADMIN'
  AND x.objectname = 'CUSTXSD'
Figure 17.40
Finding documents for given XML Schema, using XMLXSROBJECTID
Since DB2 9.5 for Linux, UNIX, and Windows you can also use the IS VALIDATED predicate with the ACCORDING TO clause, as shown in Figure 17.41.

SELECT c.id
FROM customer c
WHERE c.info IS VALIDATED ACCORDING TO XMLSCHEMA ID db2admin.custxsd
Figure 17.41
Finding documents for given XML Schema, using IS VALIDATED
If you use multiple XML Schemas to validate documents within a single XML column, and if you frequently need to run queries that relate documents to schemas, consider storing the OBJECTID in an additional column of your table with an index on it. This additional column can greatly improve the performance of finding schemas and documents that relate to each other. In DB2 for z/OS, such an extra column is the only way to correlate documents to schemas.
17.11
HOW TO UNDO DOCUMENT VALIDATION
It is possible to make a validated XML document look and behave as if it had never been validated. When you “undo” the validation, the linkage between the document and any XML Schema is removed, because the OBJECTID of an XML Schema is no longer associated with the document. All it takes is to update the validated document with itself and reparse it without validation. You will probably rarely have to do this, but we want to show that it is possible if needed. It only applies to DB2 for Linux, UNIX, and Windows.

You “remove validation” from a document with an UPDATE statement and the XMLSERIALIZE and XMLPARSE functions as shown in Figure 17.42. This statement serializes the stored document tree back to text format and then parses it again to produce DB2’s internal tree format, but without validation (assuming you don’t have triggers that enforce validation). The document now looks like it has never been validated.

UPDATE customer
SET info = XMLPARSE(DOCUMENT XMLSERIALIZE(info AS CLOB(5000)))
WHERE id = 1000
Figure 17.42
Undoing validation disassociates a document from its schema
Note that the XMLSERIALIZE function requires you to use a character type, such as VARCHAR or CLOB, that is large enough to temporarily hold the serialized document.
17.12
CONSIDERATIONS FOR VALIDATION IN DB2 FOR Z/OS
Throughout this chapter you have seen many ways in which the function XMLVALIDATE can be used in DB2 for Linux, UNIX, and Windows to validate XML documents against an XML Schema. The equivalent function in DB2 9 for z/OS is called SYSFUN.DSN_XMLVALIDATE. The main difference between the two is that DSN_XMLVALIDATE must be an argument to the XMLPARSE function. The other difference is that DSN_XMLVALIDATE does not use an ACCORDING TO XMLSCHEMA clause to identify an XML Schema, but a regular parameter instead. The following sections provide examples.
17.12.1
Document Validation Upon Insert
The DSN_XMLVALIDATE function can take either two or three input parameters. The first parameter is the XML document that you want to validate. It must be of type CLOB or BLOB with a maximum size of 250MB, or of type VARCHAR with a maximum size of 32KB. If you are using DSN_XMLVALIDATE with two parameters, then the second parameter has to be the SQL identifier of the XML Schema that you want to use for validation. This parameter cannot be NULL.

Figure 17.43 shows two INSERT statements that use DSN_XMLVALIDATE with two parameters. The first statement provides the XML document as a parameter marker, and the second uses a host variable. Both specify that the document is to be validated against the XML Schema SYSXSR.CUSTXSD. An error is returned if an XML Schema with this identifier is not found in DB2’s XML Schema Repository (XSR).

INSERT INTO customer(id, info)
VALUES (?, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
             CAST(? AS CLOB), 'SYSXSR.CUSTXSD')));

INSERT INTO customer(id, info)
VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
               :document_hv, 'SYSXSR.CUSTXSD')));
Figure 17.43
Referencing the XML Schema by its SQL identifier
If you are using DSN_XMLVALIDATE with three parameters, then the second and third parameters must be the target namespace and the schema location of the XML Schema that you want to use for validation (see Figure 17.44). This combination of target namespace and schema location must uniquely identify an XML Schema that is registered in the XSR, otherwise an error is raised. If you use DSN_XMLVALIDATE with three parameters, the second and/or the third parameter can be NULL. In this case DB2 still looks for a corresponding XML Schema in its XML Schema Repository. If both parameters are NULL, DB2 expects to find exactly one schema in the XSR whose target namespace and schema location are NULL. DB2 for z/OS does not infer the schema from a schema location attribute inside the XML document that you want to validate.

INSERT INTO customer(id, info)
VALUES (?, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
             CAST(? AS CLOB), 'http://pureXMLcookbook.org', NULL)));

INSERT INTO customer(id, info)
VALUES (?, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
             :document_hv, 'http://pureXMLcookbook.org', 'customer.xsd')));

INSERT INTO customer(id, info)
VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
               :document_hv, NULL, 'customer.xsd')));

INSERT INTO customer(id, info)
VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
               :document_hv, NULL, NULL)));

Figure 17.44
Referencing the XML Schema by target namespace and schema location
The previous examples provided either the SQL identifier of the XML Schema, or the target namespace and schema location as string literals. Alternatively you can provide them through parameter markers or host variables. The first INSERT statement in Figure 17.45 uses the DSN_XMLVALIDATE function with two parameter markers. The first provides the document to validate and the second provides the SQL identifier of the XML Schema. The second parameter cannot provide an actual XML Schema document for validation, because DB2 only validates against schemas that were previously registered in the XSR. The second INSERT statement in Figure 17.45 uses DSN_XMLVALIDATE with three host variables, which means that the schema is being identified by target namespace and schema location.

INSERT INTO customer(id, info)
VALUES (?, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
             CAST(? AS CLOB), ?)));

INSERT INTO customer(id, info)
VALUES (:id, XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
               :document_hv, :tgtnamespace_hv, :schemalocation_hv)));
Figure 17.45
Providing schema identification via parameter markers or host variables
The DSN_XMLVALIDATE function can only be used as a parameter to the XMLPARSE function, and in that case the XMLPARSE function cannot use the PRESERVE WHITESPACE clause. Validation always implies that boundary whitespace is stripped, not preserved, in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows.
17.12.2
Document Validation Upon Update
If you use SQL UPDATE statements in DB2 for z/OS to replace existing documents, the DSN_XMLVALIDATE function allows you to validate the new document as part of the update process. In the previous sections you have seen various ways in which you can provide input to the DSN_XMLVALIDATE function. All of them work in UPDATE statements as well, as in Figure 17.46.

    UPDATE customer
      SET info = XMLPARSE( DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                   :document_hv, 'SYSXSR.CUSTXSD') )
      WHERE id = 1003
Figure 17.46
DSN_XMLVALIDATE in an UPDATE statement
17.12.3
Validating Existing Documents in a Table
There may be situations where you already have XML documents stored in an XML column and want to validate them against an XML Schema. For example, the query in Figure 17.47 selects all documents for customers in Toronto and validates them upon retrieval. Remember that the DSN_XMLVALIDATE function requires the input document to be of type CLOB or BLOB. However, the column info in our customer table is of type XML. Therefore, at the time of writing, the function XMLSERIALIZE is required to convert the XML documents to type CLOB or BLOB.

    SELECT XMLPARSE( DOCUMENT SYSFUN.DSN_XMLVALIDATE(
             XMLSERIALIZE(info AS CLOB), 'SYSXSR.CUSTXSD') )
    FROM customer
    WHERE XMLEXISTS('$i/customerinfo/addr[city = "Toronto"]'
                    PASSING info AS "i");
Figure 17.47
Validating existing documents in a table
The query in Figure 17.47 parses and validates all matching documents, which requires more CPU cycles than simply retrieving the documents without reparsing them. The query raises an error as soon as one document is encountered that is not valid against the schema SYSXSR.CUSTXSD. You can capture and handle this error in a stored procedure, similar to how it is discussed in section 17.9.
17.12.4
Summary of Platform Similarities and Differences
Table 17.2 provides a summary of the differences in validation functionality between DB2 for z/OS and DB2 for Linux, UNIX, and Windows. This comparison is a point-in-time snapshot and subject to change. Over time, the supported features in DB2 for z/OS and DB2 for Linux, UNIX, and Windows continue to converge.
Table 17.2
Summary of Platform Similarities and Differences

Feature | DB2 for Linux, UNIX, and Windows | DB2 for z/OS
Document validation for INSERT and UPDATE operations | Yes | Yes
Validation function | XMLVALIDATE | DSN_XMLVALIDATE; always has to be an argument of the XMLPARSE function.
Can reference XML Schema by its SQL identifier | Yes | Yes
Can reference XML Schema by target namespace and schema location | Yes | Yes
Can validate existing documents in a table | Yes | Yes
Can perform validation in stored procedures | Yes | Yes
Validation support in the LOAD utility | Yes | You can validate documents after LOAD.
Link between documents and schemas is stored with each validated document | Yes* | You can maintain this information in a separate column of the user table.
IS VALIDATED predicate to check whether a document has been validated | Yes* | You can get this information from a separate column in the user table where you record the schema ID for each document.
Function XMLXSROBJECTID to find documents for a given schema, or vice versa | Yes* |

*If you query the relationship between documents and schemas often, you might want to maintain this information (the schema ID for any given document) in a separate column that is indexed to ensure good performance.
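The "separate column" approach mentioned in the table can be sketched as follows for DB2 for z/OS. This is only an illustration under stated assumptions: the table customer_v, the column schema_id, and the index name are hypothetical, and the schema identifier SYSXSR.CUSTXSD is reused from the earlier figures.

```sql
-- Hypothetical table: schema_id records which schema (if any) each document was validated against
CREATE TABLE customer_v (id INTEGER NOT NULL, info XML, schema_id VARCHAR(128));

-- Index the column so that lookups by schema stay cheap
CREATE INDEX customer_v_schema_ix ON customer_v(schema_id);

-- Validate on insert and record the schema identifier in the same statement
INSERT INTO customer_v(id, info, schema_id)
  VALUES (:id,
          XMLPARSE( DOCUMENT SYSFUN.DSN_XMLVALIDATE(:document_hv, 'SYSXSR.CUSTXSD') ),
          'SYSXSR.CUSTXSD');

-- Find the documents that were validated against a given schema ...
SELECT id FROM customer_v WHERE schema_id = 'SYSXSR.CUSTXSD';

-- ... or the documents that were stored without validation (schema_id left NULL)
SELECT id FROM customer_v WHERE schema_id IS NULL;
```

The application is responsible for keeping schema_id in step with the validation it actually performs; DB2 does not check that the two agree.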
17.13
SUMMARY
Validating XML documents against XML Schemas is the best way to enforce XML data quality in the database. However, document validation is optional in DB2 and there is no performance or functional penalty if you don’t use an XML Schema. If you choose to validate documents, you typically do so when you insert, update, or load them. Existing documents in the database can also be validated in queries. An XML column can contain a mix of validated and non-validated documents, and different documents in a column can be validated with different schemas. In DB2 you are not forced to assign a single XML Schema to an entire XML column.

There are two general approaches for document validation in DB2:

• Application-centric: Applications use the XMLVALIDATE (or DSN_XMLVALIDATE) function in their INSERT and UPDATE statements. This makes validation a distributed responsibility and provides maximum flexibility.
• Database-centric: The database uses triggers and check constraints to enforce validation on a per-XML-column basis.

These application- and database-centric techniques can also be combined to implement a custom validation strategy that meets specific requirements.
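As a reminder of what the database-centric approach can look like in DB2 for Linux, UNIX, and Windows, here is a minimal sketch. The schema identifier db2admin.custxsd and the constraint and trigger names are assumptions chosen for illustration; substitute the SQL identifier of a schema that is registered in your XSR. The statements use # as the terminator.

```sql
-- Reject any document in the info column that has not been validated against the schema
ALTER TABLE customer ADD CONSTRAINT ck_info_valid
  CHECK (info IS VALIDATED ACCORDING TO XMLSCHEMA ID db2admin.custxsd)#

-- Validate incoming documents automatically if the application did not do so
CREATE TRIGGER validate_info
  NO CASCADE BEFORE INSERT ON customer
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
  WHEN (n.info IS NOT VALIDATED)
  BEGIN ATOMIC
    SET n.info = XMLVALIDATE(n.info ACCORDING TO XMLSCHEMA ID db2admin.custxsd);
  END #
```

Used together, the trigger validates what the application left unvalidated, and the check constraint confirms that every stored document has been validated against the expected schema.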
Chapter 18
Using XML in Stored Procedures, UDFs, and Triggers
Stored procedures, user-defined functions (UDFs), and triggers are database objects that encapsulate processing steps to retrieve or manipulate data in the database. They can contain multiple statements that are invoked and executed as a single unit. They are typically used to implement application-specific logic. Stored procedures and UDFs can be implemented in the SQL Procedure Language (SQL PL) or in external languages such as Java, C, or COBOL. The benefits of stored procedures and UDFs include:

• Reduced coding labor due to the creation of reusable processing modules
• Richer processing capabilities in the database through custom logic and functions
• Improved performance and reduced network traffic because stored procedures and UDFs are executed close to the data; that is, in the database engine

Stored procedures are executed with CALL statements, which can be issued from an application program, from another stored procedure, from a UDF, or from a trigger. UDFs are used in SQL statements just like predefined SQL functions. Triggers are executed automatically when an insert, delete, or update operation happens on a specified table. Triggers are used to implement automated reactions to data modifications and to enforce data integrity rules within the database. The benefits of stored procedures, UDFs, and triggers apply equally to the processing of XML data and relational data.

In this chapter we discuss the following topics:

• Manipulating XML data in stored procedures (section 18.1)
• Manipulating XML data in user-defined functions (section 18.2)
• Manipulating XML data in triggers (section 18.3)
For general background on stored procedures, UDFs, triggers, and the SQL Procedure Language, please consult the resources listed in Appendix C, Further Reading.
18.1
MANIPULATING XML IN SQL STORED PROCEDURES
Stored procedures are a powerful tool for application development. They allow you to define simple or complex multi-statement operations and processing logic that can be invoked with a single call from the application. Stored procedures can encapsulate and hide complex data manipulation from the client application. Since stored procedures are executed in the database server, they can process data without moving it to the client, which is often beneficial for performance. In previous chapters you have already seen several examples where stored procedures implement specific tasks:

• Section 7.7, Figure 7.41: Stored procedure to execute XPath dynamically
• Section 17.3, Figure 17.7: Stored procedure to handle and record validation errors
• Section 17.9, Figure 17.36: Stored procedure to validate an existing document
• Section 17.9, Figure 17.38: Stored procedure to validate multiple existing documents

DB2 for Linux, UNIX, and Windows allows you to use the XML data type not just to define columns in a table, but also to declare input and output parameters as well as variables in stored procedures and user-defined functions. Stored procedures can therefore manipulate XML documents in their parsed format without incurring additional XML parsing, which is a major performance benefit.

Variables of data type XML can be manipulated in stored procedures much like variables of other types. For example, XML variables can receive their value through statements such as a SET statement or a SELECT INTO statement. The only restriction is that XML variables and XML input parameters lose their value upon a COMMIT or ROLLBACK operation. If you want to use an XML variable or parameter after a ROLLBACK or COMMIT statement, you need to assign a new value to it first. Otherwise error SQL1354N is raised.

The best way to use XPath or XQuery expressions in stored procedures is to embed them in the SQL/XML functions XMLQUERY, XMLTABLE, or XMLEXISTS. These can be used in stored procedure statements and accept variables of type XML in their PASSING clause. You can also use XQuery without SQL in stored procedures, but only with dynamic cursors. Static XQuery is not allowed.
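The COMMIT restriction described above can be illustrated with a small sketch. The procedure name and its parameters are hypothetical, and the customer table with its cid and info columns is assumed to be the sample table used in this chapter. The point is only the reassignment of the XML variable after the COMMIT.

```sql
CREATE PROCEDURE phoneCountAcrossCommit(IN custId INTEGER,
                                        OUT before_commit INTEGER,
                                        OUT after_commit INTEGER)
BEGIN
  DECLARE doc XML;

  -- first assignment of the XML variable
  SELECT info INTO doc FROM customer WHERE cid = custId;
  SET before_commit = XMLCAST(XMLQUERY('count($d/customerinfo/phone)'
                                       PASSING doc AS "d") AS INTEGER);

  COMMIT;

  -- doc has lost its value at this point; assign it again before the next use
  SELECT info INTO doc FROM customer WHERE cid = custId;
  SET after_commit = XMLCAST(XMLQUERY('count($d/customerinfo/phone)'
                                      PASSING doc AS "d") AS INTEGER);
END #
```

Without the second SELECT INTO, the reference to doc after the COMMIT would be expected to fail with SQL1354N, as described above. Note that the body uses BEGIN rather than BEGIN ATOMIC, because a COMMIT is not permitted inside an atomic compound statement.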
18.1.1
Basic XML Manipulation in Stored Procedures
Let’s look at Figure 18.1 to become familiar with the basic capabilities of handling XML data in stored procedures. The table addrtable is defined in addition to the customer table that we have been using. The stored procedure has one input parameter and one output parameter, both of type XML. Additionally, the procedure declares the variables id and address of type INTEGER and XML, respectively.

The first SET statement extracts the Cid attribute from the input document, converts it to INTEGER, and assigns it to the variable id. Note that the input parameter custDoc is passed into the XMLQUERY function. Next is the SELECT-INTO statement, which demonstrates two important capabilities. First, the INTO clause is used to assign an XML value to the XML output parameter olddoc. Second, the variable id is passed into the XMLEXISTS predicate so that only the matching document is retrieved from the customer table.

The last part of the stored procedure shows that you can use the XMLEXISTS predicate directly in an IF statement. It checks whether the address in the input document is in Canada. If this is true then the SET statement extracts the addr element of the document and assigns it to the XML variable address. Subsequently the address and the id variables are inserted into the table addrtable.

    CREATE TABLE addrtable(id INTEGER, addr XML)#

    CREATE PROCEDURE processDoc(IN custDoc XML, OUT oldDoc XML)
    BEGIN ATOMIC
      DECLARE id INTEGER;
      DECLARE address XML;

      SET id = XMLCAST(XMLQUERY('$d/customerinfo/@Cid'
                                PASSING custDoc AS "d") AS INTEGER);

      SELECT info INTO olddoc
      FROM customer
      WHERE XMLEXISTS('$INFO/customerinfo[@Cid = $x]' PASSING id AS "x");

      IF XMLEXISTS('$d/customerinfo/addr[@country = "Canada"]'
                   PASSING custDoc AS "d")
      THEN
        SET address = XMLQUERY('$d/customerinfo/addr' PASSING custDoc AS "d");
        INSERT INTO addrtable(id, addr) VALUES(id, XMLDOCUMENT(address));
      END IF;
    END #
Figure 18.1
Stored procedure with basic XML manipulation
Since the body of a stored procedure can contain multiple statements, these statements have to be separated by the semicolon character. This use of the semicolon conflicts with the fact that the semicolon is also the default terminating character for statements in the DB2 Command Line Processor (CLP). The same applies to user-defined functions and triggers. To avoid problems you need to use a different terminating character in the CLP. For example, in Figure 18.1 the # is used as the terminating character for the CREATE PROCEDURE statement. You must invoke the CLP with the -td# option to set the #, or any other character of your choosing, as the statement terminator. If the CREATE PROCEDURE statement in Figure 18.1 is in a file create_proc.sql then the following command issued at the OS prompt creates the procedure:

    db2 -td# -f create_proc.sql
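If a script mixes ordinary statements with procedure definitions, the terminator can also be switched inside the file. The following sketch assumes that your CLP level supports the --#SET TERMINATOR directive; the file name, the procedure countAddr, and its body are hypothetical and serve only to show the switching.

```sql
-- create_objects.sql: run with  db2 -tf create_objects.sql  (default terminator is ;)
CREATE TABLE addrtable(id INTEGER, addr XML);

--#SET TERMINATOR #
CREATE PROCEDURE countAddr(OUT n INTEGER)
BEGIN ATOMIC
  -- the semicolons in the body are now statement separators only
  SET n = (SELECT COUNT(*) FROM addrtable);
END #

--#SET TERMINATOR ;
CALL countAddr(?);
```

This keeps the semicolon for the simple statements and uses # only around the routine body, so the whole file can be run with the plain -t option.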
18.1.2
A Stored Procedure to Store XML in a Hybrid Manner
Let’s look at a common use case for a stored procedure. Assume you want to store the customer sample documents in a hybrid fashion. You might decide to keep the address information as XML, because you expect it to be of variable format over time, but you want to store customer name and phone information in relational columns. Since each customer can have multiple phone numbers (one-to-many relationship), the phone numbers have to be stored in a separate table with a proper join key. That join key can be a number generated by a sequence for each new XML document that comes in. A sequence is a database object that produces a stream of unique values. Figure 18.2 shows the definition of the target tables and the sequence.

    CREATE TABLE cust (id INTEGER, name VARCHAR(20), addr XML);
    CREATE TABLE phone(id INTEGER, type VARCHAR(20), number VARCHAR(20));
    CREATE SEQUENCE id_seq START WITH 1 INCREMENT BY 1 CACHE 100;
Figure 18.2
Table and sequence definition for hybrid storage
The stored procedure in Figure 18.3 takes a customer XML document as an input parameter. Note that this parameter is of type XML. Each time the procedure is called, it uses the NEXTVAL expression to pull a new id value from the sequence. Then it uses two INSERT statements with XMLTABLE functions to extract the required values for insert into the target tables cust and phone. The first insert produces one row per customer, the second produces one row per phone element. The same id value is used for inserts into both tables to ensure referential integrity. Instead of using the sequence, the id could also be passed as a parameter from the calling application, or extracted from the document.

    CREATE PROCEDURE insertCustomer(IN custDoc XML, OUT id INTEGER)
    BEGIN ATOMIC
      SET id = NEXTVAL FOR id_seq;

      INSERT INTO cust(id, name, addr)
        SELECT id, T.name, T.address
        FROM XMLTABLE('$d/customerinfo' PASSING custDoc AS "d"
               COLUMNS
                 name    VARCHAR(20) PATH 'name',
                 address XML         PATH 'document{addr}' ) AS T;

      INSERT INTO phone (id, type, number)
        SELECT id, T.type, T.num
        FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
               COLUMNS
                 type VARCHAR(20) PATH '@type',
                 num  VARCHAR(20) PATH '.') AS T;
    END #

Figure 18.3
Stored procedure for hybrid XML inserts
With the stored procedure in Figure 18.3 in place, an application should use the stored procedure call CALL insertCustomer(?, ?) to insert new customer documents and never use direct INSERT statements. If all inserts are performed through this stored procedure, the relational and XML data in the tables are always consistent. You can have similar stored procedures for update and delete operations. The stored procedures can also contain additional business logic or data manipulation.

A challenging situation occurs when the stored procedure in Figure 18.3 fails with the following error message, where <value> is a data value in the input document that cannot be cast to the data type VARCHAR(20):

    SQL16061N The value "<value>" cannot be constructed as, or cast (using an implicit or explicit cast) to the data type "VARCHAR_20". Error QName=err:FORG0001. SQLSTATE=10608.
Note that the XMLTABLE functions in the stored procedure cast the customer name, phone type, and phone number to VARCHAR(20). However, the error message does not specify which one of them caused the problem. In this simple example, a quick look at the <value> might reveal which XML element or attribute caused the error. However, in more complex cases it is often difficult to identify which element or attribute is responsible for the error. The solution is to add code to the stored procedure to catch the SQL error, obtain the offending <value>, look for it in the input document, and return the name of the XML element or attribute that caused the problem. This logic is coded in Figure 18.4.

The INSERT statements in the procedure in Figure 18.4 are the same as previously in Figure 18.3. The difference in Figure 18.4 is the error handling. The procedure declares SQLSTATE 10608 as a condition, and an exit handler to take appropriate action when this condition occurs. This action is enclosed in a separate BEGIN-END block and only executed when the declared error happens. The exit handler obtains the error information and uses the SUBSTR function to extract the offending <value> and data type from it. Then it uses the XQuery expression $d//(*,@*)[data(.) = $v]/local-name() to obtain the name of the element or attribute that contains the offending value. In this expression, $d represents the XML document and $v the value to look for. The first part of the expression, $d//(*,@*), iterates over all elements and attributes in the document. For each of those, the predicate [data(.) = $v] checks whether the value of the element or attribute matches the <value> from the error message. If the predicate is true, then the last step of the expression, /local-name(), obtains the name of the element or attribute. The whole expression is an argument of the function string-join, which produces a comma-separated list in case more than one node with the matching value is found in the document.

    CREATE PROCEDURE insertCustomer(IN custDoc XML, OUT id INTEGER,
                                    OUT MESSAGE_TEXT VARCHAR(300))
    BEGIN ATOMIC
      DECLARE vErrMsg VARCHAR(300);
      DECLARE vValue VARCHAR(100);
      DECLARE vNode VARCHAR(100);
      DECLARE vType VARCHAR(100);
      DECLARE vTokenString VARCHAR(100);
      DECLARE XMLTABLE_CAST_FAILURE CONDITION FOR SQLSTATE '10608';

      DECLARE EXIT HANDLER FOR XMLTABLE_CAST_FAILURE
      BEGIN
        -- retrieve error message and token string
        GET DIAGNOSTICS EXCEPTION 1 vTokenString = DB2_TOKEN_STRING,
                                    vErrMsg = MESSAGE_TEXT;
        SET vValue = SUBSTR(vErrMsg, 23, POSSTR(vErrMsg, '" ')-23);
        SET vType = SUBSTR(vTokenString, LENGTH(vValue)+2);
        -- find xml nodes whose values match the error token
        SET vNode = XMLCAST(XMLQUERY('
              string-join($d//(*,@*)[data(.) = $v]/local-name(),",")'
              PASSING custDoc AS "d", vValue AS "v") AS VARCHAR(100));
        -- create message text
        SET MESSAGE_TEXT = 'Failed to cast the value "' || vValue ||
              '", at element or attribute "' || vNode ||
              '", to type "' || vType || '".';
      END ;

      SET id = NEXTVAL FOR id_seq;

      INSERT INTO cust(id, name, addr)
        SELECT id, T.name, T.address
        FROM XMLTABLE('$d/customerinfo' PASSING custDoc AS "d"
               COLUMNS
                 name    VARCHAR(20) PATH 'name',
                 address XML         PATH 'document{addr}' ) AS T;

      INSERT INTO phone (id, type, number)
        SELECT id, T.type, T.num
        FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
               COLUMNS
                 type VARCHAR(20) PATH '@type',
                 num  VARCHAR(20) PATH '.') AS T;

      SET MESSAGE_TEXT = 'Insert successful.';
    END #

Figure 18.4
Stored procedure for hybrid XML inserts with error handling
18.1.3
Loops and Cursors
The example in Figure 18.5 shows that you can easily loop over the elements and attributes from one or multiple XML documents. The stored procedure takes an XML document as input and uses a SELECT statement with an XMLTABLE function to produce one row for each phone element. The FOR statement is used to iterate over these rows. When a FOR statement is executed, a cursor is implicitly declared such that each iteration of the FOR loop fetches the next row from the result set until there are no rows left. For each row, the statements in the DO clause of the FOR statement are executed. An IF-THEN-ELSE statement inserts the phone information into the table cellphones if the phone type is cell, and into the table landlines otherwise. To keep stored procedures simple, we recommend the use of FOR statements instead of explicit cursor declarations whenever possible.

    CREATE TABLE cellphones(id INTEGER, number VARCHAR(20))#
    CREATE TABLE landlines(id INTEGER, number VARCHAR(20))#

    CREATE PROCEDURE processPhones(IN custDoc XML)
    BEGIN ATOMIC
      FOR phone AS
        SELECT T.id, T.type, T.num
        FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
               COLUMNS
                 id   INTEGER     PATH '../@Cid',
                 type VARCHAR(5)  PATH '@type',
                 num  VARCHAR(20) PATH '.') AS T
      DO
        IF phone.type='cell' THEN
          INSERT INTO cellphones(id, number) VALUES(phone.id, phone.num);
        ELSE
          INSERT INTO landlines(id, number) VALUES(phone.id, phone.num);
        END IF;
      END FOR;
    END #
Figure 18.5
FOR loop over repeating XML elements
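For comparison, the same logic with an explicit cursor might look roughly like the sketch below. The procedure name, the v_ variables, and the at_end flag are hypothetical; the tables and the XMLTABLE expression are those of Figure 18.5. The extra declarations, handler, and FETCH statements show why the FOR statement is usually the simpler choice.

```sql
CREATE PROCEDURE processPhonesCursor(IN custDoc XML)
BEGIN
  DECLARE v_id   INTEGER;
  DECLARE v_type VARCHAR(5);
  DECLARE v_num  VARCHAR(20);
  DECLARE at_end SMALLINT DEFAULT 0;

  -- explicit cursor over the same row set that the FOR statement iterates
  DECLARE c1 CURSOR FOR
    SELECT T.id, T.type, T.num
    FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
           COLUMNS
             id   INTEGER     PATH '../@Cid',
             type VARCHAR(5)  PATH '@type',
             num  VARCHAR(20) PATH '.') AS T;

  -- set the flag when a FETCH finds no more rows
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET at_end = 1;

  OPEN c1;
  FETCH c1 INTO v_id, v_type, v_num;
  WHILE at_end = 0 DO
    IF v_type = 'cell' THEN
      INSERT INTO cellphones(id, number) VALUES(v_id, v_num);
    ELSE
      INSERT INTO landlines(id, number) VALUES(v_id, v_num);
    END IF;
    FETCH c1 INTO v_id, v_type, v_num;
  END WHILE;
  CLOSE c1;
END #
```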
You can also use XQuery without SQL in stored procedures, but not in a FOR statement or any static manner. You have to construct the XQuery dynamically as a string and prepare and open it as a dynamic cursor. In Figure 18.6 an XQuery string is assigned to the variable xqr. Note that the query string includes the value of the input parameter city. The query is then prepared and opened as a CURSOR WITH RETURN TO CALLER. With this cursor definition, the result sequence of the XQuery becomes the result set of the stored procedure. The procedure does not fetch from or close the cursor, which allows the calling application to iterate over the result of the query. Alternatively you could decide to have a WHILE loop with a FETCH statement in the stored procedure itself to process the result set.

    CREATE PROCEDURE cityphones(IN city VARCHAR(20))
    BEGIN ATOMIC
      DECLARE xqr VARCHAR(2048);
      DECLARE c1 CURSOR WITH RETURN TO CALLER FOR stmt;

      SET xqr = 'xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")
                 where $i/customerinfo/addr[city="'|| city ||'"]
                 return $i/customerinfo/phone';

      PREPARE stmt FROM xqr;
      OPEN c1;
    END #
Figure 18.6
Dynamic cursor for an XQuery
18.1.4
A Stored Procedure to Update a Selected XML Element or Attribute
The stored procedure in Figure 18.7 changes the value of a selected XML node in a document. The input parameters to the procedure are an XML document, the path to the node that is to be updated, and the new value of the node. The parameter for the XML document is declared as INOUT, so that the updated document is returned. The procedure constructs an XQuery update expression in an XMLQUERY function. The input parameter xpath provides the target path for the replace clause. Additionally, the document and the new value are passed as parameters into the XQuery Update expression. The statement OPEN c1 USING mydoc, value binds the procedure parameters mydoc and value to the parameter markers in the XMLQUERY function.

    CREATE PROCEDURE updateXPath (INOUT mydoc XML, IN xpath VARCHAR(1024),
                                  IN value VARCHAR(128))
    BEGIN ATOMIC
      DECLARE sql VARCHAR(2048);
      DECLARE c1 CURSOR FOR stmt;

      SET sql = 'VALUES XMLQUERY(''
                   copy $new := $original
                   modify do replace value of $new' || xpath ||'
                             with $value
                   return $new ''
                 PASSING XMLCAST(? AS XML) AS "original",
                         CAST(? AS VARCHAR(1024)) AS "value") ';

      PREPARE stmt FROM sql;
      OPEN c1 USING mydoc, value;
      FETCH c1 INTO mydoc;
      CLOSE c1;
    END #

Figure 18.7
Stored procedure to update a selected XML element or attribute
18.1.5
Three Tips for Testing Stored Procedures
The following three tips seem to be not as widely known as they should be, but they are extremely useful.

Tip 1: How to Test Stored Procedures in the CLP

It is often very useful to test stored procedures in the CLP without having to have application code that calls the procedure and passes an XML document as input. You can simply import your test documents into a DB2 table, such as testdocs, and use an SQL fullselect as the input parameter in the stored procedure call in the CLP. Make sure that the fullselect produces exactly one row with one column of type XML, as shown in Figure 18.8. The second parameter is a question mark as a placeholder for the output parameter oldDoc.

    CREATE TABLE testdocs(id INTEGER NOT NULL PRIMARY KEY, doc XML);

    IMPORT FROM testdata.del OF DEL INSERT INTO testdocs;

    CALL processDoc( (SELECT doc FROM testdocs WHERE id = 3),? );
Figure 18.8
Testing a stored procedure
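The same technique should carry over to the other procedures in this chapter. As a sketch, assuming the testdocs table from Figure 18.8 and the three-parameter insertCustomer procedure from Figure 18.4, a CLP test could look like this, with one question mark for each of the two output parameters:

```sql
-- pass a stored test document as input; the two ? receive id and MESSAGE_TEXT
CALL insertCustomer( (SELECT doc FROM testdocs WHERE id = 3), ?, ? );

-- inspect what the procedure wrote to the target tables
SELECT id, name FROM cust;
SELECT id, type, number FROM phone;
```

The CLP prints the values of the output parameters after the call, so a cast failure shows up directly in the returned MESSAGE_TEXT.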
Tip 2: How to Get the Execution Plan of a Stored Procedure

If a stored procedure does not perform well then it can be useful to examine the execution plans of queries or other statements in the stored procedure. One approach is to copy individual statements from the stored procedure and to explain them separately. However, it can happen that a statement has a different execution plan when it is compiled in the context of a stored procedure than when it is compiled by itself. In DB2 for Linux, UNIX, and Windows you can use the following approach to explain the statements within a stored procedure.

1. Establish a connection to the database.

2. Create explain tables if they do not already exist (see section 14.1.1, The Explain Tables in DB2 for Linux, UNIX, and Windows).

3. Issue the following command at the OS prompt to enable the capturing of execution plans when stored procedures are created in the current session:

    db2 "CALL SYSPROC.SET_ROUTINE_OPTS('EXPLAIN ALL')"

4. If a CREATE PROCEDURE statement is the only statement in a file called create_proc.sql, and if the statement is terminated with the # character, create the procedure with the following command at the OS prompt:

    db2 -td# -f create_proc.sql

5. Use the db2exfmt utility to write the execution plan to a file such as myprocplan.txt:

    db2exfmt -d <dbname> -1 -o myprocplan.txt

The output file will contain separate explain information for each statement in the stored procedure. If you want to check whether the capturing of explain information for stored procedures is enabled, use the following SELECT statement:

    SELECT GET_ROUTINE_OPTS() FROM sysibm.sysdummy1

To revert to not explaining stored procedures, use this statement:

    db2 "CALL SYSPROC.SET_ROUTINE_OPTS('EXPLAIN NO')"

Tip 3: How to Profile a Stored Procedure

IBM Data Studio Developer contains a very useful stored procedure profiler that can provide information about the runtime performance of a procedure. For each statement in the stored procedure, the profile reveals the number of executions, the elapsed time, CPU time, and other optional metrics such as the number of rows read or written, or the number of logical and physical page reads. This information is extremely helpful to understand the behavior of a complex stored procedure and to discover which parts of a procedure are particularly expensive to run. If you have a Data Development Project in Data Studio and a stored procedure in the Stored Procedures folder of the Data Project Explorer, right-click on the procedure name and choose Run Profiling. The same context menu also has a command to invoke the stored procedure debugger, which is another helpful tool for the development of stored procedures in DB2 for Linux, UNIX, and Windows, and DB2 for z/OS.
18.2
MANIPULATING XML IN USER-DEFINED FUNCTIONS
DB2 9.7 for Linux, UNIX, and Windows allows you to use the XML data type in user-defined functions (UDFs). UDFs can have XML type parameters and variables and can contain SQL/XML statements that manipulate XML data. Most of these capabilities are similar to the XML support in stored procedures. An important difference between UDFs and stored procedures is that UDFs can be used in SQL statements while stored procedures can only be invoked with a CALL statement. In this section we discuss several examples of UDFs that manipulate XML data.
18.2.1
A UDF to Extract an Element or Attribute Value
The function getname in Figure 18.9 takes an XML document as input and returns a value of type VARCHAR(25). The body of the function consists of a single RETURN statement. It contains the functions XMLCAST and XMLQUERY to extract the name element and convert it to VARCHAR(25). The PASSING clause of the XMLQUERY function passes the function’s input parameter doc into the XPath expression. Below the function you see an SQL statement that invokes the function in its SELECT clause. The use of the UDF allows an application to retrieve customer names without having to code the actual XPath expression and SQL/XML functions.

    CREATE FUNCTION getname(doc XML)
      RETURNS VARCHAR(25)
      LANGUAGE SQL
      CONTAINS SQL
      NO EXTERNAL ACTION
      DETERMINISTIC
    BEGIN ATOMIC
      RETURN XMLCAST(XMLQUERY('$d/customerinfo/name'
                              PASSING doc AS "d") AS VARCHAR(25));
    END #

    SELECT getname(info) AS name FROM customer WHERE cid = 1002 #

    NAME
    -------------------------
    Jim Noodle

      1 record(s) selected.
Figure 18.9
Scalar UDF to extract an element value
Such a scalar UDF also enables you to create a table with a generated column whose value is automatically computed based on the XML documents in an XML column:

    CREATE TABLE custinfo(info XML,
                          name VARCHAR(25) GENERATED ALWAYS AS (getname(info)));
The function in Figure 18.9 is a scalar function, which means it returns a single value. If you want to use a similar function to extract a repeating element then a table function instead of a scalar function can be more appropriate. This is shown next.
18.2.2
A UDF to Extract the Values of a Repeating Element
Figure 18.10 demonstrates a function that extracts the phone elements from a given document. Since a customer document can have multiple phone elements, the return type of the UDF is a table. This UDF is therefore a table function. The structure of the returned table is defined in the second line of the CREATE FUNCTION statement. The body of the function contains a RETURN statement that includes an SQL/XML query that produces the rows and columns of the result table.

Below the function you see an SQL query that uses the UDF. Since this UDF is a table function, it is used in a table expression in the FROM clause of the SELECT statement. The result set of the query includes two columns from the UDF plus the cid column from the customer table.

    CREATE FUNCTION getphone(doc XML)
      RETURNS TABLE(type VARCHAR(10), number VARCHAR(20))
    BEGIN ATOMIC
      RETURN
        SELECT type, number
        FROM XMLTABLE('$d/customerinfo/phone' PASSING doc AS "d"
               COLUMNS
                 type   VARCHAR(10) PATH '@type',
                 number VARCHAR(20) PATH '.') ;
    END #

    SELECT cid, p.type, p.number
    FROM customer, TABLE(getphone(info)) p
    WHERE cid = 1004#

    CID              TYPE       NUMBER
    ---------------- ---------- --------------------
    1004             work       905-555-4789
    1004             home       416-555-3376

      2 record(s) selected.
Figure 18.10
Table UDF to extract repeating element values
You can certainly use multiple UDFs in a single query, as illustrated by the query in Figure 18.11.

    SELECT getname(info) AS name, p.type, p.number
    FROM customer, TABLE(getphone(info)) p
    WHERE cid IN (1004, 1005)

    NAME                      TYPE       NUMBER
    ------------------------- ---------- --------------------
    Matt Foreman              work       905-555-4789
    Matt Foreman              home       416-555-3376
    Larry Menard              work       905-555-9146
    Larry Menard              home       416-555-6121

      4 record(s) selected.
Figure 18.11
Using a scalar UDF and a table UDF in a query
18.2.3
A UDF to Shred XML Data to a Relational Table
A table function can also help you shred XML data into a relational table. Suppose you want to populate the following target table:

    CREATE TABLE address(cid INTEGER, name VARCHAR(30),
                         street VARCHAR(40), city VARCHAR(30))

To shred XML documents into this table, you can create a table function that takes an XML document as input and returns a set of rows with columns that match the target table. Figure 18.12 defines such a function.

    CREATE FUNCTION extractcols(doc XML)
      RETURNS TABLE(cid INT, name VARCHAR(30),
                    street VARCHAR(40), city VARCHAR(30))
    BEGIN ATOMIC
      RETURN
        SELECT x.custid, x.custname, x.str, x.city
        FROM XMLTABLE('$d/customerinfo' PASSING doc AS "d"
               COLUMNS
                 custid   INTEGER     PATH '@Cid',
                 custname VARCHAR(30) PATH 'name',
                 str      VARCHAR(40) PATH 'addr/street',
                 city     VARCHAR(30) PATH 'addr/city' ) AS x ;
    END #
Figure 18.12
Table function to extract several elements and attributes
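Before using the function in an INSERT statement, it can help to try it on a single literal document. The following is a minimal sketch; the document content is a made-up test value whose element and attribute names simply follow the paths used in Figure 18.12.

```sql
-- Hypothetical test document; names follow the paths in extractcols.
SELECT e.cid, e.name, e.street, e.city
FROM TABLE(extractcols(
       XMLPARSE(DOCUMENT
         '<customerinfo Cid="2000">
            <name>Jane Example</name>
            <addr>
              <street>1 Main Street</street>
              <city>Toronto</city>
            </addr>
          </customerinfo>'))) AS e;
```

If the function is defined as in Figure 18.12, this query should return a single row whose columns are taken from the Cid attribute and the name, street, and city elements.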
You can then include this table function in an INSERT-INTO-SELECT-FROM statement. The first INSERT statement in Figure 18.13 reads XML documents from the XML column info of the customer table and shreds them into the address table. The function extractcols takes the XML column info as input and produces relational rows for insert into the target table. The second INSERT statement in Figure 18.13 shreds an XML document that is provided by an application through the parameter marker in the FROM clause.

INSERT INTO address(cid, name, street, city)
  SELECT e.cid, e.name, e.street, e.city
  FROM customer c, TABLE(extractcols(c.info)) e
  WHERE c.cid < 1050;

INSERT INTO address(cid, name, street, city)
  SELECT e.cid, e.name, e.street, e.city
  FROM TABLE(extractcols(cast(? as XML))) e ;

Figure 18.13 Using a table function to shred XML documents

18.2.4 A UDF to Modify an XML Document
Chapter 12, Updating and Transforming XML Documents, describes XQuery Update expressions that allow you to change the value of an element or attribute, or to insert, rename, or delete elements and attributes in a document. It can be convenient to encapsulate such update expressions in a user-defined function, which then serves as a much simpler update interface for database applications.
Using the customer documents in the sample database as an example, suppose you want to simplify the task of updating a selected phone element in a document. You could code the UDF in Figure 18.14, which has the following input parameters:

• doc: the XML document that is to be updated
• phonetype: a string such as “cell” or “work” to indicate which phone is to be updated
• number: the new telephone number

The function returns the input document where the phone element with the matching type attribute has been given the new value.

CREATE FUNCTION updatephone(doc XML, phonetype VARCHAR(8), number VARCHAR(12) )
  RETURNS XML
BEGIN ATOMIC
  RETURN XMLQUERY('copy $new := $p1
                   modify do replace value of
                          $new/customerinfo/phone[@type=$p2] with $p3
                   return $new'
                  PASSING doc AS "p1", phonetype as "p2", number as "p3");
END #

Figure 18.14 Scalar UDF to modify an XML document
If an application wants to change the work phone number of customer 1002 to the new value 408-463-4963, it can simply issue the UPDATE statement in Figure 18.15 and does not need to be concerned with the details of the underlying XQuery Update expression.

UPDATE customer
SET info = updatephone(info, 'work', '408-463-4963')
WHERE cid = 1002

Figure 18.15 UPDATE statement with a scalar UDF
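One simple way to confirm the effect of such an update is to read the phone entries back with the table UDF getphone from Figure 18.10. This is only a sketch of a verification query and assumes that getphone has been created as shown earlier.

```sql
-- List all phone entries of customer 1002 after the update;
-- the row with type 'work' should show the new number.
SELECT c.cid, p.type, p.number
FROM customer c, TABLE(getphone(c.info)) p
WHERE c.cid = 1002;
```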
Remember that the update expression “replace value of” fails if the target path ($new/customerinfo/phone[@type=$p2]) does not produce exactly one node. In other words, the invocation of the UDF in Figure 18.15 leads to an error if the document for customer 1002 does not contain a phone element whose type attribute has the value work. Therefore you might want to perform an “upsert” operation (update or insert). An “upsert” operation updates the phone element if it exists and inserts a new phone element otherwise. This logic is coded in the UDF in Figure 18.16 with an XQuery if-then-else expression. The else branch constructs a new phone element with a type attribute, and the variables $p2 and $p3 provide the values for this attribute and element, respectively. Within such attribute and element constructors the variables $p2 and $p3 have to be in curly brackets.

CREATE FUNCTION upsert_phone(doc XML, phonetype VARCHAR(8), number VARCHAR(12) )
  RETURNS XML
BEGIN ATOMIC
  RETURN XMLQUERY('copy $new := $p1
                   modify
                     if ($new/customerinfo/phone[@type = $p2])
                     then do replace value of
                             $new/customerinfo/phone[@type = $p2] with $p3
                     else do insert <phone type="{$p2}">{$p3}</phone>
                             as last into $new/customerinfo
                   return $new'
                  PASSING doc AS "p1", phonetype as "p2", number as "p3");
END #

Figure 18.16 Scalar UDF to update or insert an XML element (“upsert”)

18.3 MANIPULATING XML DATA WITH TRIGGERS
A trigger defines a set of operations that are performed in response to an INSERT, UPDATE, or DELETE statement on a specified table. For example, a trigger can perform updates to other tables, automatically generate or change values for inserted or updated rows, or invoke functions and stored procedures.

When an INSERT, UPDATE, or DELETE statement activates a trigger, the operations that are executed by the trigger can reference the column values of the rows that are being inserted, updated, or deleted. So-called transition variables allow you to reference the new column values provided in INSERT and UPDATE statements, or the old values that are removed by DELETE or UPDATE statements.

You can define triggers on tables with XML columns, and you can also define UPDATE triggers on individual XML columns in a table. In both DB2 for z/OS and DB2 for Linux, UNIX, and Windows, transition variables in triggers do not allow you to access the old or new value of an XML column. However, the transition variables do allow you to reference the old or new value of non-XML columns in the same row, such as primary key values. Therefore, triggers can still be used for effective XML manipulation, as you will see in the examples in this section.

DB2 for Linux, UNIX, and Windows has one exception where it is possible to reference the new value of an XML column as a transition variable: the new value of an XML column can be used in the XMLVALIDATE function to trigger the validation of a document that is being inserted or updated. Such a validation trigger was shown in section 17.5, Automatic Validation with Triggers.
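As a reminder of what the XMLVALIDATE exception looks like, here is a minimal sketch of a validation trigger on the customer table. The trigger name and the XML schema identifier custschema are hypothetical; the schema would have to be registered in the XML schema repository before the trigger can be created.

```sql
-- Sketch only: custschema is a hypothetical, previously registered
-- XML schema identifier; cust_validate is a made-up trigger name.
CREATE TRIGGER cust_validate
  NO CASCADE BEFORE INSERT ON customer
  REFERENCING NEW AS newrow
  FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  -- The only permitted use of the XML transition variable:
  -- pass the new document to XMLVALIDATE and assign the result back.
  SET (newrow.info) = XMLVALIDATE(newrow.info
                        ACCORDING TO XMLSCHEMA ID custschema);
END #
```

Any other reference to newrow.info, for example in an XMLQUERY or XMLTABLE call inside the trigger body, is not expected to be accepted, which is why the triggers in the following sections read the document from the table instead.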
18.3.1 Insert Triggers on Tables with XML Columns
Let’s look at an example in which triggers maintain the hybrid storage of incoming XML data. Suppose you receive XML documents such as the customer documents in the sample database. For reasons explained in section 2.4, Using a Hybrid XML/Relational Approach, you might decide to store the full document in a column of type XML and to extract a few selected element values into relational columns. For example, you might want to use relational columns to store the customer name and city as well as the type and number of the customer phones. Figure 18.17 defines the appropriate target tables. Since a customer document can contain multiple phone elements, the phone information is stored in a separate table together with a join key.

CREATE TABLE cust(cust_id INTEGER NOT NULL PRIMARY KEY
                            GENERATED ALWAYS AS IDENTITY,
                  name    VARCHAR(30),
                  city    VARCHAR(25),
                  info    XML )#

CREATE TABLE phones(cust_id INTEGER NOT NULL,
                    type    VARCHAR (5),
                    number  VARCHAR (15) )#

Figure 18.17 Tables for hybrid XML storage
Next you can define a trigger that automatically populates the relational columns in both tables whenever an XML document is inserted into the info column with an INSERT statement, such as the following:

INSERT INTO cust(info) VALUES(?)
An appropriate insert trigger is shown in Figure 18.18. The trigger is fired after a new row is inserted into the cust table but before the INSERT statement commits. The transition variable newrow can be used to reference the column values of the newly inserted row, except for the XML column. For example, newrow.cust_id identifies the generated primary key value of the inserted row. This primary key value allows subselects in the trigger to identify the newly inserted row in the table and to extract the desired element values from the new XML document in that row. Since the XML document cannot be accessed through the transition variable, the trigger accesses the document directly in the table based on the primary key that it finds in the transition variable. The body of the trigger contains an UPDATE statement and an INSERT statement. The UPDATE statement populates the columns name and city in the newly inserted row. The INSERT statement adds rows to the phones table, one row for each phone element in the new document. These rows include the primary key cust_id of the cust table so that the relationship between phones and customers is properly maintained.
CREATE TRIGGER cust_insert
  AFTER INSERT ON cust
  REFERENCING NEW AS newrow
  FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  UPDATE cust
  SET (name, city) = (SELECT X.name, X.city
                      FROM cust, XMLTABLE('$INFO/customerinfo'
                             COLUMNS
                               name VARCHAR(30) PATH 'name',
                               city VARCHAR(20) PATH 'addr/city') AS X
                      WHERE cust.cust_id = newrow.cust_id )
  WHERE cust.cust_id = newrow.cust_id;

  INSERT INTO phones(cust_id, type, number)
    SELECT cust.cust_id, P.type, P.number
    FROM cust, XMLTABLE('$INFO/customerinfo/phone'
           COLUMNS
             type   VARCHAR(5)  PATH '@type',
             number VARCHAR(15) PATH '.') AS P
    WHERE cust.cust_id = newrow.cust_id;
END#

Figure 18.18 Insert trigger

18.3.2 Delete Triggers on Tables with XML Columns
Let’s continue with the preceding example. In addition to the insert trigger you also need a delete trigger that removes the correct rows from the phones table whenever rows are deleted from the cust table. Figure 18.19 shows such a delete trigger. The transition variable oldrow provides access to the cust_id values of the rows deleted in the cust table. These values allow the trigger to delete the corresponding rows in the phones table that have the same cust_id value.

CREATE TRIGGER delete_cust
  AFTER DELETE ON cust
  REFERENCING OLD AS oldrow
  FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  DELETE FROM phones WHERE phones.cust_id = oldrow.cust_id;
END#

Figure 18.19 Delete trigger
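With the insert and delete triggers in place, a short smoke test can show whether the relational columns and the phones table follow the XML column as intended. The statements below are a sketch under the assumption that the tables from Figure 18.17 and the triggers from Figures 18.18 and 18.19 exist; the customer document is invented test data, and the statement terminator may need to be adapted to your CLP settings.

```sql
-- Hypothetical test document with two phone elements.
INSERT INTO cust(info) VALUES (XMLPARSE(DOCUMENT
  '<customerinfo>
     <name>Jane Example</name>
     <addr><city>Toronto</city></addr>
     <phone type="work">416-555-0101</phone>
     <phone type="home">416-555-0102</phone>
   </customerinfo>'));

-- The insert trigger should have filled name and city ...
SELECT cust_id, name, city FROM cust;

-- ... and added one row per phone element.
SELECT cust_id, type, number FROM phones;

-- Deleting the customer should fire the delete trigger,
-- which removes the matching rows from phones.
DELETE FROM cust WHERE name = 'Jane Example';

SELECT COUNT(*) AS remaining_phones FROM phones;
```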
18.3.3 Update Triggers on XML Columns
To complete our example, let’s examine the update trigger in Figure 18.20. It maintains the relational columns in the cust and phones tables whenever the info column in the cust table is updated. Note that an update of a customer document might have changed, added, or removed one or multiple phone elements. Thus, the only way to reliably update the phones table is to issue a DELETE followed by an INSERT statement. The UPDATE, DELETE, and INSERT statements in this trigger are the same as in the previous triggers.

CREATE TRIGGER update_cust
  AFTER UPDATE OF info ON cust
  REFERENCING NEW AS newrow
  FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  UPDATE cust
  SET (name, city) = (SELECT X.name, X.city
                      FROM cust, XMLTABLE('$INFO/customerinfo'
                             COLUMNS
                               name VARCHAR(30) PATH 'name',
                               city VARCHAR(20) PATH 'addr/city') AS X
                      WHERE cust.cust_id = newrow.cust_id )
  WHERE cust.cust_id = newrow.cust_id;

  DELETE FROM phones WHERE phones.cust_id = newrow.cust_id;

  INSERT INTO phones(cust_id, type, number)
    SELECT cust.cust_id, P.type, P.number
    FROM cust, XMLTABLE('$INFO/customerinfo/phone'
           COLUMNS
             type   VARCHAR(5)  PATH '@type',
             number VARCHAR(15) PATH '.') AS P
    WHERE cust.cust_id = newrow.cust_id;
END#

Figure 18.20 Update trigger

18.4 SUMMARY
Stored procedures, user-defined functions (UDFs), and triggers are very powerful tools to customize or automate data processing steps for your specific application. DB2 for Linux, UNIX, and Windows allows you to create stored procedures and UDFs with input parameters, output parameters, and variables of type XML. Such procedures and functions can contain XQuery and SQL/XML statements to query and manipulate XML data.

The benefit of using the XML data type for parameters and variables is that DB2 keeps the XML data internally in the pureXML parsed tree format. This format enables stored procedures and UDFs to process XML much more efficiently than a textual XML representation in VARCHAR or CLOB parameters would allow. For example, a UDF can read and manipulate data from an XML column without XML parsing because the data stays in DB2’s internal XML storage format. If an application passes an XML document to a stored procedure via an XML type parameter, the document is parsed only once upon entry into the procedure. Any subsequent processing steps within the procedure do not require XML parsing. Hence, the XML data type support in stored procedures and UDFs is a significant performance benefit for any custom XML processing logic that you implement.

You can also define triggers on tables with XML columns to implement automated actions that are executed when XML documents are inserted, deleted, or updated. In a trigger, transition variables give you access to the relational values of the affected rows, but not to the old or new value of an affected XML column. In the body of a trigger you can use the relational primary key values of the affected rows to find and access the corresponding XML documents in the table and perform any required operation on them.

Stored procedures have proven very useful for encapsulating and hiding XML processing from application programs. This reduces application complexity and improves end-to-end performance because SQL/XML statements in DB2 procedures can perform many XML processing tasks more efficiently and with less code than application programs.
Chapter 19
Performing Full-Text Search

XML applications and data can often be classified in one of two ways: predominantly data-centric or predominantly document- or content-centric. For example, the processing of orders, sales, or trades is typically data-centric while the management of contracts, emails, or news articles is document-centric. Content-centric XML documents often contain significant amounts of free-flow text, including full sentences and paragraphs. Such full text is rare in data-centric XML, which tends to contain atomic data values such as names, dates, prices, quantities, or addresses. Therefore, full-text search is more commonly required for querying content-centric XML than data-centric XML documents.

There are also applications that exhibit characteristics of both data- and document-oriented XML processing. In fact, it is a particular strength of XML to serve as a single format for any combination of data and content. For example, plain text comments can be part of an order, or a description can be part of a product detail record. Wherever individual data items consist of more than one word, and whenever you need to search for substring matches, full-text search can be the right solution.

The following topics are discussed in this chapter:

• Overview of full-text search capabilities in DB2 (section 19.1)
• Sample table and documents used in this chapter (section 19.2)
• The DB2 Net Search Extender (sections 19.3 through 19.5)
• DB2 Text Search (section 19.6)
• Summary of text search administration commands (section 19.7)
• Comments on full-text search in DB2 for z/OS (section 19.8)
19.1 OVERVIEW OF TEXT SEARCH IN DB2

DB2 offers two technologies to perform full-text search. Both of them handle plain text, HTML and XML data, as well as document formats such as PDF and Microsoft Word.

• The DB2 Net Search Extender (NSE) has been providing powerful text search capabilities since DB2 8 for Linux, UNIX, and Windows. The Net Search Extender is XML aware and fully functional with the new XML column type in DB2 9 and higher. The DB2 Net Search Extender continues to provide reliable and mature text search in DB2 with proven scalability and performance.
• DB2 Text Search is new text search functionality that is based on the technology in the open source project Lucene. The same technology is also used in IBM OmniFind Text Search Server for DB2 z/OS (see section 19.8). DB2 Text Search first became available in DB2 9.5 for Linux, UNIX, and Windows, Fixpack 1. Its features and performance continue to be improved in subsequent releases. DB2 Text Search in DB2 9.5 is just the beginning of integrating OmniFind text search capabilities into DB2 on all platforms.

In a given DB2 database you can use either the DB2 Net Search Extender or DB2 Text Search, not both. The DB2 Net Search Extender and DB2 Text Search can coexist in the same database instance, but only one of them can be enabled for a given database. You will find that many DB2 Text Search features and most of its administration commands are identical or similar to those of the DB2 Net Search Extender.

The DB2 Net Search Extender and DB2 Text Search have several design principles in common:

• A table in which one or multiple columns are indexed for text search must have a primary key. The primary key values of the table are used in the text index to correlate text search results from the text index back to the rows in the table. Consequently, the finest granularity of text search results is a row (a document).
• When a text index is created, triggers and a staging table (also known as a log table) are also automatically created in DB2. Any insert, update, or delete on the indexed table fires a trigger that in turn writes corresponding information about the data changes into the staging table. The content of this staging table is read to update the text index, and is subsequently deleted.
• Text indexes are maintained asynchronously; that is, not in the context of the original insert, update, or delete statements. Updates of the text index are either explicitly invoked with an UPDATE INDEX command, or they happen regularly on a predefined schedule.

Table 19.1 summarizes the most important commonalities and differences between the DB2 Net Search Extender and DB2 Text Search as of DB2 Version 9.5 Fixpack 1.
Table 19.1 Comparing the DB2 Net Search Extender and DB2 Text Search

Feature                                             DB2 Net Search Extender   DB2 Text Search
--------------------------------------------------  ------------------------  -----------------------
Separate Text Search Install                        Yes                       No, part of DB2 install
DPF Support                                         Yes (on AIX)              No
Command line interface                              Yes                       Yes
Administration also through the DB2 Control Center  Yes                       No
Administration also through stored procedures       No                        Yes
DB2 Backup includes text index                      No                        No
Asynchronous index updates                          Yes                       Yes
Synchronous index updates                           No                        No
Index updates: manual or scheduled                  Both                      Both
Document models, to index only a subsection         Yes                       No
  (part) of each XML document
Multiple text indexes per column                    Yes                       No
Indexes on views and nicknames                      Yes                       No
Stop words (avoid indexing irrelevant words,        Yes, optional             No
  such as "a", "or", and "the")
SQL function: contains                              Yes                       Yes
XQuery function: db2-fn:xmlcolumn-contains          No                        Yes
Support for XML namespaces                          Limited                   No
Can limit the result set size                       Yes                       Yes
Boolean search (and, or, and not operators for      Yes (and: &, or: |)       Yes (and: &&, or: ||)
  text predicates)
Wildcards in search predicates                      Yes                       Yes
Search with escape characters                       Yes                       Yes
Stemming (reduces search word to its base form)     Yes, optional             Yes, implicitly
Synonym search (Thesaurus)                          Yes                       Yes
Weighted search                                     Yes                       Yes
Fuzzy search                                        Yes                       No
Proximity search                                    Yes                       No
Ranking/scoring of result set items                 Yes                       Yes
Case-sensitive search                               Yes                       No
Linguistic processing (search for linguistic        English only              All supported languages
  variations of the search term)
19.2 SAMPLE TABLE AND DATA

In the remainder of this chapter we use the following sample table and data to illustrate the text search capabilities in DB2 (see Figure 19.1). You will see that it does not take magic to perform efficient XML full-text search in DB2.

CREATE TABLE orders (id INTEGER NOT NULL PRIMARY KEY, doc XML)

id  doc
--  ---------------------------------------------------------------
1   Wendy Witch
    Crystal Ball, Deluxe Edition 5 95.00 Customer requested extra wrapping.
    Magic Potion, 300ml flask 10 19.95 Await further shipping instructions.
2   William Wizard
    Magician's Hat, Black 1 75.00 Must be big enough for the rabbit.
    White Rabbit 1 295.00 Extra soft fur and extra white.

Figure 19.1 Sample table and data
Note that the second document contains a single quote in the name of the first item. This quote is not a problem if you import or load the document, or insert with a parameter marker. However, if you execute an insert statement in the DB2 Command Line Processor (CLP) with a literal XML document in the statement, a single quote in an XML value conflicts with the single quotes that enclose the document string. Hence, the first of the three insert statements in Figure 19.2 fails. You can escape the single quote either by using two single quotes or by using the corresponding entity reference (&apos;).

--incorrect:
INSERT INTO orders VALUES(1, 'Magician's Hat');

--correct:
INSERT INTO orders VALUES(2, 'Magician''s Hat');
INSERT INTO orders VALUES(3, 'Magician&apos;s Hat');

Figure 19.2 Inserting XML data with quotes in the CLP
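To see that both escaping styles store the same character data, you can insert one well-formed document of each kind and query for the unescaped value. The following sketch assumes the orders table from Figure 19.1; the id values 101 and 102 and the minimal order/item/name structure are chosen for illustration only.

```sql
-- Two single quotes: the SQL parser reduces them to one quote
-- before the XML parser sees the document.
INSERT INTO orders VALUES (101,
  '<order><item><name>Magician''s Hat</name></item></order>');

-- Entity reference: the XML parser resolves &apos; to a quote.
INSERT INTO orders VALUES (102,
  '<order><item><name>Magician&apos;s Hat</name></item></order>');

-- The quote inside the XQuery string literal is doubled for SQL;
-- both rows are expected to qualify.
SELECT id
FROM orders
WHERE XMLEXISTS('$d/order/item[name = "Magician''s Hat"]'
                PASSING doc AS "d");
```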
19.3 ENABLING A DATABASE FOR THE DB2 NET SEARCH EXTENDER

The DB2 Net Search Extender (NSE) requires a separate install in addition to the regular DB2 install. Appendix C, Further Reading, contains links to information about downloading and installing the NSE. After installation you can start and stop the Net Search Extender instance services much like you start and stop a DB2 server. You have to be the DB2 instance owner to issue the following commands at the OS prompt:

db2text start
db2text stop [force]

The optional keyword force can be used to forcibly stop the NSE even if there are processes still holding locks or if caching for an index is still activated. Be careful with the use of the force option. If you perform db2text stop force while an index update or reorg is in progress, the text index may get damaged and might have to be rebuilt entirely.

After starting the DB2 Net Search Extender instance services, the first step is to enable a database for text search. Execute the following command at the OS prompt:

db2text ENABLE DATABASE FOR TEXT CONNECT TO <dbname>

As for the majority of the db2text commands, you can optionally provide a user name and password for authentication to the database:

db2text ENABLE DATABASE FOR TEXT CONNECT TO <dbname> USER <user> USING <password>
The ENABLE DATABASE command creates UDFs, stored procedures, and the following tables and views in the default table space of the database:

• db2ext.dbdefaults: Contains default values for text search configuration parameters
• db2ext.textindexformats: Stores the list of supported index formats and the currently used document models
• db2ext.indexconfiguration: Stores index configuration parameters
• db2ext.textindexes: Keeps track of all text indexes

Similarly, you can disable the DB2 Net Search Extender for a database with the following command, which removes the NSE tables, views, and UDFs, and drops all NSE indexes for that database.

db2text DISABLE DATABASE FOR TEXT [force] CONNECT TO <dbname> USER <user> USING <password>
19.4 MANAGING FULL-TEXT INDEXES WITH THE DB2 NET SEARCH EXTENDER

The DB2 Net Search Extender allows you to define one or multiple text indexes per column. It also allows you to index only a certain section of each document instead of indexing all elements and attributes in a document. Such partial indexing leads to fewer index entries per document, smaller text indexes, and better index update and search performance. The following sections illustrate the CREATE INDEX command and its various options for the DB2 Net Search Extender.

19.4.1 Creating Basic Text Indexes

Issued at the OS command prompt, the following command creates a text index with the name orderIdx on the column doc in the table orders in the database <dbname>:

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         CONNECT TO <dbname> USER <user> USING <password>"
Depending on the operating system and configuration of your command shell, enclosing the command parameter for db2text in double quotes might be necessary, as shown in this example. Specifying a user name and a password for authentication to the database is optional. The table orders must have a primary key; otherwise, a text index cannot be created. The column doc must be of type XML or any character or binary column type, such as CHAR, VARCHAR, CLOB, BLOB, DBCLOB, GRAPHIC, or VARCHAR FOR BIT DATA. Unlike relational indexes in DB2, the CREATE INDEX statement for a text index defines an index but does not actually build the text index. An UPDATE INDEX command is required after the CREATE INDEX statement to perform the initial index build (see section 19.4.6).
For each text index, the Net Search Extender creates a log table and an event table as well as triggers on the user table. Upon insert, delete, update, or import of data, the triggers fire and write change information into the log table, which is later used to update the index. The event table contains information about index updates and potential problems, such as invalid document formats.

If you use the DB2 LOAD utility to move documents into your table, the triggers don’t fire and incremental indexing of the loaded documents does not happen. Therefore, it is recommended to use the DB2 IMPORT utility, which activates the triggers. If you insist on using LOAD for performance reasons, then it is your own responsibility to fill the log table appropriately before issuing the next UPDATE INDEX command.

The names of the log table and event table are system-generated. DB2 also creates views on these tables to allow easy inspection of the information. Use the SQL statement in Figure 19.3 to obtain the schema and view names for the index called orderIdx.

SELECT eventviewschema, eventviewname, logviewschema, logviewname
FROM db2ext.textindexes
WHERE indname = 'ORDERIDX'

Figure 19.3 Obtaining names of the event and log views for a given text index
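The trigger-and-log-table mechanism described above can be pictured with a small hand-written equivalent. The following is a conceptual sketch only: the table orders_textlog, its columns, and the trigger name are invented for illustration and are not the objects that the Net Search Extender actually generates.

```sql
-- Conceptual sketch; NSE creates its own log table and triggers
-- with system-generated names and a different layout.
CREATE TABLE orders_textlog(
  operation CHAR(1)   NOT NULL,  -- 'I', 'U', or 'D'
  id        INTEGER   NOT NULL,  -- primary key of the changed row
  changed   TIMESTAMP NOT NULL WITH DEFAULT CURRENT TIMESTAMP
)#

CREATE TRIGGER orders_textlog_ins
  AFTER INSERT ON orders
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  -- Record only which row changed; the document itself is read
  -- from the base table later, when the index is updated.
  INSERT INTO orders_textlog(operation, id) VALUES ('I', n.id);
END#
```

The sketch also shows why a primary key is mandatory on the indexed table: the key is the only link between a queued change and the document that has to be re-indexed.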
19.4.2 Creating Text Indexes with Specific Storage Paths

The previous examples used default locations for the text index and the index building work area. The work area is used to hold temporary files that are created when text indexes are built or updated. The default locations are defined in the table DB2EXT.DBDEFAULTS and are typically in /sqllib/db2ext/indexes. This default location is often not a good place for large text indexes. The command in Figure 19.4 specifies that the index is created in the file system /data/index while temporary NSE files are written to /data/temp. Additionally, the log and event tables are placed in the table space named nse_tspace instead of the default user table space.

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         INDEX DIRECTORY /data/index
         WORK DIRECTORY /data/temp
         ADMINISTRATION TABLES IN nse_tspace
         CONNECT TO <dbname>"

Figure 19.4 Text index with non-default storage locations
The DB2 instance owner needs to have read, write, and execute permissions for the index and the work directory. In a DPF system these directories have to exist on every physical node. For best performance, the index and work directories should be allocated on RAID arrays that allow high I/O throughput.
PERFORMANCE TIP When a text index is created or updated, potentially large amounts of data might have to be moved from the work directory to the index directory. If the index directory and the work directory are located in different file systems, then this move is an expensive copy operation. If the index and work directory are located within the same file system, an inexpensive rename operation can be performed instead of a copy. Hence, for best performance it is highly recommended that the index and work directory share the same file system.
The disk space required for an index depends on the amount and type of data that is being indexed and on the length of the primary key in the user table. Since the primary key is part of the index, short keys such as INTEGER or TIMESTAMP are preferable over long keys, such as CHAR(128). As a rule of thumb you should reserve at least 0.7 times as much space for the text index as the size of the data volume you want to index. The work area can require two to three times as much space as the raw data.
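As a rough worked example of this rule of thumb, the query below estimates the serialized size of the documents in the sample table and derives the suggested minimum index space and the possible range for the work area. It is only a sketch: the serialized length approximates the raw data volume, the CLOB(2M) limit is an assumption about the maximum document size, and the factors 0.7, 2, and 3 are simply the ones quoted above.

```sql
-- Approximate sizing based on the serialized document length.
SELECT documents,
       raw_bytes,
       DECIMAL(raw_bytes * 0.7, 31, 0) AS min_index_bytes,
       raw_bytes * 2                   AS work_area_bytes_low,
       raw_bytes * 3                   AS work_area_bytes_high
FROM (SELECT COUNT(*) AS documents,
             SUM(BIGINT(LENGTH(XMLSERIALIZE(doc AS CLOB(2M)))))
               AS raw_bytes
      FROM orders) AS t;
```

For instance, 100 GB of raw documents would suggest at least 70 GB for the text index and 200 to 300 GB for the work area.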
19.4.3 Creating Text Indexes with a Periodic Update Schedule

By default a text index is not updated automatically. You have to use the explicit UPDATE INDEX command whenever you want to refresh the text index, or configure the index for regularly scheduled index updates. The CREATE INDEX statement in Figure 19.5 defines a text index that is automatically refreshed four times a day. The string D(*)H(0,6,12,18)M(30) means that the index is updated every day at 0:30, 6:30, 12:30, and 18:30 hours.

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         UPDATE FREQUENCY D(*)H(0,6,12,18)M(30)
         CONNECT TO <dbname>"

Figure 19.5 Text index with automatic periodic updates
Alternatively, the string D(1,2,3,4,5)H(*)M(0,15,30,45) would mean that the index gets updated Monday through Friday every 15 minutes. You will see later that there is also an ALTER INDEX command in which you can use the UPDATE FREQUENCY clause to define or change automatic updates for existing indexes.

NOTE System load considerations and the time it takes for an index update to finish should be the guiding factors for choosing an appropriate update interval that is not too short. An update interval of one minute is almost always the wrong thing to do.

Depending on your application, you might want to avoid index maintenance at the scheduled times if there was only an insignificant number of changes to your data since the last time the index was updated. Figure 19.6 creates an index that is updated every 30 minutes if there are at least 50 document changes queued up in the log table. If there are fewer than 50 changes in the log table, the index is not updated. After 30 minutes, the scheduler checks again whether 50 or more changes have accumulated.

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         UPDATE FREQUENCY D(*)H(*)M(0, 30) UPDATE MINIMUM 50
         CONNECT TO <dbname>"

Figure 19.6 Text index with automatic updates when “enough” new rows are available
Such a combination of UPDATE FREQUENCY and UPDATE MINIMUM allows you to define an index update schedule in which the index is updated more frequently when there are many changes in the base table and less frequently if there are fewer changes. If omitted, the default value for UPDATE MINIMUM is 1.

Instead of updating the index incrementally you can also choose to always re-create the index from scratch. Figure 19.7 defines an index that is re-created entirely every night at 2 a.m.

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         UPDATE FREQUENCY D(*)H(2)M(0)
         RECREATE INDEX ON UPDATE
         CONNECT TO <dbname>"

Figure 19.7 Text index with automatic re-create
If you define an index with the RECREATE option, no log table and no triggers are created for this index. Use this option with caution as rebuilding a large text index can take a long time.

Note that the DB2 Control Center allows you to administer the DB2 Net Search Extender and to configure the update behavior of text indexes. When you right-click on a database name you are presented with the option to enable the database for text search. A right-click on the index folder of a database lets you create not only regular relational indexes but also text indexes. A multi-step wizard guides you through the text index definition and allows you to change default parameters such as index location and update characteristics. Figure 19.8 illustrates step 4 of the Create Text Index Wizard, where you can set the frequency of automatic updates. The settings selected in Figure 19.8 result in a CREATE INDEX statement with the clause UPDATE FREQUENCY D(1) H(3) M(30).
Figure 19.8 Create Text Index Wizard in the DB2 Control Center

19.4.4 Creating Text Indexes for Specific Parts of Each Document
When you define a text index on an XML column, the DB2 Net Search Extender creates index entries for all XML elements and attributes in the XML documents in the column. However, indexing all parts of the documents is not always necessary. Let’s look at the sample document in Figure 19.1. If you manage many “order” documents of this nature, you might want to perform full-text search on item names and comments. In that case, creating a full-text index on these elements is sufficient and leads to a much smaller index as compared to indexing all elements and attributes. A smaller index often allows better update and search performance. If you also need to perform queries with predicates on short data values (such as order date, customer name, item key, quantity, and price), you should use regular XML indexes.

With the Net Search Extender you can use document models to control which parts of the document structure are and aren’t indexed, and by which name you can refer to these parts in search queries. A document model itself is a small XML document in the file system. This model file is passed as a parameter to the CREATE INDEX command and is read during index creation only. Later changes to the document model do not affect existing indexes.
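For the short data values mentioned above, a regular XML index might look like the following sketch. The index names are made up, and the paths assume that quantity and price are child elements of /order/item in the sample documents.

```sql
-- Regular XML indexes for numeric predicates (not text indexes).
CREATE INDEX orderPriceIdx ON orders(doc)
  GENERATE KEY USING XMLPATTERN '/order/item/price' AS SQL DOUBLE;

CREATE INDEX orderQtyIdx ON orders(doc)
  GENERATE KEY USING XMLPATTERN '/order/item/quantity' AS SQL DOUBLE;

-- A numeric predicate like this one can be served by orderPriceIdx.
SELECT id
FROM orders
WHERE XMLEXISTS('$d/order/item[price > 100]' PASSING doc AS "d");
```

Such indexes are maintained synchronously by DB2, in contrast to the asynchronously maintained text indexes, so the two kinds of index complement each other.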
Figure 19.9 shows a simple document model for documents like the ones in Figure 19.1. This document model declares that only item names and comments are indexed. Every XML document model starts with the element XMLModel, which includes one or multiple XMLFieldDefinition elements. Each XMLFieldDefinition assigns a name to a locator. The locator is a simple XPath expression that defines which elements, attributes, or subtrees to index. The locator can contain XPath wildcards (*), namespace prefixes, the XPath union operator (|), and the XPath descendant-or-self axis, which is also known as the “double slash” (//).

Figure 19.9 A simple document model
If the document model is stored in the file itemModel.xml, then the following command defines a full-text index for item names and comments:

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         FORMAT XML DOCUMENTMODEL XMLModel IN itemModel.xml
         CONNECT TO <dbname>"
Note that you might have to specify a full file system path to the model file. The document model in Figure 19.10 declares that all elements under /order/item are indexed, except for the elements quantity and price, which are explicitly excluded. Depending on the actual data in the XML column, and on the existence of other elements under /order/item, this document model can index more information than the previous one in Figure 19.9. However, for the sample documents in Figure 19.1, both document models index exactly the item name and comment. We will later use these document models in text search queries.