DB2 pureXML Cookbook: Master the Power of the IBM Hybrid Data Server

Author: Matthias Nicola | Pav Kumar-Chatterjee

Related Books of Interest

DB2 9 for Linux, UNIX, and Windows DBA Guide, Reference, and Exam Prep, Sixth Edition
by George Baklarz and Paul C. Zikopoulos
ISBN: 0-13-185514-X

The sixth edition of this classic offers complete [...] DB2® 9 administration and development for Linux®, UNIX®, and Windows® platforms, as well as authoritative preparation for the latest IBM® exam. Written for both DBAs and developers, [...] covers all aspects of deploying and managing DB2 9, including DB2 database design and development; day-to-day administration and backup; deployment of networked, Internet-centered, and SOA-based applications; migration; and much more. [...] tips for optimizing performance, availability, and value.

Download Complete DB2 V9 Trial Version: Visit ibm.com/db2/9/download.html to download a complete trial version of DB2, which enables you to try out dozens of the most powerful features of DB2 for yourself – everything from pureXML™ support to automated administration and optimization.

Listen to the author’s podcast at: ibmpressbooks.com/podcasts

Understanding DB2: Learning Visually with Examples, Second Edition
by Raul F. Chong, Xiaomei Wang, Michael Dang, and Dwaine R. Snow
ISBN: 0-13-158018-3

IBM DB2 9 and DB2 9.5 provide breakthrough capabilities for providing Information on Demand, implementing Web services and Service Oriented Architecture, and streamlining information management. Understanding DB2: Learning Visually with Examples, Second Edition, is the easiest way to master the latest versions of DB2 and apply their full power to your business challenges. Written by four IBM DB2 experts, this book introduces key concepts with dozens of examples drawn from the authors’ experience working with DB2 in enterprise environments. Thoroughly updated for DB2 9.5, it covers new innovations ranging from manageability to performance and XML support to API integration. Each concept is presented with easy-to-understand screenshots, diagrams, charts, and tables. This book is for everyone who works with DB2: database administrators, system administrators, developers, and consultants. With hundreds of well-designed review questions and answers, it will also help professionals [...] 730, 731, or 736.

Listen to the author’s podcast at: ibmpressbooks.com/podcasts

Sign up for the monthly IBM Press newsletter at ibmpressbooks/newsletters

Understanding DB2 9 Security
by Rebecca Bond, Kevin Yeung-Kuen See, Carmen Ka Man Wong, and Yuk-Kuen Henry Chan
ISBN: 0-13-134590-7

Understanding DB2 9 Security is a comprehensive guide to securing DB2 and leveraging the powerful new security features of DB2 9. Direct from a DB2 Security deployment expert and the IBM DB2 development team, this book gives DBAs and their managers a wealth of security information that is available nowhere else. It presents real-world implementation scenarios, step-by-step examples, and expert guidance on both the technical and human sides of DB2 security. This book’s material is organized to support you through every step of securing DB2 in Windows, Linux, or UNIX environments. You’ll start by exploring the regulatory and business issues driving your security efforts, and then master the technological and managerial knowledge crucial to effective implementation. Next, the authors offer practical guidance on post-implementation auditing, and show how to systematically maintain security on an ongoing basis.

Mining the Talk: Unlocking the Business Value in Unstructured Information
by Scott Spangler and Jeffrey Kreulen
ISBN: 0-13-233953-6

In Mining the Talk, two leading-edge IBM researchers introduce a revolutionary new approach to unlocking the business value hidden in virtually any form of unstructured data – from word processing documents to websites, emails to instant messages. The authors review the business drivers that have made unstructured data so important and explain why conventional methods for working with it are inadequate. Then, writing for business professionals – not just data mining specialists – they walk step-by-step through exploring your unstructured data, understanding it, and analyzing it effectively.

[...] key areas: learning from your customer interactions; hearing the voices of customers when they’re not talking to you; discovering the “collective consciousness” of your own organization; enhancing innovation; and spotting emerging trends. Whatever your organization, Mining the Talk offers you breakthrough opportunities to become more responsive, agile, and competitive.

Listen to the author’s podcast at: ibmpressbooks.com/podcasts

Visit ibmpressbooks.com for all product information

Enterprise Master Data Management
by Allen Dreibelbis, Eberhard Hechler, Ivan Milman, Martin Oberhofer, Paul van Run, and Dan Wolfson
ISBN: 0-13-236625-8

Enterprise Master Data Management provides an authoritative, vendor-independent MDM technical reference for practitioners: architects, technical analysts, consultants, solution designers, and senior IT decision makers. Written by the IBM® data management innovators who are pioneering MDM, this book systematically introduces MDM’s key concepts and technical themes, explains its business case, and illuminates how it interrelates with and enables SOA. Drawing on their experience with cutting-edge projects, the authors introduce MDM patterns, blueprints, solutions, and best practices published nowhere else—everything you need to establish a consistent, manageable set of master data, and use it for competitive advantage.

An Introduction to IMS
Meltz, Long, Harrington, Hain, Nicholls
ISBN: 0-13-185671-5

A Practical Guide to Trusted Computing
Challener, Yoder, Catherman, Safford, Van Doorn
ISBN: 0-13-239842-7

Mainframe Basics for Security Professionals
Pomerantz, Weele, Nelson, Hahn
ISBN: 0-13-173856-9

Service-Oriented Architecture (SOA) Compass
Bieberstein, Bose, Fiammante, Jones, Shah
ISBN: 0-13-187002-5

WebSphere Business Integration Primer
Iyengar, Jessani, Chilanti
ISBN: 0-13-224831-X

Outside-in Software Development
Kessler, Sweitzer
ISBN: 0-13-157551-1

DB2® pureXML® Cookbook

Master the Power of the IBM® Hybrid Data Server

Matthias Nicola
Pav Kumar-Chatterjee

IBM Press Pearson plc Upper Saddle River, NJ • Boston • Indianapolis • San Francisco New York • Toronto • Montreal • London • Munich • Paris • Madrid Cape Town • Sydney • Tokyo • Singapore • Mexico City Ibmpressbooks.com

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

Before you use any IBM or non-IBM or open-source product mentioned in this book, make sure that you accept and adhere to the licenses and terms and conditions for any such product.

© Copyright 2010 by International Business Machines Corporation. All rights reserved.

Note to U.S. Government Users: Documentation related to restricted right. Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corporation.

IBM Press Program Managers: Steven M. Stansel, Ellice Uffer
Cover design: IBM Corporation
Associate Publisher: Greg Wiegand
Marketing Manager: Kourtnaye Sturgeon
Publicist: Heather Fox
Acquisitions Editor: Bernard Goodwin
Managing Editor: Kristy Hart
Designer: Alan Clements
Project Editor: Andy Beaster
Copy Editor: Paula Lowell
Senior Indexer: Cheryl Lenser
Compositor: Gloria Schurick
Proofreader: Leslie Joseph
Manufacturing Buyer: Dan Uhrig

Published by Pearson plc
Publishing as IBM Press

IBM Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact: U.S. Corporate and Government Sales 1-800-382-3419 [email protected]. For sales outside the U.S., please contact: International Sales [email protected].

The following terms are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: IBM, the IBM logo, IBM Press, DB2, pureXML, z/OS, ibm.com, WebSphere, System z, developerWorks, InfoSphere, DRDA, Rational, AIX, OmniFind, i5/OS, Lotus, and DataPower. Microsoft, Windows, Microsoft Word, Microsoft Visual Studio, Visual Basic, and Visual C# are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc., in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.

Library of Congress Cataloging-in-Publication Data

Nicola, Matthias.
DB2 PureXML cookbook : master the power of IBM’s hybrid data server / Matthias Nicola and Pav Kumar-Chatterjee.
p. cm.
Includes indexes.
ISBN-13: 978-0-13-815047-1 (hardback : alk. paper)
ISBN-10: 0-13-815047-8 (hardback : alk. paper)
1. IBM Database 2. 2. XML (Document markup language) 3. Database management. I. Kumar-Chatterjee, Pav. II. Title.
QA76.9.D3N525 2009
006.7’4—dc22
2009020222

All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise.

For information regarding permissions, write to:
Pearson Education, Inc.
Rights and Contracts Department
501 Boylston Street, Suite 900
Boston, MA 02116
Fax (617) 671 3447

ISBN-13: 978-0-13-815047-1
ISBN-10: 0-13-815047-8

Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan.
First printing August 2009

I would like to dedicate this book to Scott and Carrie in the hope that it will inspire them to work hard at school and to my mother who did not see the final version, but who gave me unconditional support as only a mother can.

—Pav Kumar-Chatterjee

Contents

Chapter 1 Introduction
1.1 Anatomy of an XML Document
1.2 Differences Between XML and Relational Data
1.3 Overview of DB2 pureXML
1.4 Benefits of DB2 pureXML over Alternative Storage Options for XML Data
1.5 XML Solutions to Relational Data Model Problems
1.5.1 When the Schema Is Volatile
1.5.2 When Data Is Inherently Hierarchical in Nature
1.5.3 When Data Represents Business Objects
1.5.4 When Objects Have Sparse Attributes
1.5.5 When Data Needs to be Exchanged
1.6 Summary

Chapter 2 Designing XML Data and Applications
2.1 Choosing Between XML Elements and XML Attributes
2.2 XML Tags versus Values
2.3 Choosing the Right Document Granularity
2.4 Using a Hybrid XML/Relational Approach
2.5 Summary

Chapter 3 Designing and Managing XML Storage Objects
3.1 Understanding XML Document Trees
3.2 Understanding pureXML Storage
3.3 XML Storage in DB2 for Linux, UNIX, and Windows
3.3.1 Storage Objects for XML Data
3.3.2 Defining Columns, Tables, and Table Spaces for XML Data
3.3.3 Dropping XML Columns
3.3.4 Improved XML Storage Format in DB2 9.7
3.4 Using XML Base Table Row Storage (Inlining)
3.4.1 Monitoring and Configuring XML Inlining
3.4.2 Potential Benefits and Drawbacks of XML Inlining
3.5 Compressing XML Data
3.6 Examining XML Storage Space Consumption
3.7 Reorganizing XML Data and Indexes
3.8 Understanding XML Space Management: A Comprehensive Example
3.9 XML in Range Partitioned Tables and MDC Tables
3.9.1 XML and Range Partitioning
3.9.2 XML and Multidimensional Clustering
3.10 XML in a Partitioned Database (DPF)
3.11 XML Storage in DB2 for z/OS

3.11.1 Storage Objects for XML Data
3.11.2 Characteristics of XML Table Spaces
3.11.3 Tables with Multiple XML Columns
3.11.4 Naming and Storage Conventions
3.12 Utilities for XML Objects in DB2 for z/OS
3.12.1 REPORT TABLESPACESET for XML
3.12.2 Reorganizing XML Data in DB2 for z/OS
3.12.3 CHECK DATA for XML
3.13 XML Parsing and Memory Consumption in DB2 for z/OS
3.13.1 Controlling the Memory Consumption of XML Operations
3.13.2 Redirecting XML Parsing to zIIP and zAAP
3.14 Summary

Chapter 4 Inserting and Retrieving XML Data
4.1 Inserting XML Documents
4.1.1 Simple Insert Statements
4.1.2 Reading XML Documents from Files or URLs
4.2 Deleting XML Documents
4.3 Retrieving XML Documents
4.4 Handling Documents with XML Declarations
4.5 Copying Full XML Documents
4.6 Dealing with XML Special Characters
4.7 Understanding XML Whitespace and Document Storage
4.7.1 Preserving XML Whitespace
4.7.2 Changing the Whitespace Default from “Strip” to “Preserve”
4.7.3 Storing XML Documents for Compliance
4.8 Summary

Chapter 5 Moving XML Data
5.1 Exporting XML Data in DB2 for Linux, UNIX, and Windows
5.1.1 Exporting XML Documents to a Single File
5.1.2 Exporting XML Documents as Individual Files
5.1.3 Exporting XML Documents as Individual Files with Non-Default Names
5.1.4 Exporting XML Documents to One or Multiple Dedicated Directories
5.1.5 Exporting Fragments of XML Documents
5.1.6 Exporting XML Data with XML Schema Information
5.2 Importing XML Data in DB2 for Linux, UNIX, and Windows
5.2.1 IMPORT Command and Input Files
5.2.2 Import/Insert Performance Tips
5.3 Loading XML Data in DB2 for Linux, UNIX, and Windows
5.4 Unloading XML Data in DB2 for z/OS
5.5 Loading XML Data in DB2 for z/OS
5.6 Validating XML Documents during Load and Insert Operations
5.7 Splitting Large XML Documents into Smaller Documents
5.8 Replicating and Publishing XML Data

5.9 Federating XML Data
5.10 Managing XML Data with HADR
5.11 Handling XML Data in db2look and db2move
5.12 Summary

Chapter 6 Querying XML Data: Introduction and XPath
6.1 An Overview of Querying XML Data
6.2 Understanding the XQuery and XPath Data Model
6.2.1 Sequences
6.2.2 Sequence in, Sequence out
6.3 Sample Data for XPath, SQL/XML, and XQuery
6.4 Introduction to XPath
6.4.1 Analogy Between XPath and Navigating a File System
6.4.2 Simple XPath Queries
6.5 How to Execute XPath in DB2
6.6 Wildcards and Double Slashes
6.7 XPath Predicates
6.8 Existential Semantics
6.9 Logical Expressions with and, or, not()
6.10 The Current Context and the Parent Step
6.11 Positional Predicates
6.12 Union and Construction of Sequences
6.13 XPath Functions
6.14 General and Value Comparisons
6.15 XPath Axes and Unabbreviated Syntax
6.16 Summary

Chapter 7 Querying XML Data with SQL/XML
7.1 Overview of SQL/XML
7.2 Retrieving XML Documents or Document Fragments with XMLQUERY
7.2.1 Referencing XML Columns in SQL/XML Functions
7.2.2 Retrieving Element Values Without XML Tags
7.2.3 Retrieving Repeating Elements with XMLQUERY
7.3 Retrieving XML Values in Relational Format with XMLTABLE
7.3.1 Generating Rows and Columns from XML Data
7.3.2 Dealing with Missing Elements
7.3.3 Avoiding Type Errors
7.3.4 Retrieving Repeating Elements with XMLTABLE
7.3.5 Numbering XMLTABLE Rows Based on Repeating Elements
7.3.6 Retrieving Multiple Repeating Elements at Different Levels
7.4 Using XPath Predicates in SQL/XML with XMLEXISTS
7.5 Common Mistakes with SQL/XML Predicates
7.6 Using Parameter Markers or Host Variables
7.7 XML Queries with Dynamically Computed XPath Expressions

7.8 Ordering a Query Result Set Based on XML Values
7.9 Converting XML Values to Binary SQL Types
7.10 Summary

Chapter 8 Querying XML Data with XQuery
8.1 XQuery Overview
8.2 Processing XML Data with FLWOR Expressions
8.2.1 Anatomy of a FLWOR Expression
8.2.2 Understanding the for and let Clauses
8.2.3 Understanding the where and order by Clauses
8.2.4 FLWOR Expressions with Multiple for and let Clauses
8.3 Comparing FLWOR Expressions, XPath Expressions, and SQL/XML
8.3.1 Traversing XML Documents
8.3.2 Using XML Predicates
8.3.3 Result Set Cardinalities in XQuery and SQL/XML
8.3.4 Using FLWOR Expressions in SQL/XML
8.4 Constructing XML Data
8.4.1 Constructing Elements with Computed Values
8.4.2 Constructing XML Data with Predicates and Conditions
8.4.3 Constructing Documents with Multiple Levels of Nesting
8.4.4 Constructing Documents with XML Aggregation in SQL/XML Queries
8.5 Data Types, Cast Expressions, and Type Errors
8.6 Arithmetic Expressions
8.7 XQuery Functions
8.7.1 String Functions
8.7.2 Number and Aggregation Functions
8.7.3 Sequence Functions
8.7.4 Namespace and Node Functions
8.7.5 Date and Time Functions
8.7.6 Boolean Functions
8.8 Embedding SQL in XQuery
8.9 Using SQL Functions and User-Defined Functions in XQuery
8.10 Summary

Chapter 9 Querying XML Data: Advanced Queries & Troubleshooting
9.1 Aggregation and Grouping of XML Data
9.1.1 Aggregation and Grouping Queries with XMLTABLE
9.1.2 Aggregation of Values within and across XML Documents
9.1.3 Grouping Queries in SQL/XML versus XQuery
9.2 Join Queries with XML Data
9.2.1 XQuery Joins between XML Columns
9.2.2 SQL/XML Joins between XML Columns
9.2.3 Joins between XML and Relational Columns
9.2.4 Outer Joins between XML Columns

9.3 Case-Insensitive XML Queries
9.4 How to Avoid “Bad” Queries
9.4.1 Construction of Excessively Large Documents
9.4.2 “Between” Predicates on XML Data
9.4.3 Large Global Sequences
9.4.4 Multilevel Nesting SQL and XQuery
9.5 Common Errors and How to Avoid Them
9.5.1 SQL16001N
9.5.2 SQL16002N
9.5.3 SQL16003N
9.5.4 SQL16005N
9.5.5 SQL16015N
9.5.6 SQL16011N
9.5.7 SQL16061N
9.5.8 SQL16075N
9.6 Summary

Chapter 10 Producing XML from Relational Data
10.1 SQL/XML Publishing Functions
10.1.1 Constructing XML Elements from Relational Data
10.1.2 NULL Values, Missing Elements, and Empty Elements
10.1.3 Constructing XML Attributes from Relational Data
10.1.4 Constructing XML Documents from Multiple Relational Rows
10.1.5 Constructing XML Documents from Multiple Relational Tables
10.1.6 Comparing XMLAGG, XMLCONCAT, and XMLFOREST
10.1.7 Conditional Element Construction
10.1.8 Leading Zeros in Constructed Elements and Attributes
10.1.9 Default Tagging of Relational Data with XMLROW and XMLGROUP
10.1.10 GUI-Based Definition of SQL/XML Publishing Queries
10.1.11 Constructing Comments, Processing Instructions, and Text Nodes
10.1.12 Legacy Functions
10.2 Using XQuery Constructors with Relational Input
10.3 XML Declarations for Constructed XML Data
10.4 Inserting Constructed XML Data into XML Columns
10.5 Summary

Chapter 11 Converting XML to Relational Data
11.1 Advantages and Disadvantages of Shredding
11.2 Shredding with the XMLTABLE Function
11.2.1 Hybrid XML Storage
11.2.2 Relational Views over XML Data
11.3 Shredding with Annotated XML Schemas
11.3.1 Annotating an XML Schema
11.3.2 Defining Schema Annotations Visually in IBM Data Studio

11.3.3 Registering an Annotated Schema
11.3.4 Decomposing One XML Document at a Time
11.3.5 Decomposing XML Documents in Bulk
11.4 Summary

Chapter 12 Updating and Transforming XML Documents
12.1 Replacing a Full XML Document
12.2 Modifying Documents with XQuery Updates
12.3 Updating the Value of an XML Node in a Document
12.3.1 Replacing an Element Value
12.3.2 Replacing an Attribute Value
12.3.3 Replacing a Value Using a Parameter Marker
12.3.4 Replacing Multiple Values in a Document
12.3.5 Replacing an Existing Value with a Computed Value
12.4 Replacing XML Nodes in a Document
12.5 Deleting XML Nodes from a Document
12.6 Renaming Elements or Attributes in a Document
12.7 Inserting XML Nodes into a Document
12.7.1 Defining the Position of Inserted Elements
12.7.2 Defining the Position of Inserted Attributes
12.7.3 Insert Examples
12.8 Handling Repeating and Missing Nodes
12.9 Modifying Multiple XML Nodes in the Same Document
12.9.1 Snapshot Semantics and Conflict Situations
12.9.2 Converting Elements to Attributes and Vice Versa
12.10 Modifying XML Documents in Queries
12.11 Modifying XML Documents in Insert Operations
12.12 Modifying XML Documents in Update Cursors
12.13 XML Updates in DB2 for z/OS
12.14 Transforming XML Documents with XSLT
12.14.1 The XSLTRANSFORM Function
12.14.2 XML to HTML Transformation
12.15 Summary

Chapter 13 Defining and Using XML Indexes
13.1 Defining XML Indexes
13.1.1 Unique XML Indexes
13.1.2 Lean XML Indexes
13.1.3 Using the DB2 Control Center to Create XML Indexes
13.2 XML Index Data Types
13.2.1 VARCHAR(n)
13.2.2 VARCHAR HASHED
13.2.3 DOUBLE and DECFLOAT
13.2.4 DATE and TIMESTAMP

13.2.5 Choosing a Suitable Index Data Type
13.2.6 Rejecting Invalid Values
13.3 Using XML Indexes to Evaluate Query Predicates
13.3.1 Understanding Index Eligibility
13.3.2 Data Types in XML Indexes and Query Predicates
13.3.3 Text Nodes in XML Indexes and Query Predicates
13.3.4 Wildcards in XML Indexes and Query Predicates
13.3.5 Using Indexes for Structural Predicates
13.4 XML Indexes and Join Predicates
13.5 XML Indexes on Non-Leaf Elements
13.6 Special Cases Where XML Indexes Cannot be Used
13.6.1 Special Cases with XMLQUERY
13.6.2 Parent Steps
13.6.3 The let and return Clauses
13.7 XML Index Internals
13.7.1 XML Index Keys
13.7.2 Logical and Physical XML Indexes
13.8 XML Index Statistics
13.9 Summary

Chapter 14 XML Performance and Monitoring
14.1 Explaining XML Queries in DB2 for Linux, UNIX, and Windows
14.1.1 The Explain Tables in DB2 for Linux, UNIX, and Windows
14.1.2 Using db2exfmt to Obtain Access Plans
14.1.3 Using Visual Explain to Display Access Plans
14.1.4 Access Plan Operators
14.1.5 Understanding and Analyzing XML Query Execution Plans
14.2 Explaining XML Queries in DB2 for z/OS
14.2.1 The Explain Tables in DB2 for z/OS
14.2.2 Obtaining Access Plan Information in SPUFI
14.2.3 Using Visual Explain to Display Access Plans
14.2.4 Access Plan Operators
14.2.5 Understanding and Analyzing XML Query Execution Plans
14.3 Statistics Collection for XML Data
14.3.1 Statistics Collection for XML Data in DB2 for z/OS
14.3.2 Statistics Collection for XML Data in DB2 for Linux, UNIX, and Windows
14.3.3 Examining XML Statistics with db2cat
14.4 Monitoring XML Activity
14.4.1 Using the Snapshot Monitor in DB2 for Linux, UNIX, and Windows
14.4.2 Monitoring Database Utilities
14.5 Best Practices for XML Performance
14.5.1 XML Document Design
14.5.2 XML Storage

14.5.3 XML Queries
14.5.4 XML Indexes
14.5.5 XML Updates
14.5.6 XML Schemas
14.5.7 XML Applications
14.6 Summary

Chapter 15 Managing XML Data with Namespaces
15.1 Introduction to XML Namespaces
15.1.1 Namespace Declarations in XML Documents
15.1.2 Default Namespaces
15.2 Exploring Namespaces in XML Documents
15.3 Querying XML Data with Namespaces
15.3.1 Declaring Namespaces in XML Queries
15.3.2 Using Namespace Declarations in SQL/XML Queries
15.3.3 Using Namespaces in the XMLTABLE Function
15.3.4 Dealing with Multiple Namespaces per Document
15.4 Creating Indexes for XML Data with Namespaces
15.5 Constructing XML Data with Namespaces
15.5.1 SQL/XML Publishing Functions and Namespaces
15.5.2 XQuery Constructors and Namespaces
15.6 Updating XML Data with Namespaces
15.6.1 Updating Values in Documents with Namespaces
15.6.2 Renaming Nodes in Documents with Namespace Prefixes
15.6.3 Renaming Nodes in Documents with Default Namespaces
15.6.4 Inserting and Replacing Nodes in Documents with Namespaces
15.7 Summary

Chapter 16 Managing XML Schemas
16.1 Introduction to XML Schemas and Their Usage
16.1.1 Valid Versus Well-Formed XML Documents
16.1.2 To Validate or Not to Validate, That Is the Question!
16.1.3 Custom Versus Industry Standard XML Schemas
16.2 Anatomy of an XML Schema
16.3 An XML Schema with Include and Import
16.4 Registering XML Schemas
16.4.1 Registering XML Schemas in the DB2 Command Line Processor
16.4.2 Registering XML Schemas from Applications via Stored Procedures
16.4.3 Registering XML Schemas from Java Applications via JDBC
16.4.4 Two XML Schemas Sharing a Common Schema Document
16.4.5 Error Situations and How to Resolve Them
16.5 Removing XML Schemas from the Schema Repository

16.6 XML Schema Evolution
16.6.1 Schema Evolution Without Document Validation
16.6.2 Generic Schema Evolution with Document Validation
16.6.3 Compatible Schema Evolution with the UPDATE XMLSCHEMA Command
16.7 Granting and Revoking XML Schema Usage Privileges
16.8 Document Type Definitions (DTDs) and External Entities
16.9 Browsing the XML Schema Repository (XSR)
16.9.1 Tables and Views of the XML Schema Repository
16.9.2 Queries against the XML Schema Repository
16.10 XML Schema Considerations in DB2 for z/OS
16.11 Summary

Chapter 17 Validating XML Documents against XML Schemas
17.1 Document Validation Upon Insert
17.2 Document Validation Upon Update
17.3 Validation without Rejecting Invalid Documents
17.4 Enforcing Validation with Check Constraints
17.5 Automatic Validation with Triggers
17.6 Diagnosing Validation and Parsing Errors
17.7 Validation during Load and Import Operations
17.7.1 Validation against a Single XML Schema
17.7.2 Validation against Multiple XML Schemas
17.7.3 Using a Default XML Schema
17.7.4 Overriding XML Schema References
17.7.5 Validation Based on schemaLocation Attributes
17.8 Checking Whether an Existing Document Has Been Validated
17.9 Validating Existing Documents in a Table
17.10 Finding the XML Schema for a Validated Document
17.11 How to Undo Document Validation
17.12 Considerations for Validation in DB2 for z/OS
17.12.1 Document Validation Upon Insert
17.12.2 Document Validation Upon Update
17.12.3 Validating Existing Documents in a Table
17.12.4 Summary of Platform Similarities and Differences
17.13 Summary

Chapter 18 Using XML in Stored Procedures, UDFs, and Triggers
18.1 Manipulating XML in SQL Stored Procedures
18.1.1 Basic XML Manipulation in Stored Procedures
18.1.2 A Stored Procedure to Store XML in a Hybrid Manner
18.1.3 Loops and Cursors
18.1.4 A Stored Procedure to Update a Selected XML Element or Attribute
18.1.5 Three Tips for Testing Stored Procedures

18.2 Manipulating XML in User-Defined Functions
18.2.1 A UDF to Extract an Element or Attribute Value
18.2.2 A UDF to Extract the Values of a Repeating Element
18.2.3 A UDF to Shred XML Data to a Relational Table
18.2.4 A UDF to Modify an XML Document
18.3 Manipulating XML Data with Triggers
18.3.1 Insert Triggers on Tables with XML Columns
18.3.2 Delete Triggers on Tables with XML Columns
18.3.3 Update Triggers on XML Columns
18.4 Summary

Chapter 19 Performing Full-Text Search
19.1 Overview of Text Search in DB2
19.2 Sample Table and Data
19.3 Enabling a Database for the DB2 Net Search Extender
19.4 Managing Full-Text Indexes with the DB2 Net Search Extender
19.4.1 Creating Basic Text Indexes
19.4.2 Creating Text Indexes with Specific Storage Paths
19.4.3 Creating Text Indexes with a Periodic Update Schedule
19.4.4 Creating Text Indexes for Specific Parts of Each Document
19.4.5 Creating Text Indexes with Advanced Options
19.4.6 Updating and Reorganizing Text Indexes
19.4.7 Altering Text Indexes
19.5 Performing XML Full-Text Search with the DB2 Net Search Extender
19.5.1 Full-Text Search in SQL and XQuery
19.5.2 Full-Text Search with Boolean Operators
19.5.3 Full-Text Search with Custom Document Models
19.5.4 Advanced Search with Proximity, Fuzzy, and Stemming Options
19.5.5 Finding the Correct Match within an XML Document
19.5.6 Search Conditions on Sibling Branches of an XML Document
19.5.7 Text Search in the Presence of Namespaces
19.6 DB2 Text Search
19.6.1 Enabling a Database for DB2 Text Search
19.6.2 Creating and Maintaining Full-Text Indexes for DB2 Text Search
19.6.3 Writing DB2 Text Search Queries for XML Data
19.6.4 Full-Text Search with XPath Expressions
19.6.5 Full-Text Search with Wildcards
19.7 Summary of Text Search Administration Commands
19.8 XML Full-Text Search in DB2 for z/OS
19.9 Summary

Chapter 20 Understanding XML Data Encoding
20.1 Understanding Internal and External XML Encoding
20.1.1 Internally Encoded XML Data
20.1.2 Externally Encoded XML Data
20.2 Avoiding Code Page Conversions
20.3 Using Non-Unicode Databases for XML
20.4 Examples of Code Page Issues
20.4.1 Example 1: Chinese Characters in a Non-Unicode Code Page ISO-8859-1
20.4.2 Example 2: Fetching Data from a Non-Unicode Code Database into a Character Type Application Variable
20.4.3 Example 3: Encoding Issues with XMLTABLE and XMLCAST
20.4.4 Example 4: Japanese Literal Values in a Non-Unicode Database
20.4.5 Example 5: Data Expansion and Shrinkage Due to Code Page Conversion
20.5 Avoiding Data Loss and Encoding Errors in Non-Unicode Databases
20.6 Summary

Chapter 21 Developing XML Applications with DB2
21.1 The Value of DB2 pureXML for Application Development
21.1.1 Avoid XML Parsing in the Application Layer
21.1.2 Storing Business Objects in an Intuitive Format
21.1.3 Rapid Prototyping
21.1.4 Responding Quickly to Changing Business Needs
21.2 Using Parameter Markers or Host Variables
21.3 Java Applications
21.3.1 XML Support in JDBC 3.0
21.3.2 XML Support in JDBC 4.0
21.3.3 Comprehensive Example of Manipulating XML Data with JDBC 4.0
21.3.4 Creating XML Documents from Application Data
21.3.5 Binding XML Data to Java Objects
21.3.6 IBM pureQuery
21.4 .NET Applications
21.4.1 Querying XML Data in .NET Applications
21.4.2 Manipulating XML Data in .NET Applications
21.4.3 Inserting XML Data from .NET Applications
21.4.4 XML Schema and DTD Handling in .NET Applications
21.5 CLI Applications
21.6 Embedded SQL Applications
21.6.1 COBOL Applications with Embedded SQL
21.6.2 PL/1 Applications with Embedded SQL
21.6.3 C Applications with Embedded SQL
21.7 PHP Applications

xxii

DB2 ® pureXML® Cookbook: Master the Power of the IBM® Hybrid Data Server

21.8 Perl Applications 21.9 XML Application Development Tools 21.9.1 IBM Data Studio Developer 21.9.2 IBM Database Add-ins for Visual Studio 21.9.3 Altova XML Tools 21.9.4 21.9.5 Stylus Studio 21.10 Summary

Chapter 22 Exploring XML Information in the DB2 Catalog
  22.1 XML-Related Catalog Information in DB2 for Linux, UNIX, and Windows
    22.1.1 Catalog Information for XML Columns
    22.1.2 The XML Strings and Paths Tables
    22.1.3 The Internal XML Regions and Path Indexes
    22.1.4 Catalog Information for User-Defined XML Indexes
    22.1.5 Catalog Information for XML Schemas
  22.2 XML-Related Catalog Information in DB2 for z/OS
    22.2.1 Catalog Information for XML Storage Objects
    22.2.2 Catalog Information for XML Indexes
    22.2.3 Catalog Information for XML Schemas
  22.3 Summary

Chapter 23 Test Your Knowledge—The DB2 pureXML Quiz
  23.1 Designing XML Data and Applications
  23.2 Designing and Managing Storage Objects for XML
  23.3 Inserting and Retrieving XML Data
  23.4 Moving XML Data
  23.5 Querying XML
  23.6 Producing XML from Relational Data
  23.7 Converting XML to Relational Data
  23.8 Updating and Transforming XML Documents
  23.9 Defining and Using XML Indexes
  23.10 XML Performance and Monitoring
  23.11 Managing XML Data with Namespaces
  23.12 XML Schemas and Validation
  23.13 Performing Full-Text Search
  23.14 XML Application Development
  23.15 Answers

Appendix A Getting Started with DB2 pureXML
  A.1 Exploring the Structure of XML Documents
    A.1.1 Exploring XML Documents in the DB2 Control Center
    A.1.2 Exploring XML Documents in the CLP
    A.1.3 Exploring XML Documents in SPUFI
  A.2 Tips for Running XML Operations in the CLP

Appendix B The XML Sample Database
  B.1 XML Sample Database on DB2 for Linux, UNIX, and Windows
  B.2 XML Sample Tables on DB2 for z/OS
  B.3 Table customer—Column info
  B.4 Table product—Column description
  B.5 Table purchaseorder—Column porder

Appendix C Further Reading
  C.1 General Resources for All Chapters
  C.2 Chapter-Specific Resources
  C.3 Resources on the Integration of DB2 pureXML with Other Products

Index

Foreword

In the years since E.F. Codd’s groundbreaking work in the 1970s, relational database systems have become ubiquitous in the business world. Today, most of the world’s business data is stored in the rows and columns of relational databases. The relational model is ideally suited to applications in which data has a relatively simple and uniform structure, and in which database structure evolves much more slowly than data values.

With the advent of the Web, however, big changes began to occur in the database world, driven by globalization and by dramatic reductions in the cost of storing, transmitting, and processing data. Today, businesses are globally interconnected and exchange large volumes of data with customers, suppliers, and governments. Much of this data consists of things that do not fit neatly into rows and columns, such as medical records, legal documents, incident reports, tax returns, and purchase orders. The new kinds of data tend to be more heterogeneous than traditional business data, having more variation and a more rapidly evolving structure.

In response to the changing requirements of business data, a new generation of standards has appeared. XML has emerged as an international standard for the exchange of self-describing data, unifying structured, unstructured, and semi-structured information formats. XML Schema has been adopted as the metadata syntax for describing the structure of XML documents. Industry-specific XML schemas have been developed for medical, insurance, retail, publishing, banking, and other industries. XPath and XQuery have been adopted as standard languages for retrieving and manipulating data in XML format, and new facilities have been added to the SQL standard for interfacing between relational and XML data.

In DB2, the new generation of XML-related standards is reflected in pureXML, a broad new set of XML functionality implemented in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. pureXML bridges the gap between the XML and relational worlds and makes DB2 a true hybrid database management system. DB2 pureXML stores and indexes XML data alongside relational data in a highly efficient new storage format, and supports XML query languages such as XPath and XQuery alongside the traditional SQL. pureXML is perhaps the largest new package of functionality in the history of DB2, impacting nearly every aspect of the system.
The implementation of pureXML required deep changes in the database kernel, optimization methods, database administrator tools, system utilities, and application programming interfaces. New facilities were added for registering XML schemas and using them to validate stored documents. New kinds of statistics on XML documents had to be gathered and exploited. Facilities for replicated, federated, and partitioned databases had to be updated to accommodate the new XML storage format. pureXML provides DB2 users with a new level of capability, but using this capability to full advantage requires users to have a new level of sophistication. A new user of pureXML is
confronted with many complex choices. What kinds of data should be represented in XML rather than in normalized tables? How can data be converted between XML and relational formats? How can a hybrid database be designed to take advantage of both data formats? What are the most appropriate uses for SQL, XQuery, and XPath? What kinds of indexes should be maintained on XML data? What is the XML equivalent of a NULL value? These and many other questions are considered in detail in the DB2 pureXML Cookbook. Matthias Nicola has been deeply involved in the design and implementation of DB2 pureXML since its inception. As a Senior Engineer at IBM’s Silicon Valley Laboratory, his work has focused on measuring and optimizing the performance of new storage and indexing techniques for XML. After the release of pureXML, he worked with many IBM customers and business partners to create, deploy, and optimize XML applications for government, banking, telecommunications, retail, and other industries. Pav Kumar-Chatterjee is a technical specialist with many years of experience in consulting with IBM customers throughout the UK and Europe on developing and deploying DB2 and XML solutions. Through their work with customers, Matthias and Pav have learned how to explain concepts clearly and how to identify and avoid common pitfalls in the application development process. They have also developed a set of “best practices” that they have shared at numerous conferences, classes, workshops, and customer engagements. Between them, Matthias and Pav have accumulated all the knowledge and experience you need to successfully create and deploy solutions using DB2 pureXML. Their expertise is encapsulated in this book in the form of hundreds of practical examples, tested and clearly explained. The book also includes a comprehensive set of questions to test your understanding. 
DB2 pureXML Cookbook includes both an introduction to basic XML concepts and a comprehensive description of the XML-related features of DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Chapters are organized around tasks that reﬂect the lifecycle of XML projects, including designing databases, loading and validating data, writing queries and updates, developing applications, optimizing performance, and diagnosing problems. Each topic provides a clear progression from introductory material to more advanced concepts. The writing style is informal and easy to understand for both beginners and experts. If you are an application developer, database administrator, or system architect, this is the book you need to gain a comprehensive understanding of DB2 pureXML.

Don Chamberlin
IBM Fellow, Emeritus
Almaden Research Center
April 10, 2009

Preface

In recent years XML has continued to emerge as the de facto standard for data exchange, because it is flexible, extensible, self-describing, and suitable for any combination of structured and unstructured data. With the increasing use of XML as a pervasive data format, there is a growing need to store, index, query, update, and validate XML documents in database systems. In response to this demand, IBM has developed sophisticated XML data management capabilities that are deeply integrated in the DB2 database system. This novel technology is called DB2 pureXML and is available in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. With pureXML, DB2 has evolved into a hybrid database system that allows you to manage both XML and relational data in a tightly integrated manner.

The DB2 pureXML Cookbook provides the single most comprehensive coverage of DB2’s pureXML functionality in DB2 for Linux, UNIX, and Windows as well as DB2 for z/OS. This book is a “cookbook” because it is more than just a description of functions and features (“ingredients”). This book provides “recipes” that show you how to combine the pureXML ingredients to efﬁciently perform typical user tasks for managing XML data. This book explains DB2 pureXML in more than 700 practical examples, including 250+ XQuery and SQL/XML queries, taking you from simple introductions all the way to advanced scenarios, tuning, and troubleshooting. Since the ﬁrst release of DB2 pureXML in 2006 we have worked with numerous companies to help them design, implement, optimize, and deploy XML applications with DB2. In this book we have distilled our experience from these pureXML projects so that you can beneﬁt from proven implementation techniques, best practices, tips and tricks, and performance guidelines that are not described elsewhere.

WHO SHOULD READ THIS BOOK?

This book is written for database administrators, application developers, IT architects, and everyone who wants to get a deep technical understanding of DB2’s pureXML technology and how to use it most effectively. As a DBA you will learn, for example, how to design and manage XML storage objects, how to index XML data, where to find XML-related information in the DB2 catalog, and how to manage XML with DB2 utilities. Application developers learn, among other things, how to write XML queries and XML updates with XPath, SQL/XML, and XQuery, and how to code XML applications with Java, .NET, C, COBOL, PL/1, PHP, or Perl.

This book is suitable for both beginners and experts. Each topic starts with simple examples, which provide an easy introduction, and works towards advanced concepts and solutions to complex problems. Extensive XML knowledge is not required to read this book because it includes the necessary introductions to XML, XPath, XQuery, XML Schema, and namespaces. These
concepts are explained through numerous examples that are easy to follow. We assume that you have some experience with relational databases and SQL, but we show all the relevant DB2 commands that are required to work through the examples in this book. Appendix C, Further Reading, also contains links to additional educational material about both DB2 and XML.

COVERAGE OF DB2 FOR Z/OS AND DB2 FOR LINUX, UNIX, AND WINDOWS IN THIS BOOK

The book describes DB2 pureXML on all supported platforms and versions, which at the time of writing are DB2 9 for z/OS as well as DB2 9.1, 9.5, and 9.7 for Linux, UNIX, and Windows. Many pureXML features and functions are identical across DB2 for Linux, UNIX, and Windows and DB2 for z/OS. Where platform-specific differences exist we point them out along the way. However, this book does not intend to be a reference that lists all functions and features according to platform and version of DB2. Instead, this book is a “cookbook” that focuses on concepts, examples, and best practices.

The capabilities in DB2 for z/OS and DB2 for Linux, UNIX, and Windows continue to grow and converge over time. For the latest information on which feature is available in which version, please consult the respective DB2 information center. DB2 for z/OS also continues to deliver pureXML enhancements via APARs. Please look at APAR II14426, which is an informational APAR that summarizes and links all other XML-related APARs for DB2 on z/OS.

In our work with users who adopt DB2 pureXML we have made the following observation: Some of the users who begin to use DB2 pureXML on Linux, UNIX, and Windows have little or no prior experience with DB2. In contrast, most users who are interested in DB2 pureXML on z/OS are already familiar with DB2 for z/OS in general. This difference is reflected in this book; that is, we describe some DB2 concepts, such as monitoring or the use of DB2 utilities, in more detail for DB2 for Linux, UNIX, and Windows than for DB2 for z/OS.

DO IT YOURSELF!

The best way to learn a new technology is hands-on. We strongly recommend that you download DB2 Express-C, which is free, and try the concepts that you learn in this book in DB2’s sample database. Appendixes A and B contain the necessary information to get you started.

DON’T HESITATE TO ASK QUESTIONS!

If any pureXML question is not covered in this book, the fastest way to get an answer is to post a question in the DB2 pureXML forum at http://www.ibm.com/developerworks/forums/forum.jspa?forumID=1423. Whether you seek clarification about specific features or functions or need help with a tricky query, this forum is the right place to ask for help. You are also welcome to contact the
authors directly. If you want to discuss an XML project or if you have comments or feedback on the material in this book—we will be happy to hear from you. Please contact Matthias at [email protected] and Pav at [email protected].

HOW THIS BOOK IS STRUCTURED

The DB2 pureXML Cookbook takes you through the different tasks and topics that you typically encounter during the life cycle of an XML project. Its 23 chapters are structured as follows:

Planning

Chapter 1, Introduction, provides an overview of XML and how it differs from relational data, and discusses scenarios where XML has advantages over the relational model. This chapter also includes a summary of the pureXML technology.

Chapter 2, Designing XML Data and Applications, covers fundamental XML design questions such as choosing between XML elements and attributes, selecting an appropriate XML document granularity, and deciding on a “good” mix of XML and relational data for your application.

Designing and Populating an XML Database

Chapter 3, Designing and Managing XML Storage Objects, first explains the tree representation of XML documents and how they are physically stored in DB2. Then it describes how to create and manage tables and table spaces for XML, including compression, reorganization, and partitioning.

Chapter 4, Inserting and Retrieving XML Data, looks at “full document” operations such as insert, delete, and retrieval of XML documents. This chapter also explains how to handle XML declarations, white space, and reserved characters in XML documents.

Chapter 5, Moving XML Data, looks at importing, exporting, loading, replicating, and federating XML data in DB2. A technique to split large XML documents into smaller ones is also demonstrated.

Querying XML Data

Chapter 6, Querying XML Data: Introduction and XPath, is the first of four chapters on querying XML data. This chapter provides an overview of the different options for querying XML, introduces the XPath and XQuery data model, and describes the XPath language in detail. These concepts are fundamental for the subsequent chapters.

Chapter 7, Querying XML Data with SQL/XML, explains how XPath can be included in SQL statements with the SQL/XML functions XMLQUERY and XMLTABLE and the XMLEXISTS predicate. The use of SQL/XML is illustrated through a rich collection of examples and a discussion of common mistakes and how to avoid them.

Chapter 8, Querying XML Data with XQuery, introduces the XQuery language, which is a superset of XPath. Among other things, this chapter describes XQuery FLWOR expressions, combinations of SQL and XQuery, and a comparison of XPath, XQuery, and SQL/XML.

Chapter 9, Querying XML Data: Advanced XML Queries and Troubleshooting, takes querying XML data to the expert level. It demonstrates how to perform grouping, aggregation, and joins over XML data or a mix of XML and relational data. The troubleshooting section discusses “bad” XML queries, common errors, and how to avoid both.

Converting, Updating, and Transforming

Chapter 10, Producing XML from Relational Data, begins the discussion of converting, updating, and transforming data. This chapter explains how to read relational data from existing database tables and construct XML documents from it.

Chapter 11, Converting XML to Relational Data, describes the opposite of Chapter 10, that is, the process of decomposing or shredding XML documents into relational tables. Two shredding methods are discussed, one using the XMLTABLE function and the other using annotated XML Schemas.

Chapter 12, Updating and Transforming XML Documents, covers three techniques for updating XML documents: full document replacement, XSLT transformations, and the XQuery Update Facility that allows you to modify, insert, delete, or rename individual elements and attributes within an XML document.

Performance and Monitoring

Chapter 13, Defining and Using XML Indexes, is one of two chapters dedicated to performance. It describes how to create XML indexes to improve query performance and explains under which conditions query predicates can or cannot use XML indexes.

Chapter 14, Performance and Monitoring, looks at analyzing the performance of XML operations with particular emphasis on understanding XML query access plans. A summary of best practices for XML performance in DB2 is also provided.

Ensuring Data Quality

Chapter 15, Managing XML Data with Namespaces, introduces XML namespaces and explains how they avoid naming conflicts and ambiguity, thus contributing to data quality. This chapter illustrates how to index, query, update, and construct XML documents that contain namespaces.

Chapter 16, Managing XML Schemas, first describes how XML Schemas can constrain XML documents in terms of their structure, element and attribute names, data types, and other characteristics. Then this chapter walks you through the concepts of registering, managing, and evolving XML Schemas in DB2.

Chapter 17, Validating XML Documents against XML Schemas, concentrates on the validation of XML documents to ensure XML data quality in DB2. You can validate XML documents in INSERT and UPDATE statements, queries, and import and load operations.

Application Development

Chapter 18, Using XML in Stored Procedures, UDFs, and Triggers, demonstrates how you can implement application-specific processing logic with XML manipulation in SQL stored procedures, user-defined functions, and triggers.

Chapter 19, Performing Full-Text Search, describes how the DB2 Net Search Extender and DB2 Text Search support efficient full-text search in collections of XML documents.

Chapter 20, Understanding XML Data Encoding, explains internal and external XML encoding, how DB2 determines and handles XML encoding, and how you can avoid code page conversion.

Chapter 21, Developing XML Applications with DB2, contains techniques and best practices for application programs that exchange XML data with the DB2 server. Code samples are provided for Java, .NET, C, COBOL, PL/1, PHP, and Perl programmers.

Reference Material

Chapter 22, Exploring XML Information in the DB2 Catalog, is a guide to how XML storage objects, XML indexes, and XML Schemas are listed in the database catalog.

Chapter 23, Test Your Knowledge—The DB2 pureXML Quiz, offers 82 questions to revisit specific topic areas.

The Appendixes list supporting information and further reading for each chapter.

Acknowledgments

Writing this book would not have been possible without the support from many people. For their support and technical reviews we would like to thank Andrew Eisenberg, Andy Lai, Bert van der Linden, Bob Harbus, Christian Daser, Cindy Saracco, Craig Mullins, Daniela Wersin, David Salinero, Don Chamberlin, Guogen Zhang, Henrik Loeser, Holger Seubert, Ian Cook, Jan-Eike Michels, Jason Cu, John Pickford, Lan Huang, Manfred Paessler, Mark Mezofenyi, Martin Sommerlandt, Paul Fletcher, Phil Nelson, Qi Jin, Shantanu Munkur, Stefan Momma, Susan Gausden, Susan Malaika, Susan Visser, Susanne Englert, Thomas Fanghaenel, Tiffany Money, Tim Kiefer, and Yuchu Tong. Thanks also to the many talented people in the DB2 pureXML development team who have implemented this exciting technology that we have the privilege of writing about.

About the Authors

Matthias Nicola is a Senior Software Engineer for DB2 pureXML at IBM’s Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, XML storage, indexing, and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Matthias has published more than a dozen articles on various XML topics (see www.matthiasnicola.de) and is a frequent speaker at DB2 conferences. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany.

Pav Kumar-Chatterjee has worked with DB2 since 1991 on DB2 for z/OS and since 2000 on DB2 for Linux, UNIX, and Windows. He is currently employed by IBM as a technical sales specialist for Information Management in the United Kingdom. He has helped customers implement the XML Extender product with DB2 V8 and has presented on DB2 and XML in the United Kingdom and around Europe.

Chapter 1

Introduction

XML, the eXtensible Markup Language, is the standard format for exchanging information between different systems, applications, and organizations. XML is also the underlying data format for many web applications, Service-Oriented Architectures (SOA), and message-based transaction processing systems. Enterprise application integration (EAI), enterprise information integration (EII), web services, the enterprise message bus (ESB), and standardization efforts in many vertical industries all rely on XML as the underlying technology for data exchange.

Organizations as well as entire industries have standardized XML Schemas to promote and simplify data exchange and are evolving those schemas to meet changing business needs. Many industry-specific initiatives as well as regulatory requirements are driving the adoption of XML. As more business transactions are conducted through web-based interfaces and electronic forms, government agencies and commercial enterprises face increasing requirements for preserving and post-processing the original transaction records. XML provides a straightforward means of capturing and maintaining the data associated with such electronic transactions.

XML uses tags to define elements and attributes that hold business data. The element and attribute tags describe the intended meaning of the data items, and the nesting of the tags describes hierarchical relationships between the data items. Hence, XML is a self-describing data format. Data and metadata are tightly integrated in a vendor- and platform-independent format. These properties make XML well-suited for data exchange. Additionally, new tags can be invented and easily added. This extensibility allows XML to accommodate ever-evolving business needs.

XML is a flexible data model that is suited for any combination of structured, unstructured, and semi-structured data. Also, XML documents can be modified and transformed, even into other
formats such as HTML. Furthermore, the consistency of XML documents can easily be veriﬁed with an XML Schema. All this has become possible through widely available standards and tools such as XML parsers, XSLT, XPath, XQuery, and XML Schema. They greatly relieve applications from the burden of dealing with proprietary data formats. In an era where message formats, business forms, processes, and services change frequently, XML often reduces the cost and time it takes to react to such changes and to maintain databases and application logic correspondingly. Beyond XML for data exchange, enterprises are keeping large amounts of business-critical data permanently in XML format. This practice has various reasons. Some businesses must retain XML documents in their original format for auditing and regulatory compliance. Common examples include legal and ﬁnancial documents as well as electronic forms. Another reason for using XML as a permanent storage format is that XML can be a more suitable data model than a relational schema. If business objects are inherently complex, hierarchical, semi-structured, or highly variable in nature, the ﬂexibility of XML offers advantages over a rigorously deﬁned relational database schema. Accustomed to the beneﬁts of mature relational databases, many users expect the same capabilities for XML data, such as the ability to persist, query, index, update, and validate XML data with full ACID (Atomicity, Consistency, Isolation, Durability) compliance, recoverability, high availability, and high performance. DB2 pureXML is the answer. 
The subsequent discussion in this chapter is structured along the following topics:
• Brief introduction to XML as a data format (section 1.1)
• Differences between XML and relational data (section 1.2)
• Overview of DB2 pureXML and its capabilities for managing XML data (section 1.3)
• Advantages of DB2 pureXML over alternative storage options for XML (section 1.4)
• Sample scenarios where XML can offer advantages over relational data (section 1.5)

1.1 ANATOMY OF AN XML DOCUMENT

In this section we illustrate the most important parts of an XML document. A complete and exhaustive discussion of the XML standard is outside the scope of this book. Pointers to textbooks and tutorials about XML are provided in Appendix C, Further Reading. Let’s look at the XML document in Figure 1.1 as an example. The ﬁrst line of the document contains the optional XML declaration. It indicates that this document follows the XML 1.0 standard, which is most commonly used. Besides XML 1.0, the only other version of XML is currently XML 1.1, which is very rarely used. We only consider XML 1.0 in this book. The XML declaration of the sample document in Figure 1.1 also carries an optional encoding declaration. Encoding concepts are discussed in Chapter 20, Understanding XML Data Encoding.

An XML document consists of elements and their attributes. Each element consists of a start tag and an end tag. These tags are enclosed in angle brackets. For example, the third line of the document shows a start tag <name> and an end tag </name>. Together they define a single XML element, the name element. The characters between the start and the end tag, Larry Menard, represent the value or the content of this element. Every start tag of an element must have a corresponding end tag. Elements can contain other elements, which means that tags can be nested. For example, the element addr contains the elements street, city, prov-state, and pcode-zip. Nesting builds hierarchical structures and expresses relationships between the elements. Elements can occur multiple times, in which case they are called repeating elements. For example, the phone element is a repeating element. It occurs multiple times because a single customer can have multiple phone numbers. Nested and repeating elements express one-to-many relationships between data items.

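[Figure 1.1 shows a sample customerinfo document with callouts for its parts: XML and encoding declaration, start tag of the root element, namespace declaration, element, element value (text node), attribute, attribute name, attribute value, comment, and end tag of the root element. The element values in the sample document are Larry Menard (name); 223 NatureValley Road, Toronto, Ontario, and M4C 5K8 (street, city, prov-state, and pcode-zip within addr); and 905-555-9146 and 416-555-6121 (two phone elements).]

Figure 1.1 Anatomy of an XML document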
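The terms introduced so far can be tried out with any XML parser. The following is a minimal sketch in Python, whose standard-library parser merely stands in for an XML parser of your choice. The document is a hypothetical reconstruction of the one in Figure 1.1: the element names and element values come from the text, whereas the Cid value, the phone type values, and the wording of the comment are placeholders invented for this sketch. The namespace declaration is left out to keep the lookups short.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the Figure 1.1 document. Element names and
# element values follow the chapter text; Cid, the phone types, and the
# comment text are placeholders, and the xmlns declaration is omitted.
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<customerinfo Cid="1000">
  <name>Larry Menard</name>
  <addr country="Canada">
    <street>223 NatureValley Road</street>
    <city>Toronto</city>
    <prov-state>Ontario</prov-state>
    <pcode-zip>M4C 5K8</pcode-zip>
  </addr>
  <phone type="work">905-555-9146</phone>
  <phone type="home">416-555-6121</phone>
  <!-- placeholder comment -->
</customerinfo>"""

root = ET.fromstring(DOC.encode("utf-8"))

# The root element is the outermost element of the document.
print(root.tag)

# The element value (text node) sits between the start tag and the end tag.
print(root.find("name").text)

# Nesting: addr contains street, city, prov-state, and pcode-zip.
print([child.tag for child in root.find("addr")])

# Repeating element: phone occurs once per phone number.
print([phone.text for phone in root.findall("phone")])
```

The two phone values come back as a list in document order, which is how the one-to-many relationship between a customer and the phone numbers shows up on the application side.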

Elements can also contain one or more attributes within their start tag. Attributes are used to attach additional information to elements. They consist of an attribute name, the equals sign (=), and a value in quotes. For example, the element addr has an attribute country whose value is Canada. Similarly, each occurrence of the element phone has an attribute type. Attribute values must be in quotes regardless of whether the value is considered a numeric or a string value.

For an XML document to be well-formed, it must have a single root element. The root element is the outermost element and contains all the other elements of the document. The root element in Figure 1.1 is customerinfo. It contains two attributes in its start tag, xmlns and Cid. The attribute Cid is used here to represent the customer identification number. The attribute xmlns is a reserved attribute and declares a namespace. Namespaces are optional and we defer their discussion to Chapter 15, Managing XML Data with Namespaces.

XML element and attribute names are case sensitive. Tags that differ only in case, for example <name>, <Name>, and <NAME>, are all completely distinct from each other. XML element and attribute names can contain letters, numbers, and certain other characters such as the underscore. However, tag names must not start with a number or punctuation character, must not start with the characters xml (or XML, xML, and so on), and must not contain spaces.

The order in which elements appear in a document is significant. The order in which attributes appear within the start tag of an element is not significant. In other words, elements are ordered, attributes are not ordered. When to use elements and when to use attributes to represent certain data items is a data modeling question and is addressed in Section 2.1, Choosing Between XML Elements and XML Attributes. Further discussion of XML documents and their hierarchical representation is provided in Section 3.1, Understanding XML Document Trees.
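Several of these rules can be checked mechanically. The sketch below again uses Python's standard-library parser as a stand-in for any conforming XML parser; the helper is_well_formed and the ext attribute are names made up for this illustration. Note that the sketch does not test the reserved xml name prefix.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the standard-library parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


checks = {
    "single root element": is_well_formed("<a><b/></a>"),          # accepted
    "two root elements": is_well_formed("<a/><b/>"),               # rejected
    "end tag differs in case": is_well_formed("<name>x</Name>"),   # rejected
    "unquoted attribute value": is_well_formed("<phone type=work/>"),  # rejected
    "name starts with a number": is_well_formed("<1abc/>"),        # rejected
}
for rule, accepted in checks.items():
    print(f"{rule}: {'accepted' if accepted else 'rejected'}")

# Attribute order is not significant: both start tags carry the same attributes.
first = ET.fromstring('<phone type="work" ext="12">905-555-9146</phone>')
second = ET.fromstring('<phone ext="12" type="work">905-555-9146</phone>')
same_attributes = first.attrib == second.attrib

# Element order is significant: swapping two children changes the sequence.
order_1 = [child.tag for child in ET.fromstring("<addr><street/><city/></addr>")]
order_2 = [child.tag for child in ET.fromstring("<addr><city/><street/></addr>")]
same_order = order_1 == order_2

print(same_attributes, same_order)
```

Comparing the attribute dictionaries treats the two phone elements as equal, while the child lists of the two addr elements differ, which mirrors the statement that elements are ordered and attributes are not.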

1.2 DIFFERENCES BETWEEN XML AND RELATIONAL DATA

For a comparison of XML and relational data, let’s consider the simple XML document and the relational table in Figure 1.2. The relational table has six columns with fixed names and data types. This table is a very strict and inflexible structure because every row in the table has to have exactly the same format with the same number of columns and the same data types. One row cannot have more or fewer columns than the next. It is also not possible for a column to have no data type or more than one data type. Each column has to have exactly one fixed data type. Moreover, the structure and data types of the table are defined before any data is inserted. Whenever data is inserted into or retrieved from this table, the format of the rows is known without looking at the actual data. The strict schema provides a lot of information about the data and its format, which allows for very efficient access.

The XML document on the left side of Figure 1.2 represents data similar to the row in the table on the right. With DB2 pureXML you can store, index, query, and update this XML document even if there is no XML Schema that defines its structure or the data types of its elements. You may have an XML Schema for this XML document, but you don’t have to. The document itself contains some metadata that describes the data items, but no further schema information is necessary to store and query this document.


Robert Shoemaker 845 Kean Street Aurora 905-555-7258

CREATE TABLE address(cid INTEGER, name VARCHAR(30), street VARCHAR(40), city VARCHAR(30), email VARCHAR(50), phone VARCHAR(20))

CID   NAME              STREET           CITY    EMAIL  PHONE
1003  Robert Shoemaker  845 Kean Street  Aurora  NULL   905-555-7258

Figure 1.2

XML document (left) and relational table (right)

Assume you receive information about another customer whose street name is 42 characters long. Inserting this information into the relational table fails with an error that needs to be handled. This error can be desirable because it enforces a certain constraint, but it can also be undesirable because it prevents the new information from being stored and processed immediately. Because XML allows more schema ﬂexibility, a document with a 42-character street name can be inserted without an error. The absence of an error can be desirable because it allows the data to be stored immediately, but it can also be undesirable because the excessive length of the street value goes undetected and can cause problems in later processing steps. Clearly, the ﬂexibility of XML needs to be used with care and only to the degree that is appropriate for a given application. Optionally, you can choose to use an XML Schema that constrains the XML document as strictly as the relational table in Figure 1.2. You could also choose to use a less stringent XML Schema. For example, you could use an XML Schema that requires the Cid value to be an integer and the name to not exceed 30 characters, leaving the data types of all other data items unconstrained. You can choose the degree of schema ﬂexibility that is right for your application. Note that the relational table in Figure 1.2 contains a NULL value in the column email. In the XML document, an email element is simply omitted if this customer does not have email. Optional XML elements are another form of schema ﬂexibility. Assume you receive information about a customer where, unexpectedly, the name of his assistant is included. The assistant name can easily be accommodated with an optional assistant element in an XML document. However, the relational table in Figure 1.2 does not allow the assistant name to be stored. Next, let’s consider a schema change. 
Due to unforeseen changes in your business, you now need to store multiple phone numbers per customer. Reacting to this change is simple with XML. The document on the left side of Figure 1.3 simply uses multiple occurrences of the phone element. The repeating phone elements represent the new one-to-many relationship between customers and phones. Existing XPath queries that read phone elements do not change. Accommodating multiple phone numbers per customer in the relational schema requires normalization, which is a drastic schema change. Existing SQL queries must be modified to perform the proper join between the two relational tables. Downtime and service interruptions are likely.

Robert Shoemaker 845 Kean Street Aurora 905-555-7258 416-555-2937

CREATE TABLE phones(cid INTEGER, phone VARCHAR(20))

CID   PHONE
1003  905-555-7258
1003  416-555-2937

CREATE TABLE address(cid INTEGER, name VARCHAR(30), street VARCHAR(40), city VARCHAR(30), email VARCHAR(50), phone VARCHAR(20))

CID   NAME              STREET           CITY    EMAIL  PHONE
1003  Robert Shoemaker  845 Kean Street  Aurora  NULL   905-555-7258

Figure 1.3

A schema change in XML and relational data
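The claim that existing path expressions keep working when phone elements start to repeat can be tried out with a small sketch. It uses Python's standard xml.etree.ElementTree module, not DB2, and the element names customerinfo, name, phone, and email are assumptions based on the surrounding text, because the figure's markup is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Assumed markup for the customer document before and after the change.
before = """<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <phone>905-555-7258</phone>
</customerinfo>"""

after = """<customerinfo Cid="1003">
  <name>Robert Shoemaker</name>
  <phone>905-555-7258</phone>
  <phone>416-555-2937</phone>
</customerinfo>"""

PHONE_PATH = "./phone"  # the same path expression is used for both documents


def phones(document):
    """Return all phone values of a customer document."""
    root = ET.fromstring(document)
    return [p.text for p in root.findall(PHONE_PATH)]


def email(document):
    """Return the email value, or None if the optional element is omitted."""
    node = ET.fromstring(document).find("./email")
    return None if node is None else node.text


phones_before = phones(before)
phones_after = phones(after)
email_after = email(after)
```

The unchanged path returns one value for the old format and two for the new one. The omitted email element simply yields no node, which plays the role of the NULL in the relational row.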

Some of the key differences between XML and relational data are summarized in Table 1.1. The flexibility of XML implies that examining and interpreting XML data can consume more computing resources than if the same data was stored in relational form. The reason is that information about the structure of the XML data needs to be discovered at runtime because a fixed schema is not always present. The relational data model relies on much more rigid schema definitions than XML. For a relational table in a database, the structure of a row and the size and data types of its columns are known as soon as the table is created. Therefore, data access is more straightforward and can be more efficient than for XML data. As such, relational data can provide very high performance but might fail to meet application requirements for schema flexibility.

Table 1.1

Comparison of Relational and XML Data

Relational Data | XML Data
Highly structured, highly regular in nature | Semi-structured, can be highly variable in nature
Rows are flat | Data is hierarchical, can be arbitrarily nested
Fixed schema and metadata | Variable schema and metadata
Fixed number of columns per table | No fixed format, flexible number of elements and attributes per document
Fixed data type for all values in a column | Data types are optional and can be variable
Data format defined by DDL, known at query/insert/update compile time | Data format not necessarily predefined, not known until query/insert/update runtime
NULL values represent missing information | Optional elements and attributes can be omitted
Schema changes can be expensive | Schema changes are less expensive

In some cases, the nested and ﬂexible structure of XML can offer performance beneﬁts over relational schemas. Relational databases often require normalization to ﬁt business data into ﬂat, tabular structures. This normalization of complex business data requires transformation when data is stored and retrieved, and often leads to multi-way join queries in relational databases. XML can provide a more natural representation of complex business objects with all relevant relationships represented in a single document. The hierarchies within an XML document are essentially precomputed joins between related data items.
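The idea of a hierarchy as a precomputed join can be sketched by retrieving the same customer-to-phone pairs in two ways. The example uses Python's sqlite3 and xml.etree.ElementTree modules as stand-ins for a relational database and an XML store; it is not DB2, the address table is reduced to two columns, and the XML element names are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational form: two normalized tables that must be joined at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address(cid INTEGER, name VARCHAR(30))")
conn.execute("CREATE TABLE phones(cid INTEGER, phone VARCHAR(20))")
conn.execute("INSERT INTO address VALUES (1003, 'Robert Shoemaker')")
conn.executemany(
    "INSERT INTO phones VALUES (?, ?)",
    [(1003, "905-555-7258"), (1003, "416-555-2937")],
)
joined = conn.execute(
    "SELECT a.name, p.phone FROM address a "
    "JOIN phones p ON a.cid = p.cid ORDER BY p.rowid"
).fetchall()
conn.close()

# XML form: the phones are already nested under their customer, so the
# relationship is read by navigation instead of being computed by a join.
doc = ET.fromstring(
    '<customerinfo Cid="1003">'
    "<name>Robert Shoemaker</name>"
    "<phone>905-555-7258</phone>"
    "<phone>416-555-2937</phone>"
    "</customerinfo>"
)
customer_name = doc.findtext("name")
nested = [(customer_name, p.text) for p in doc.findall("phone")]
```

Both approaches produce the same pairs. The difference is that the relational version matches cid values at query time, while the XML version only walks from the customer element to its children.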

1.3

OVERVIEW OF DB2 PUREXML

This section provides a condensed overview of the DB2 pureXML technology. It summarizes the most important aspects of DB2 pureXML, which are described in more detail in the remainder of this book.

At the core of DB2 pureXML is the data type XML, which has been added to the SQL type system in the SQL:2003 standard. Database users can define tables that contain one or multiple columns of type XML. In each row, a column of type XML contains either a well-formed XML document or NULL. A table that contains one or more XML columns can also contain other columns, such as INTEGER, VARCHAR, or DATE columns. Hence, users can define tables that hold both XML data and traditional relational data in each row of the table. The integration of XML and relational data is therefore very easy. It is also possible to create a table that only contains a single column of type XML and no other columns.

DB2’s internal XML storage mechanism does not store XML data as text in large objects (LOBs) and does not convert XML to relational format. When you insert or load XML documents into a column of type XML, DB2 stores the XML documents in a parsed hierarchical format. Each XML document is parsed only once; that is, when it is first inserted into an XML column. The parsed storage format allows queries and updates to operate on XML data without XML parsing, which is a key performance benefit. The maximum XML document size is 2GB.

You can use regular SQL statements to insert, delete, and update (replace) full XML documents. XML insert, update, and delete operations are logged by default and XML data is always buffered in the buffer pool. XML data participates in backup, restore, and recovery operations just like traditional relational data in the database. XML data can be compressed, replicated, and federated, and is allowed in range-partitioned tables, clustered tables (MDC), and partitioned database environments (DPF). Partitioning keys and clustering keys must be relational columns. All the critical database utilities support XML data, such as LOAD, UNLOAD, IMPORT, EXPORT, RUNSTATS, REORG, BACKUP, RESTORE, and others. In DB2 for Linux, UNIX, and Windows, XML columns are also supported by High Availability Disaster Recovery (HADR).

An XML Schema can be used to constrain XML documents, but the usage of XML Schemas is optional in DB2. In particular, you do not need to provide an XML Schema to create a column of type XML or to insert XML documents. DB2’s pureXML storage format does not depend on XML Schemas. When you insert, update, or load XML documents, you can choose to validate the documents against one or multiple XML Schemas. If you choose to validate documents, the validation and the association of schemas to documents happens on a per-document basis, not on a per-column basis. DB2 does not require all documents in an XML column to belong to the same XML Schema, although you can enforce that with triggers if you want. Since schema flexibility is often a key reason for using XML, DB2 allows documents for multiple schemas, or multiple versions of a schema, to coexist in a single XML column. XML Schema evolution is seamless and does not require any database downtime. The use of XML Schemas for document validation can help applications ensure XML data quality. However, there is no performance penalty if you store XML documents without validation in DB2.

Although XML Schemas can constrain one XML document at a time, there is no standard or XML technology yet to define constraints or referential integrity across XML documents or across XML and relational data. However, when you insert XML documents into a table you can choose to extract selected element or attribute values into relational columns. DB2 can perform such value extraction as part of the INSERT statement, but it can also be automated with triggers. Then you can define relational constraints, such as foreign keys and check constraints, on the populated relational columns.

In DB2, XML data can be queried with XPath and SQL/XML, and in DB2 for Linux, UNIX, and Windows, also with XQuery. The SQL/XML standard allows XPath and XQuery expressions to be embedded in SQL statements so that XML and relational data can be queried together in a single query. Joins between XML columns or between XML and relational columns are possible. The SQL/XML function XMLTABLE can be used to query XML data and return the result set in relational format. Other SQL/XML functions support the opposite; that is, to query traditional relational tables to construct and return XML documents that contain the data values.

To ensure high performance for XML queries, DB2 allows you to create XML indexes on specific XML elements and attributes that you specify with an XPath. Similar to the relational world, it makes sense to index those XML elements and attributes that are frequently used in query predicates and join conditions. Although you can decide to index all elements and all attributes in all documents in an XML column, you are not forced to do so. Indexing selected elements and attributes is often preferred. If you define an XML index on an optional element that, for example, occurs in only 5% of the documents (rows), then the index is quite small because it contains entries only for those 5% of the documents and rows in the table. In contrast, relational indexes always contain exactly one entry for each row in a table. If a query contains relational predicates and XML predicates, DB2 can use a combination of XML and relational indexes to evaluate the query.

DB2’s RUNSTATS utility can collect statistics for XML data which the DB2 optimizer uses to create efficient query execution plans. Although DB2 uses separate storage formats for XML and relational data, DB2 only has a single processing engine and a single query compiler and optimizer that handle any mix of relational and XML queries. DB2’s EXPLAIN facility can be used to examine the execution plans for XML queries just like for relational queries.

DB2 for Linux, UNIX, and Windows also supports XQuery Updates to modify, insert, delete, or rename individual XML elements and attributes within an XML document. XSLT transformations as well as full-text search over XML data are also supported.

Access control as well as concurrency control (locking) for XML data happens on the level of full documents. Since each XML document belongs to a row in a table, access control and concurrency control for a particular row determines the accessibility of the XML document in that row. Access rights and privileges cannot be defined for individual elements within an XML document.

The XML data type can be used for more than just the definition of XML columns. For example, you can define XML parameters and XML variables in SQL stored procedures and user-defined functions (UDFs). Such procedures and UDFs can contain XQuery or SQL/XML statements to manipulate XML documents while they remain in DB2’s internal parsed format.

Application development for DB2 pureXML is based on existing but enhanced APIs. The traditional database APIs such as JDBC, ODBC/CLI, ADO.NET, or embedded SQL all support XQuery and SQL/XML statements as well as the exchange of XML data between a DB2 server and a client application. The JDBC 4.0 standard defines a new Java data type SQLXML to match the data type XML defined by the SQL standard. Similarly, you can define XML host variables in COBOL, C, PL/I, and Assembler.

With DB2 pureXML, applications can often avoid XML parsing, because DB2 stores XML documents in a parsed format. The parsed storage allows you to extract or update document fragments or individual values without having to parse the XML data in your application. Applications send appropriate XML query or update statements to DB2 instead of fetching and parsing full documents. As a result, using DB2 pureXML leads to less application code, reduced application complexity, and higher end-to-end performance.

Both the DB2 Control Center and IBM Data Studio support DB2 pureXML through a variety of wizards and visual interfaces. For example, you can view the tree structure of XML documents, create XML indexes with point-and-click into XML documents, design and register XML Schemas, or build XQuery and SQL/XML statements with context assist in Data Studio’s statement editor.

1.4

BENEFITS OF DB2 PUREXML OVER ALTERNATIVE STORAGE OPTIONS FOR XML DATA

Prior to the availability of DB2 pureXML, the two main storage options for XML data in relational databases were LOB storage and shredding:

• The LOB storage approach stores full XML documents in their textual form in character or binary large object columns (CLOB or BLOB). Other columns in the same table typically contain document identification numbers or other information that helps applications to identify specific XML documents for retrieval or replacement. The main problem of this approach is that the XML documents are stored as if they were arbitrary pieces of text. The XML structure is ignored and not immediately visible. Therefore any operation that needs to access individual elements or attributes in a document requires XML parsing. For example, any query that extracts element values requires XML parsing at runtime. The resulting parsing overhead for query and update execution is a major performance problem that renders LOB storage inadequate for most XML applications.

• Shredding (decomposing) XML documents into relational tables converts XML data into relational format. Shredding first requires a design stage where an administrator maps XML elements and attributes to relational columns. When XML documents are inserted, they are parsed, broken up, and only their atomic data values are retained (see Figure 1.4). These values are inserted into the relational target tables by a series of INSERT statements. After an XML document has been shredded, its values are stored in these tables without the original XML tags. Depending on the complexity of the XML documents, shredding can require dozens or hundreds of relational tables to represent all the hierarchical relationships among the original XML elements and attributes. In many real-world XML applications this complexity is so staggering that even the mapping task is considered prohibitively expensive or unfeasible. Queries over decomposed XML data often require multi-way SQL joins that tend to be difficult to develop and tune. Changes or variability in the XML input format often break the mapping to the relational database schema, which incurs time-consuming maintenance. A fixed schema mapping that is costly to change negates the flexibility for which XML is typically used.

Figure 1.4

DB2 pureXML and alternative XML storage options. Panels: LOB storage (stores XML as text; XML documents in a CLOB column), Shredding (XML to relational; a schema mapping and a shredder populate regular relational tables), DB2 pureXML (stores XML as XML; XML documents in an XML column with an XML index).

DB2 pureXML has been designed to overcome the problems that are inherent in LOB storage and shredding. The advantages of DB2 pureXML and its native XML storage format include:

• Retaining awareness of the internal structure of the XML data: Contrary to LOB storage, DB2 pureXML stores XML in a parsed tree format that explicitly represents the structure of each XML document. As a result, applications can query and update XML data using XQuery, XPath, and SQL/XML without XML parsing at runtime. This is a critical performance benefit. Additionally, query performance can be enhanced by creating indexes on specific elements and attributes in the XML documents.

• Keeping business objects intact: DB2 pureXML stores each XML document as a cohesive unit that belongs to one row in a table, providing a very intuitive storage and processing model for the application developer. In contrast, XML shredding scatters the values of each XML document over a number of tables. Hence, shredding can result in an unwieldy relational schema that is difficult to understand and inefficient for queries and the reconstruction of XML documents.

• Schema flexibility: While shredding requires all XML documents to adhere to a single XML Schema that is mapped to relational tables, DB2 pureXML can store documents for variable or evolving schemas in the same XML column. The cost of schema evolution is much lower for DB2 pureXML than for a shredding approach.

• Faster application development: Because DB2 pureXML does not require any schema mapping and uses a single XML column instead of a complex relational schema, prototyping and designing applications can be much simpler with DB2 pureXML than with shredding.

1.5

XML SOLUTIONS TO RELATIONAL DATA MODEL PROBLEMS

The data model that you use for your business data should allow for an easy and intuitive representation of your data and should efﬁciently support the most critical usage and access patterns. If the data being modeled is naturally tabular, it is typically better to represent it in relational format than as XML. However, there are cases where the relational model is not necessarily the best choice and sometimes even a poor choice to hold your data. The following are some situations where an XML representation tends to be more beneﬁcial than the relational format.

1.5.1 When the Schema Is Volatile

Problem with relational data: If the schema of the data changes often, then a relational representation of the data is subject to costly relational schema changes. Although some forms of schema modification are relatively painless in relational databases, such as adding a new column to a table, other forms are more involved, such as dropping a column or changing the type of a column. Still other forms of schema modification are extremely difficult, such as normalizing one table into multiple tables. Changing the tables means that the SQL statements in the applications that access them must also be changed.

Solution with XML data: Portions of the schema that are volatile can be expressed as a single XML column. The self-describing and extensible nature of XML allows seamless handling of schema variability and evolution. Changes in the XML document format are accommodated without changing tables or columns in the database and typically without breaking existing XML queries.

1.5.2 When Data Is Inherently Hierarchical in Nature

Problem with relational data: Data that is inherently hierarchical or recursive is often difficult to represent in relational schemas. Examples include a bill of materials, engineering objects, or biological data. A bill of materials explosion can be stored in a relational database but reconstructing it in parts or in full might require recursive SQL.

Solution with XML data: Since XML is a hierarchical data model, it is a much more natural fit for inherently hierarchical business data. Using XML allows simple, navigational data access to replace complex set operations, which would be required if the same data was represented in tabular format.
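A bill of materials shows what navigational access looks like in practice. The sketch below uses Python's standard xml.etree.ElementTree module, not DB2, and the part element with its name attribute is a made-up document format chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical bill of materials; nesting expresses the "consists of" relation.
bom = ET.fromstring("""
<part name="bicycle">
  <part name="frame"/>
  <part name="wheel">
    <part name="rim"/>
    <part name="spoke"/>
  </part>
</part>""")


def explode(part, depth=0):
    """Return (depth, name) pairs for a part and all parts nested below it."""
    rows = [(depth, part.get("name"))]
    for child in part.findall("part"):
        rows.extend(explode(child, depth + 1))
    return rows


# Full explosion of the bill of materials, in document order.
exploded = explode(bom)

# Partial reconstruction: only the direct components of the wheel.
wheel_parts = [p.get("name") for p in bom.findall("./part[@name='wheel']/part")]
```

The full explosion is a plain tree walk, and a partial explosion is a single path expression. In a tabular design, the same result would typically need a parent-child table and a recursive query.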

1.5.3 When Data Represents Business Objects

Problem with relational data: If application data represents business objects, such as insurance claim forms, then it is often beneficial to keep the data items that comprise a particular claim together, instead of spreading them over a set of tables. This benefit is particularly important when the individual data items of a claim form have no valid business meaning by themselves and can only be interpreted in the context of the complete form. Normalizing the claims across dozens of relational tables means that applications deal with a complex and unnatural fragmentation of their business data. Such normalization can increase complexity and the chance for errors.

Solution with XML data: XML enables you to represent even complex business objects as cohesive and distinct documents while still capturing all the relationships between the data items that comprise the business object. Representing each claim form (business object) as a single XML document in a single row of a table provides a very intuitive storage model for the application developer and enables rapid application development.

1.5.4 When Objects Have Sparse Attributes

Problem with relational data: Some applications have a large number of possible attributes, most of which are sparse; that is, they apply to very few objects. A classic example is a product catalog where the number of different product attributes can be huge, including size, color, weight, length, height, material, style, weave, voltage, resolution, water resistance, and a near endless list of other properties. For any given product, only a subset of these attributes is relevant. One possible relational schema is to have one column per attribute, which means a very large percentage of the cells in the table contain NULL values. Large numbers of NULLs are undesirable and can be inefficient. A different relational approach for such sparse data is a three-column table that stores several name/value pairs for each product ID. In this name/value pair approach, the attribute names are not column names but values in a VARCHAR column. This design prevents relational database systems from accurately estimating constraint selectivity and generating efficient query plans. Finally, defining and enforcing constraints, such as uniqueness for a certain attribute, is extremely difficult. Hence, data quality and integrity suffer.

Solution with XML data: The beauty of XML is that elements and attributes can be optional, so they are simply omitted if they don’t apply to a specific product. Neither NULL values nor name/value pairs are needed. The XML Schema can define a very large number of optional elements, but only a few of them are used for any given object. While every row in a relational table has to have the exact same columns, XML documents in an XML column can have different elements from one row to the next. Also, an XML index for an optional element is very small if this element appears only in a small percentage of the documents (rows). This is a clear advantage over relational indexes which have exactly one entry per row.
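The sparse-attribute case can be illustrated with a tiny product catalog. This is a sketch in Python's standard xml.etree.ElementTree module, not DB2; the catalog format is hypothetical, and the sorted list at the end is only a toy stand-in for an XML index on an optional element, not a description of how DB2 builds its indexes.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: each product carries only the properties that apply.
catalog = ET.fromstring("""
<catalog>
  <product id="1"><name>Lamp</name><voltage>230</voltage></product>
  <product id="2"><name>Scarf</name><color>red</color><weave>twill</weave></product>
  <product id="3"><name>Drill</name><voltage>110</voltage><weight>1.8</weight></product>
</catalog>""")

products = catalog.findall("product")

# Products that have a voltage element; the others simply omit it.
with_voltage = [p.get("id") for p in catalog.findall("product[voltage]")]

# Toy "index" on the optional voltage element: one entry per occurrence,
# so products without a voltage contribute nothing.
voltage_index = sorted(
    (v.text, p.get("id")) for p in products for v in p.findall("voltage")
)

# The name/value pair alternative from the text, for comparison: property
# names become data values instead of element names.
name_value_rows = [(p.get("id"), child.tag, child.text) for p in products for child in p]
```

Three products yield only two index entries, because the scarf has no voltage element. The name/value rows hold the same information but move every property name into a data column.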

1.5.5 When Data Needs to be Exchanged

Problem with relational data: If you export a set of rows from a relational table and send them to another application or organization, the recipient cannot interpret the data without additional metadata that describes the columns. This separation of data from metadata in the relational world poses a particular problem if your relational schema has changed since the last time you sent data.

Solution with XML data: XML data is self-describing. The XML tags are metadata and describe the values that they enclose. The nesting of XML elements further defines the relationship between data items.

1.6

SUMMARY

XML, the extensible markup language, acts as a flexible and self-describing data format for data exchange, web services, and service-oriented architectures. XML is also a hierarchical data model that is inherently different from the relational model. While relational data processing is based on rigorous and predefined schemas that allow for limited flexibility, XML is well-suited to represent data with variable or evolving schemas. XML is also commonly used as a data format for semi-structured data or to integrate structured and unstructured data.

Depending on the performance and flexibility requirements of particular applications, you will find that in some cases XML is a better choice than a relational schema, and in other cases relational data has advantages over XML. Many scenarios also exist in which a hybrid approach, that is, a mix of XML and relational data, is the best solution. Considerations for hybrid data models are discussed further in the next chapter.

DB2 pureXML provides sophisticated capabilities for storing, indexing, querying, updating, and validating XML documents. The pureXML technology and its native XML storage format provide significantly higher performance and flexibility than alternative storage options for XML data, such as LOBs or shredding. DB2 pureXML also enables seamless integration of XML and relational data.

CHAPTER 2

Designing XML Data and Applications

This chapter looks at several design issues in the world of XML documents. Sometimes you might get involved in the design of a specific format for your XML documents and you will find that the design decisions made at this point can have a big impact on how your application processes XML. Therefore, this is the first stage of XML application design. In many other cases, the format of the XML documents that you need to process may have already been designed and decided by the time you get involved. Many vertical industries and consortia define specific XML Schemas to standardize the XML document formats that are used to exchange and process information within a particular industry. Some of them are discussed in Chapter 16, Managing XML Schemas. Even if you work with a predefined XML format, there are still decisions to be made, such as the most suitable granularity in which you should store XML documents or document fragments.

In this chapter you learn

• How to choose between XML elements and attributes (section 2.1)
• How to represent data as XML values and metadata as XML tags (section 2.2)
• How to design documents with an appropriate size and scope (section 2.3)
• How to decide on a “good” mix of XML and relational data (section 2.4)

2.1

CHOOSING BETWEEN XML ELEMENTS AND XML ATTRIBUTES

A common question is when to use attributes and when to use elements, and whether this choice affects performance. It turns out that this is much more of a data modeling question than a performance question. As such, this question is as old as SGML, the precursor of XML, and has been hotly debated with no universally accepted consensus. However, a key thing to remember is that XML elements are more flexible than attributes because they can be repeated and nested.

Table 2.1 shows an example of an XML document with and without attributes. Both documents logically represent the same business data. They contain information about a book called “Database Systems”, written by authors “John Doe” and “Peter Pan” who have id numbers 47 and 58 respectively, and the price of the book is 29, but there is no information in either document about the currency of the price. In the document on the left of Table 2.1, price and title are child elements of the element book, and the author id is a child element of the element author. This approach is certainly a decent way of modeling the data. Alternatively, the document on the right has price and title as attributes of the element book, and id as an attribute of the element author. In general, both versions of the document, with and without attributes, can be reasonable choices. There is no immediate way to decide whether one of the two document formats is “better” than the other.

Table 2.1

An XML Document with and without Attributes

XML document without attributes:
47 John Doe 58 Peter Pan Database systems 29 SQL relational

XML document with attributes:
John Doe Peter Pan SQL relational

The document with attributes might be appealing because it is shorter. It contains 200 nonwhitespace characters as opposed to 248 in the document without attributes. An XML parser needs to look at every single character of a document, which generally means that shorter documents can be parsed faster. This reduction in parsing times may matter if you are designing an XML message format for very high-volume processing with near real-time performance requirements and throughput targets such as thousands of messages per second. However, many XML applications do not fall into this category and performance should be a secondary concern during XML modeling.

More important is the flexibility and extensibility of the XML format, which is usually why XML is chosen to begin with. In the example in Table 2.1, chances are that the format of the price information eventually needs to be extended. This extension is easy in the document on the left where price is an element. For example, you can add an attribute currency to the price element to make it more descriptive. Also, as the business expands to international markets, you can easily repeat the price element multiple times to reflect the price of the book for different countries (see Figure 2.1).

47 John Doe 58 Peter Pan Database systems 29 5735 35.80 SQL relational

Figure 2.1 Document with multiple price elements

This extension of the price element has the very desirable property that XPath queries that worked for the old document format continue to work without changes for the new format. For example, the XPath /book/price returns the single price element from the document on the left in Table 2.1, but also all three price elements with their currency information from the new document format in Figure 2.1. This property helps to ensure seamless operation of applications during such a schema evolution.

In the document on the right side of Table 2.1, where price is an attribute, such an extension is a lot harder to make if you want to keep using attributes. The existing price attribute cannot be extended to contain another nested attribute, and an attribute by the name of price can only occur once for the book element. You could certainly remove the existing price attribute and use price elements instead. This change implies that for older documents the XPath to the price information is /book/@price whereas for newer books it is /book/price. Thus, this change is invasive and indicates that you probably should have used elements to begin with.

In such a situation you should not use multiple price attributes with different names, as shown in Figure 2.2. This design has a variety of undesirable consequences. First of all, XPath queries need to be changed each time you introduce a new currency to your business. Second, this design makes it more complicated to retrieve all price information with a single query. Third, if your queries use search conditions on the price attributes then you will have to define a separate XML index for each currency, instead of just two indexes (one for price and one for currency). These problems stem from the fact that the currency information is part of your business data, not part of the metadata. Hence, the currency should be a value and not part of a tag name. The use of tags and values is discussed further in section 2.2.

...

Figure 2.2 Bad XML design with different names for price attributes
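The contrast between repeated price elements and differently named price attributes can be tried out with a short Python sketch. The original listings did not survive extraction, so the documents below are hypothetical stand-ins: the attribute names price_usd, price_jpy, and price_eur, the pairing of currency codes with amounts, and both helper functions are assumptions made for illustration. Python's xml.etree.ElementTree supports only a subset of XPath, so the relative path "price" evaluated at the root book element plays the role of /book/price.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; names and values are illustrative assumptions.
OLD_DOC = "<book><title>Database systems</title><price>29</price></book>"
NEW_DOC = (
    "<book><title>Database systems</title>"
    '<price currency="USD">29</price>'
    '<price currency="JPY">5735</price>'
    '<price currency="EUR">35.80</price>'
    "</book>"
)
ATTR_DOC = '<book price_usd="29" price_jpy="5735" price_eur="35.80"/>'


def prices_from_elements(xml_text):
    """Return (currency, amount) pairs using one fixed path for every format."""
    book = ET.fromstring(xml_text)
    # "price" relative to the root <book> corresponds to /book/price.
    return [(p.get("currency"), p.text) for p in book.findall("price")]


def prices_from_attributes(xml_text, known_currencies):
    """Attribute-per-currency design: the caller must enumerate every name."""
    book = ET.fromstring(xml_text)
    found = []
    for code in known_currencies:
        value = book.get("price_" + code.lower())
        if value is not None:
            found.append((code, value))
    return found


old_prices = prices_from_elements(OLD_DOC)
new_prices = prices_from_elements(NEW_DOC)
# A query written before EUR was introduced silently misses the new price.
stale_prices = prices_from_attributes(ATTR_DOC, ["USD", "JPY"])
print(old_prices, new_prices, stale_prices)
```

The element-based function needs no change when a currency is added, whereas the attribute-based function only returns what its caller already knows to ask for.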

Also note that the XML standard speciﬁes that elements are ordered while attributes are unordered. For example, the three price elements in Figure 2.1 are in a ﬁxed order, and this order is guaranteed when the document is parsed, stored, queried, or otherwise processed. In contrast, the three price attributes in Figure 2.2 do not have a signiﬁcant order within the book element. They could appear in a different order and the document would still be considered “the same.” Hence, if the relative order among your data items is important, use elements instead of attributes. Although you could model all data without attributes, they can be a very intuitive choice for data items that are known in advance to never repeat (per element) nor have any subﬁelds. Attributes contribute to somewhat shorter XML because they have only a single tag as opposed to elements, which have a start tag and an end tag. Shorter attribute tags are at most a minor performance bonus rather than an incentive to convert elements to attributes, especially when data modeling considerations actually call for elements. In DB2, attributes can be used in queries, updates, predicates, and index deﬁnitions just as easily as elements. There is generally no signiﬁcant performance difference between accessing or updating elements versus attributes when XML documents are stored in DB2. Both elements and attributes can be deﬁned as mandatory or optional in an XML Schema. As another example, let’s look at the XML document in Figure 2.3, which contains information about a department with two employees. The document uses attributes for the department and employee identiﬁers. This approach seems to make sense because each employee and department will always have just one ID value. Furthermore, an element is used for the employee telephone information, which allows an employee to have multiple occurrences of the phone element if needed. 
It is also extensible in case you later need to break telephone numbers into fragments. For example, the phone element could have child elements for country code, area code, and extension, which would not be possible if phone was an attribute.

[XML listing for Figure 2.3: markup lost in extraction. Surviving text content: John Doe, 408-555-1212, 408-463-4880, Peter Pan, 408-255-8587, F589]

Figure 2.3 A sample XML document

The XML document in Figure 2.3 also raises another design question, which we discuss in section 2.3: Is it better to keep the information for all employees of a department in one document, or is it better to have one XML document per employee?

2.2 XML TAGS VERSUS VALUES

The idea of XML as an extensible markup language is that the markup, which consists of all the element and attribute tags, describes the enclosed data values. The ability to use custom tags for markup makes XML a self-describing data format. The XML tags can also be considered metadata. Hence, XML documents conveniently combine data and metadata in a universally accepted format.

An important aspect of designing XML documents is to distinguish clearly between data and metadata. The metadata should be represented as element and attribute names, the data as element and attribute values. This approach is analogous to relational modeling, where table and column names are metadata, and the values in the columns are the actual data. In XML it’s almost always a bad idea to represent metadata as values instead of tags, or actual data as tags instead of values.

Let’s look at the examples in Table 2.2 and Table 2.3. The document on the left side of Table 2.2 contains information about the brand, price, and year of a car. The brand is Honda, the price is 5000, and the year is 1996. The terms “brand”, “price”, and “year” constitute meta information for the values Honda, 5000, and 1996. Hence, Honda is a data value, not metadata. Therefore it should be an XML element value, not an element name. The XML document on the right side of Table 2.2 is a better representation of the same data. There the term “brand” is used as an element name (meta information) for the value Honda. Imagine yourself modeling the same data in a relational table. You would not use Honda as a column name in a table. Avoiding business data in tag names has several advantages:

• If you are using an XML Schema, you don’t need to add new element definitions to your XML Schema each time your business handles a new brand of car.
• You can always use the XPath /car/brand to retrieve the brand from a particular car document. Otherwise, if brand names are tags, many different or more complicated XPath expressions are necessary.
• If you search for cars by brand then you can use XML indexes in a much simpler and more intuitive manner if the brand names are element or attribute values rather than tag names.

Table 2.2 Business Data as Tags Versus Values

Left, business data as element name (not recommended): [XML markup lost in extraction; surviving values: 5000, 1996]
Right, business data as element value (recommended): [XML markup lost in extraction; surviving values: Honda, 5000, 1996]

What happens if you use meta information, such as the terms “brand”, “price”, and “year”, as values rather than element or attribute names? This is shown in the left side of Table 2.3 where the XML document consists of very generic tag names, such as object, type, field, name, and value. These tags are not very descriptive, which is contrary to the concept of XML as a self-describing data format. You see that the brand, price, and year of the car are represented by pairs, which consist of a name and a value. However, the names are actually XML attribute values, not descriptive tag names. This approach is commonly referred to as Name/Value Pairs (NVP), Key-Value Pairs (KVP), or Entity-Attribute-Value model (EAV).

Table 2.3 Name/Value Pairs (Metadata as Tags Versus Values)

Left, metadata as values, aka Name/Value Pairs (often bad): [XML markup lost in extraction; no text content survives]
Right, metadata as element names (good): [XML markup lost in extraction; surviving values: Honda, 5000, 1996]

The Name/Value Pair approach to data modeling also sometimes appears in the relational world when a table with three columns (id, name, and value) is used. This approach may seem attractive when dealing with entities that can have hundreds or thousands of attributes, but only a small number of them apply to any individual entity. If you were to represent each possible attribute by a column in a relational table, you might exceed the maximum row length or the maximum number of columns in a table. Nevertheless, the Name/Value Pairs approach has very signiﬁcant and inherent drawbacks, which are similar for XML and relational data. In particular:

• Defining business rules and constraints for Name/Value Pairs is very difficult and often impossible. You cannot define an effective XML Schema to control and constrain this type of XML data. If you use the “better” XML format shown in the right side of Table 2.3, an XML Schema can easily specify that the value of the price element has to be greater than zero, and the value of the year element has to be a four-digit integer between 1950 and 2099. In the Name/Value Pairs in the left column of Table 2.3, price and year are represented by the same XML attribute called value. An XML Schema does not allow you to specify that if there is an attribute called name with the value price then the value of the attribute value in the same field element must be greater than zero.
• Name/Value Pairs handle all data as strings (text). Since the attribute value can contain arbitrary data values, it cannot be typed as INTEGER, DECIMAL, DATE, or TIMESTAMP. Handling all data as strings leads to data quality issues because proper data types cannot be enforced. Another consequence is that any indexes and comparisons have to treat the data values as strings. If you search for cars with a price greater than “5000”, you will also find cars with prices such as “600” or “900” because these strings are greater than the string “5000”. You can solve this problem with appropriate cast operations in your queries, but those usually preclude the use of indexes, which means performance suffers.
• Writing queries against Name/Value Pair data is very complex. As an example, assume that you need to retrieve the years of all Honda cars that have a price greater than 5000. The corresponding XPath expression for the Name/Value Pair data is shown in Figure 2.4, followed by the same query for the “regular” XML data in the right side of Table 2.3. The difference in complexity is striking, and it is even greater for more advanced search queries.
-- XPath query to retrieve the years of all Honda cars with a
-- price greater than 5000 from Name/Value Pair XML data:
/object[@type="car"
        and field[@name = "brand" and @value = "Honda"]
        and field[@name = "price" and @value > "5000"]
       ]/field[@name="year"]/data(@value)

-- Same query for regular XML Data:
/car[brand="Honda" and price > 5000]/year

Figure 2.4 Complexity when querying Name/Value Pairs
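The string-comparison pitfall and the difference in query effort can be reproduced outside the database with a small Python sketch. The Name/Value Pair document is reconstructed from the element and attribute names that appear in Figure 2.4, and the regular document from the paths /car/brand, /car/price, and /car/year; the exact original listings are an assumption. The price is deliberately set to 600 instead of 5000 so that the pitfall shows up, and the helper function names are hypothetical. Because xml.etree.ElementTree does not support comparison operators inside XPath predicates, the comparisons are done in Python.

```python
import xml.etree.ElementTree as ET

# Reconstructed from the names used in Figure 2.4; the original listing is an
# assumption, and the price of 600 is chosen to trigger the pitfall.
NVP_DOC = (
    '<object type="car">'
    '<field name="brand" value="Honda"/>'
    '<field name="price" value="600"/>'
    '<field name="year" value="1996"/>'
    "</object>"
)
REGULAR_DOC = "<car><brand>Honda</brand><price>600</price><year>1996</year></car>"


def nvp_value(obj, field_name):
    """Look up the value attribute of the field whose name attribute matches."""
    field = obj.find("field[@name='" + field_name + "']")
    return None if field is None else field.get("value")


def matches_as_strings(obj):
    """Name/Value Pairs: every value is untyped text, so '>' compares strings."""
    return nvp_value(obj, "brand") == "Honda" and nvp_value(obj, "price") > "5000"


def matches_as_numbers(car):
    """Descriptive tags: the price element can be treated as a number."""
    return car.findtext("brand") == "Honda" and float(car.findtext("price")) > 5000


string_hit = matches_as_strings(ET.fromstring(NVP_DOC))
numeric_hit = matches_as_numbers(ET.fromstring(REGULAR_DOC))
print(string_hit, numeric_hit)
```

The car priced at 600 qualifies under the string comparison but not under the numeric one, which is the data quality problem described in the second bullet above.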

2.3 CHOOSING THE RIGHT DOCUMENT GRANULARITY

When you design your XML application, and in particular your XML document structure, you may have a choice as to which business data is kept together in a single XML document. Is it better to keep a lot of data in a large XML document, or is it better to use many small documents instead? The proper scope of any given document is a critical design decision. The general recommendation is to choose an XML document granularity such that one document represents one logical business object from an application point of view. Another guideline is to use an XML document granularity that matches the anticipated predominant granularity of data access or data exchange. Very often the logical business objects match the predominant granularity of data access, so these two guidelines lead to the same result. What constitutes a small, medium, or large XML document? Very roughly, XML documents up to 50KB are typically considered small, documents between 50KB and 1MB are often considered medium, and documents of more than 1MB are considered large. Documents in the range of hundreds of Megabytes or a few Gigabytes are huge, relatively rare, and almost always the result of combining a large number of smaller XML documents. Let’s look at the example in Figure 2.5, which shows three design options to represent data for several orders. Each order has a date, a customer name, and several parts, which have a key, a quantity, and a price. Let’s assume that you have to store and manage these orders for a particular application that treats each individual order as a separate logical business object. It typically receives and processes one order at a time, and a single order is the predominant level of access or transmission. In case (a) on the left, multiple orders are combined in one large document (coarse granularity). This approach can be useful when you need to archive or FTP a certain batch of orders, such as all orders for the past week, for example. 
Storing this large document as-is in a database is only a good idea if this batch in its entirety represents a meaningful business object to your application and users. This is not the case in our example. Since our fictitious application typically reads and writes one order at a time, storing many orders in a single large document would result in suboptimal performance. In general, combining many independent business objects in a single document is not recommended. DB2 uses indexes over XML data to filter on a per-document level. Therefore, the finer the XML document granularity, the higher the potential benefit from index-based access. Although DB2 pureXML helps you avoid a lot of XML parsing in the application layer, some applications might still use a DOM parser to ingest XML documents and run into performance problems or failures if the documents are too large. Many XML design and editing tools also use DOM parsers and are often unable to handle very large XML documents. Therefore, debugging and correcting XML documents is much easier if they are small.

In case (b), each order is a separate XML document (medium granularity). This approach matches the nature of the application and not only provides good performance but is also very intuitive for the application developer. One row in the database contains one business object for the application and no joins are required to retrieve all data for this object.

Case (c) on the right represents fine granularity. Each order and each part is stored as a separate XML document. This approach can be a very good choice if each part information in itself is a separate business object of interest and often accessed and processed independently from the order it belongs to. In this example, however, part information has no real business meaning on its own and is dependent on an order. For example, the quantity and the price of a part are relevant only for a specific order. A different order can contain the same part with a different price and quantity. Typically, an application always needs to see all parts of an order and would never retrieve a part by itself without order information. Another reason why case (c) might not be useful is that having part and order information in separate documents would require joins between them. These reasons make case (b) desirable because the XML documents already represent this join in their structure.

[XML listings for Figure 2.5: markup lost in extraction. Surviving text content: (a) Doe, 5, 5.00, 11, 19.95, Doe, 5, 5.00, 11, 19.95, 23, 1.99, 1, 24.95]

Figure 2.5 Different document granularities

In a nutshell, choose the XML document granularity with respect to the logical business objects and the anticipated predominant granularity of access. When in doubt, it is usually better to lean towards ﬁner granularity and smaller XML documents.
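To make the granularity choice concrete, the following Python sketch turns a coarse batch document, as in case (a), into one document per order, as in case (b). The listings of Figure 2.5 were lost, so the element names (orders, order, date, customer, part, key, quantity, price), the id attribute, and all values are assumptions derived from the prose description of an order. The function split_orders is hypothetical application-side code and does not represent DB2's own splitting facilities.

```python
import xml.etree.ElementTree as ET

# Hypothetical batch document in the style of case (a); all names and
# values are illustrative assumptions.
BATCH = """
<orders>
  <order id="1"><date>2009-01-05</date><customer>Doe</customer>
    <part><key>5</key><quantity>11</quantity><price>19.95</price></part>
  </order>
  <order id="2"><date>2009-01-06</date><customer>Doe</customer>
    <part><key>23</key><quantity>1</quantity><price>24.95</price></part>
  </order>
</orders>
"""


def split_orders(batch_xml):
    """Return one serialized XML document per <order> element (case (b))."""
    root = ET.fromstring(batch_xml)
    docs = []
    for order in root.findall("order"):
        order.tail = None  # drop whitespace that followed the element in the batch
        docs.append(ET.tostring(order, encoding="unicode"))
    return docs


order_docs = split_orders(BATCH)
for doc in order_docs:
    print(len(doc), doc[:30])
```

Each resulting string is a complete document for one business object, which is the unit that the application in this example reads and writes.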

2.4 USING A HYBRID XML/RELATIONAL APPROACH

XML is not the grand solution for all data management problems. As discussed in Chapter 1, Introduction, XML can provide signiﬁcant advantages if the structure of your data is highly variable, evolves over time, or is hard to represent in a simple relational schema. Also, if you receive and send business objects in XML format, you can often improve performance and simplify applications if you also store these objects in XML format. Storing XML objects in XML format avoids complex mappings and costly transformations. However, sometimes the best solution is to store some of your data in relational format and some of your data in XML format, which is called hybrid storage. There are no deﬁnitive rules that describe precisely how to determine the right mix of XML and relational data. The right mix depends on the speciﬁc characteristics and requirements of a given application, or set of applications, that access the data. The following considerations can help you ﬁnd the right design for your application. It is quite common that business objects such as orders, trades, sales records, customer records, emails, and blog posts consist of a ﬁxed header plus a highly variable body. The header contains certain data ﬁelds that are common for all business objects of the same category. The body can be very different from one business object to the next and can contain any of thousands of optional attributes. For example, a ﬁnancial trade might contain a header with the trade ID, the trading date, and the IDs of the two parties involved in the trade. Although these data items are present for every trade, the elements in the body of the trade depend highly on the exact nature of this particular trade. In this case, you might want to store the header ﬁelds in relational columns and the body in an XML column of the same table. Similarly, think of XML documents such as emails, blog posts, or CRM (customer relationship management) records produced in a call center. 
CRM records often contain the customer name and identiﬁer, the date when the customer called in to report a problem, the name or ID of the product or service that the customer needs help with, and most likely a unique identiﬁer of the CRM record itself. This data is very regular and structured with well-deﬁned data types and can easily be stored in relational columns. However, the body of a CRM record typically contains semi-structured information with free text as well as interspersed data ﬁelds such as dates and a user ID to track when and by whom new information gets appended. This semi-structured part of the CRM record is better stored as a whole in an XML column.

If a business object arrives as an XML document, DB2 can extract selected element or attribute values from the document as part of the INSERT statement, without any extra XML parsing. This process is explained in more detail in Chapter 11, Converting XML to Relational Data. The benefits of storing some data fields of a business object in relational format can include the following:

• You can define primary key and foreign key constraints on relational columns, but not on any elements or attributes in an XML column.
• You can define multi-column (composite key) indexes on two or more relational columns, but you cannot define a composite key on two or more elements or attributes in an XML column.
• Relational columns can be used to define range partitioning, hash partitioning, or multidimensional clustering for a table. These cannot be defined based on elements or attributes in an XML column.
• Queries can use regular relational SQL predicates for relational columns, which some people find easier to use than XML predicates.
• If you use WebSphere Replication Server to replicate rows to another database server, you can define filtering conditions on relational columns of the source so that rows are selectively replicated only if they meet the specified condition. Such replication filters cannot be specified on XML columns.
• Relational column values can be referenced in the definition of generated columns and materialized views, but XML columns and individual XML elements and attributes cannot.
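The header-plus-body idea can be sketched in a few lines of Python. This is only an illustration of the data layout, not of DB2: SQLite (through the standard sqlite3 module) stands in for the database, a TEXT column stands in for an XML column, and the extraction happens in application code rather than inside the INSERT statement. The trade document, its element names, the table and column names, and the function insert_trade are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical trade document: a small fixed header plus a variable body.
TRADE = (
    "<trade><header><id>T1001</id><date>2009-03-02</date>"
    "<buyer>B17</buyer><seller>S42</seller></header>"
    "<body><swap><notional>1000000</notional></swap></body></trade>"
)

conn = sqlite3.connect(":memory:")
# SQLite has no XML type, so a TEXT column stands in for the XML column.
conn.execute(
    "CREATE TABLE trades ("
    " trade_id TEXT PRIMARY KEY,"
    " trade_date TEXT NOT NULL,"
    " body_xml TEXT NOT NULL)"
)


def insert_trade(connection, trade_xml):
    """Copy selected header values into relational columns; keep the body as XML."""
    root = ET.fromstring(trade_xml)
    body = ET.tostring(root.find("body"), encoding="unicode")
    connection.execute(
        "INSERT INTO trades VALUES (?, ?, ?)",
        (root.findtext("header/id"), root.findtext("header/date"), body),
    )


insert_trade(conn, TRADE)
# A plain relational predicate selects the row; the body is still XML.
row = conn.execute(
    "SELECT trade_id, body_xml FROM trades WHERE trade_date = ?", ("2009-03-02",)
).fetchone()
print(row[0])
```

The relational columns carry the primary key and the simple predicate, while the variable part of the trade stays in one piece as XML.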

2.5 SUMMARY

Designing an XML application begins with designing the XML data. The more appropriately you design your XML data for your business needs and application, the easier it will be to process and manage this XML data efﬁciently. Both your applications and your database will run best if the scope and granularity of your XML documents match the logical business objects of your application as well as the most frequent granularity of data access or data exchange. Try to favor smaller documents rather than larger documents. For the low-level design of your XML documents, keep in mind that XML elements are more ﬂexible than attributes because they can be repeated and nested. You often want to favor XML elements over attributes to ensure future extensibility of your XML data. Also, make sure that meta information that describes your data is represented by XML element and attribute names, not by values. Conversely, the actual data items that your applications need to read and manipulate should be XML element and attribute values, not XML tags. Remember the analogy to the columns in relational tables, where column names represent metadata while the column content is your business data.

Often you do not have the luxury to design your XML document format. Many XML applications are forced to consume and process XML documents in a format that has previously been designed by other parties and cannot be changed. You can still choose to let DB2 split those documents into smaller fragments if that better matches the predominant granularity of access. Additionally, it can be advantageous to extract a few selected elements or attribute values from each document into relational columns. Chapter 5, Moving XML Data and Chapter 11, Converting XML to Relational Data explain DB2’s capabilities for splitting XML documents and hybrid XML/relational storage.

CHAPTER 3

Designing and Managing XML Storage Objects

In this chapter we discuss how to create and configure a database, table spaces, and tables to manage XML data. This discussion includes topics such as hierarchical XML storage structures, XML compression and inlining, monitoring and measuring XML storage consumption, reorganization, and partitioning of tables and databases that contain XML data. The topics in this chapter are organized as follows:

• Understanding XML document trees and their pureXML storage representation. These concepts are platform independent (sections 3.1 and 3.2)
• Managing XML storage in DB2 for Linux, UNIX, and Windows (sections 3.3 through 3.10)
• Managing XML storage in DB2 for z/OS (sections 3.11 and 3.12)
• XML parsing and XML memory options specific to DB2 for z/OS (section 3.13)

When you create a database that will contain XML data, one of the first design choices is to choose a code page. The recommended code page is UTF-8 Unicode. The benefits of Unicode are explained in Chapter 20, Understanding XML Data Encoding. It is also possible to manage XML in a non-Unicode database, which allows you to easily add XML to existing databases that do not use UTF-8. DB2 9 for z/OS allows XML columns in databases and table spaces of any supported encoding. In DB2 9.5 and 9.7 for Linux, UNIX, and Windows, all new databases use UTF-8 as the default code page. However, you can specify a non-Unicode code page in the CREATE DATABASE statement, if you want.

DB2 9.1 for Linux, UNIX, and Windows is slightly more restrictive because pureXML is available only in UTF-8 encoded databases, and you must explicitly set the database code page to UTF-8 in the USING CODESET clause of the CREATE DATABASE statement:

CREATE DATABASE mydb USING CODESET utf-8 TERRITORY us

Before we discuss how XML documents are physically stored in a DB2 database, let’s look at how the XQuery Data Model deﬁnes XML document trees.

3.1 UNDERSTANDING XML DOCUMENT TREES

Since XML is a hierarchical data model, every XML document can be represented as a tree of nodes. Any query or update of XML data traverses the hierarchical structure of the XML documents. This traversal can be done most efficiently if the XML documents are physically stored in a hierarchical format. Therefore, DB2 for z/OS and DB2 for Linux, UNIX, and Windows store XML documents as trees of nodes with parent-child relationships between the nodes. These trees are defined by the XQuery Data Model (XDM) and described in this section. Further details of the XQuery Data Model are covered in Chapter 6, Querying XML Data: Introduction and XPath.

Let’s look at the XML document in Figure 3.1 as an example. It is a simple document that contains information about a customer. The outermost element, customerinfo, is called the root element. Its children are the elements name and addr as well as two occurrences of the element phone. The element addr has an attribute country as well as four child elements: street, city, state, and zip. Each phone element has an attribute called type.

<customerinfo>
  <name>Jim Noodle</name>
  <addr country="US">
    <street>555 Bailey Ave</street>
    <city>San Jose</city>
    <state>CA</state>
    <zip>95141</zip>
  </addr>
  <phone type="work">408-289-4136</phone>
  <phone type="cell">408-710-7910</phone>
</customerinfo>

Figure 3.1 Sample XML document

Figure 3.2 shows the same XML document in its tree representation. Such a tree can be constructed by parsing a textual XML document with an XML parser. In general, an XML document tree can have six different types of nodes. Element nodes, attribute nodes, text nodes, and the document node are the most common node kinds. They occur in the tree in Figure 3.2. Occasionally, XML documents can also contain comment nodes and processing instruction nodes.

Every XML element of the document in Figure 3.1 is represented by an element node in the corresponding document tree in Figure 3.2. The element nodes are white and rectangular. The textual value of each element is represented by a separate text node, shown in gray. Attribute nodes are shown with a double border. An attribute node contains all information about an attribute, including its value. The XQuery Data Model also deﬁnes that each document tree has a document node, shown in Figure 3.2 as a black circle. It is the topmost node and the parent of the root element. The document node is not visible in the textual representation of an XML document, only in its parsed hierarchical format. You will see later in this book that the document node is sometimes important when you manipulate XML documents. For example, assume you cut off the addr branch from the tree in Figure 3.2. This branch by itself does not have a document node and is therefore not a valid document tree. Hence, inserting it as a document into an XML column would fail unless you construct a new document node. Construction of a document node is shown in Chapter 5 (see section 5.7, Splitting Large XML Documents into Smaller Documents).

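[Tree diagram rendered as an indented outline]

document node
  customerinfo
    name
      text: Jim Noodle
    addr
      attribute: country=US
      street
        text: 555 Bailey Ave
      city
        text: San Jose
      state
        text: CA
      zip
        text: 95141
    phone
      attribute: type=work
      text: 408-289-4136
    phone
      attribute: type=cell
      text: 408-710-7910

Figure 3.2 XML document tree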
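The node kinds discussed for Figure 3.2, and in particular the otherwise invisible document node, can be inspected with Python's standard xml.dom.minidom module. Note the assumptions: DOM is a different API from the XQuery Data Model that DB2 uses, although it exposes comparable document, element, attribute, and text nodes, and the document is written here without whitespace between tags so that no extra whitespace text nodes appear. The variable names are hypothetical.

```python
from xml.dom import Node, minidom

# The customer document described for Figure 3.1, without whitespace
# between tags so that no whitespace-only text nodes are created.
CUSTOMER = (
    "<customerinfo><name>Jim Noodle</name>"
    '<addr country="US"><street>555 Bailey Ave</street><city>San Jose</city>'
    "<state>CA</state><zip>95141</zip></addr>"
    '<phone type="work">408-289-4136</phone>'
    '<phone type="cell">408-710-7910</phone></customerinfo>'
)

doc = minidom.parseString(CUSTOMER)
root = doc.documentElement  # the root element, customerinfo
addr = root.getElementsByTagName("addr")[0]

# The parser creates a document node above the root element.
doc_is_document_node = doc.nodeType == Node.DOCUMENT_NODE
root_parent_is_document = root.parentNode is doc
# The addr branch hangs under an element, so on its own it has no document node.
addr_parent_kind = addr.parentNode.nodeType
# The value of name lives in a separate text node; country is an attribute node.
name_text_kind = root.firstChild.firstChild.nodeType
country_kind = addr.getAttributeNode("country").nodeType
print(doc_is_document_node, root_parent_is_document)
```

Only the root element has the document node as its parent, which is why a branch such as addr is not a complete document tree by itself.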

You might wonder why element values reside in separate text nodes while attribute values do not. The main reason is that the child nodes of an element can be a mix of text nodes and other element nodes, which is known as mixed content. An attribute, however, has exactly one value and never any child nodes, which makes attributes less extensible than elements. An element can have multiple text node children but they cannot be adjacent siblings to each other. As an example of mixed content and multiple text node children, consider the following two XML documents, both of which contain a title element. In the ﬁrst case the title has a single text value and the corresponding tree representation is shown in Figure 3.3(a). The title element in the second document contains some text, “The ” and “ Cookbook” (note the spaces!), as well as a child element bold.

Figure 3.3(b) shows that this results in a mixed set of child nodes under the title element: two text nodes and one child element (bold). The two text nodes “The ” and “ Cookbook” are separated by the element bold and are not adjacent children. If they were adjacent they would automatically collapse into a single text node.

(a) <title>The DB2 pureXML Cookbook</title>
(b) <title>The <bold>DB2 pureXML</bold> Cookbook</title>

(a) title
      text: "The DB2 pureXML Cookbook"

(b) title
      text: "The "
      bold
        text: "DB2 pureXML"
      text: " Cookbook"

Figure 3.3 An example of mixed content

Note the XQuery Data Model deﬁnes the value of an XML element as the concatenation of all text nodes in the subtree under that element. This concatenation is trivial for elements that have only one text node. The value of the element state in Figure 3.2 is “CA”, and the value of title in Figure 3.3(a) is “The DB2 pureXML Cookbook”. At the same time, the value of the title element in Figure 3.3(b) is also “The DB2 pureXML Cookbook”, and the value of the element bold is “DB2 pureXML”. Similarly, the value of the addr element in Figure 3.2 is “555 Bailey AveSan JoseCA95141” (note that there is no space between Ave and San and also no space between Jose and CA and 95141). The addr element is called a non-leaf element, and this example shows that values of non-leaf elements are often not useful.
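The rule that an element's value is the concatenation of all text nodes beneath it is easy to check with Python's xml.etree.ElementTree, whose itertext() method walks the text of a subtree in document order. The helper name element_value is hypothetical, and the documents are written without whitespace between tags, an assumption that matters because any such whitespace would become part of the concatenated value.

```python
import xml.etree.ElementTree as ET

# Documents written without whitespace between tags (an assumption; any
# indentation would show up in the concatenated values).
CUSTOMER = (
    "<customerinfo><name>Jim Noodle</name>"
    '<addr country="US"><street>555 Bailey Ave</street><city>San Jose</city>'
    "<state>CA</state><zip>95141</zip></addr>"
    '<phone type="work">408-289-4136</phone>'
    '<phone type="cell">408-710-7910</phone></customerinfo>'
)
MIXED_TITLE = "<title>The <bold>DB2 pureXML</bold> Cookbook</title>"


def element_value(elem):
    """Concatenate all text in the subtree under elem, in document order."""
    return "".join(elem.itertext())


customer = ET.fromstring(CUSTOMER)
state_value = element_value(customer.find("addr/state"))
addr_value = element_value(customer.find("addr"))

title = ET.fromstring(MIXED_TITLE)
title_value = element_value(title)
bold_value = element_value(title.find("bold"))
print(repr(addr_value), repr(title_value), repr(bold_value))
```

Attribute values such as country="US" do not contribute to the element value, and the run-together result for addr shows why values of non-leaf elements are rarely useful.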

3.2 UNDERSTANDING PUREXML STORAGE

The document tree in Figure 3.2 illustrates the hierarchical format in which XML documents are stored in DB2 (all platforms). When an XML document in its textual format is inserted or loaded into an XML column, the DB2 server parses the XML document to produce the parsed hierarchical format that is stored on pages in a table space. This process is reversed when an application retrieves an XML document from DB2. This reverse process is called serialization; that is, the document tree is converted back into the text format of the XML document. You can think of parsing and serialization as inverse operations.
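Parsing and serialization as a pair of inverse operations can be illustrated with Python's xml.etree.ElementTree. This only demonstrates the general concept with a client-side library; it says nothing about how DB2 implements either step. The input deliberately uses single quotes around the attribute value, an assumption chosen to show that, in this library, the serialized text can differ from the input text while the parsed tree stays the same.

```python
import xml.etree.ElementTree as ET

# Textual XML as an application might send it (single-quoted attribute).
TEXT_IN = "<phone type='work'>408-289-4136</phone>"

tree = ET.fromstring(TEXT_IN)                     # parsing: text to tree
text_out = ET.tostring(tree, encoding="unicode")  # serialization: tree to text
reparsed = ET.fromstring(text_out)                # parsing the serialized text again

# The tree survives the round trip even if the text is not character-identical.
same_tree = (
    reparsed.tag == tree.tag
    and reparsed.attrib == tree.attrib
    and reparsed.text == tree.text
)
print(text_out, same_tree)
```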

The exact shape of a document tree in DB2’s storage layer depends on and can vary with each individual instance document. It is not pre-defined based on an XML Schema, which allows DB2 to store documents with widely varying or evolving structures in the same XML column. DB2 performs a variety of optimizations when storing document trees on pages. For example, element and attribute names (also called tag names) are transparently replaced by unique 4-byte integer numbers. Thus, DB2’s internal tree format actually looks more like Figure 3.4 than Figure 3.2. In addition to the integer number, each node can also contain other properties, such as information about namespaces and data types.

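[Tree diagram rendered as an indented outline]

document node
  100
    101
      text: Jim Noodle
    102
      attribute: 103=US
      104
        text: 555 Bailey Ave
      109
        text: San Jose
      106
        text: CA
      113
        text: 95141
    116
      attribute: 110=work
      text: 408-289-4136
    116
      attribute: 110=cell
      text: 408-710-7910

Figure 3.4 XML document tree with tag names replaced by integer values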

The mapping from tag names to the so-called stringIDs is kept in the catalog table sysibm.sysxmlstrings (see Figure 3.5). This mapping is database-wide, where each distinct tag name and each distinct namespace URI has exactly one entry. For example, the phone element occurs twice in the sample document and may occur millions of times across all the XML documents in a database. Each occurrence is replaced with the same unique stringID, which is 116 in this example. Hence, the phone element has only one entry in the mapping table. Consequently, the mapping table is never larger than the number of distinct tag names and namespace URIs in the database, which is typically a small number (several hundred to several thousand).

STRING          STRINGID   IS_TEMPORARY
customerinfo    100        N
name            101        N
addr            102        N
country         103        N
street          104        N
city            109        N
state           106        N
zip             113        N
phone           116        N
type            110        N
…               …          …

Figure 3.5 Mapping tag names to integers in sysibm.sysxmlstrings

When a document is inserted and parsed, DB2 checks every tag name to see whether it is already recorded in this mapping table. If it is not, a new entry is added to the mapping table. Otherwise the existing stringID for the tag is used. Hence, inserts into the mapping table are very rare and occur only for new elements that DB2 has never seen before in a given database. For example, if you insert a million documents of similar structure, there is a good chance that only the ﬁrst document, or the ﬁrst few documents, actually cause inserts into the sysibm.sysxmlstrings catalog table. Most of the time the mapping table is active as a lookup table and DB2 has a special purpose mechanism and cache to ensure high lookup performance. DB2’s use of the mapping table leads to signiﬁcant performance beneﬁts. First of all, it reduces the space that is required to represent XML on pages in table spaces or buffer pools. Second, any query evaluation and traversal of XML documents now operate on integers, not on strings, which is much faster. Since the sysibm.sysxmlstrings table never grows very large, DB2 never deletes or updates any entries in this table. This avoids lock contention on this table and enables high performance. Even REORG or LOAD REPLACE of a user table does not reset the mapping table. Remember that the mapping table contains entries for XML documents in the entire database, and not just for XML documents in a single table. Excessive growth of the mapping table is not a concern, because XML applications do not use an unbounded number of distinct tag names.
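The behavior of the mapping table, one entry per distinct name and inserts only for names never seen before, can be mimicked with a small Python class. This is a conceptual toy and not DB2's internal implementation: the class StringTable, its methods, the starting ID of 100, and the sequential assignment of IDs are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


class StringTable:
    """Toy database-wide mapping from tag names to integer IDs."""

    def __init__(self, first_id=100):
        self._ids = {}
        self._next_id = first_id
        self.inserts = 0  # how often a previously unseen name was registered

    def string_id(self, name):
        """Return the existing ID for name, registering it on first sight."""
        if name not in self._ids:
            self._ids[name] = self._next_id
            self._next_id += 1
            self.inserts += 1
        return self._ids[name]

    def register_document(self, xml_text):
        """Look up (or register) every element and attribute name in a document."""
        ids = []
        for elem in ET.fromstring(xml_text).iter():
            ids.append(self.string_id(elem.tag))
            ids.extend(self.string_id(attr) for attr in elem.attrib)
        return ids


table = StringTable()
doc = (
    '<customerinfo><phone type="work">1</phone>'
    '<phone type="cell">2</phone></customerinfo>'
)
first_ids = table.register_document(doc)
inserts_after_first = table.inserts
second_ids = table.register_document(doc)  # pure lookups, no new entries
print(first_ids, inserts_after_first, table.inserts)
```

Both phone elements receive the same ID, and inserting a second document of the same structure adds nothing to the table, which matches the lookup-dominated usage described above.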


The mapping table is really only for DB2’s internal operation and you cannot modify it. You can, however, read this table if you want to get a list of all tag names that ever existed in the database (Figure 3.6). Since version 9.5, DB2 for Linux, UNIX, and Windows stores the tags in a binary format to avoid code page problems in non-Unicode databases. Therefore you need to use the function xmlbit2char to make the strings human-readable.

-- DB2 for z/OS and DB2 9 for Linux, UNIX, Windows:
SELECT * FROM sysibm.sysxmlstrings;

-- DB2 for Linux, UNIX, and Windows, Version 9.5 and higher:
SELECT stringid, substr(sysibm.xmlbit2char(string),1,50), is_temporary
FROM sysibm.sysxmlstrings;

Figure 3.6

Reading XML tag names from sysibm.sysxmlstrings

The column IS_TEMPORARY in sysibm.sysxmlstrings only exists in DB2 for Linux, UNIX, and Windows. It indicates whether a tag name belongs to a document that is stored in an XML column (IS_TEMPORARY = 'N') or to an element or attribute that has been newly constructed as part of a query (IS_TEMPORARY = 'Y'). For example, a query that creates and returns a new element name that has never been seen in the database before also causes a new entry in the string table. However this happens only upon its very ﬁrst execution, after which the new tag is registered and known. You cannot delete or update entries in this catalog table.
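To make the role of the IS_TEMPORARY column more concrete, the following sketch looks up the stringID of a single tag name and counts the entries by origin. It assumes DB2 9.5 or higher for Linux, UNIX, and Windows and assumes that the result of xmlbit2char can be compared like an ordinary character string; the tag name 'phone' is taken from the sample in Figure 3.5.

```sql
-- Look up the stringID for one tag name (a sketch, DB2 9.5+ for LUW).
SELECT stringid, is_temporary
FROM sysibm.sysxmlstrings
WHERE sysibm.xmlbit2char(string) = 'phone';

-- Count how many entries stem from stored documents ('N')
-- and how many were registered by constructing queries ('Y').
SELECT is_temporary, COUNT(*) AS entries
FROM sysibm.sysxmlstrings
GROUP BY is_temporary;
```

Both statements only read the catalog table, which is consistent with the rule that its entries cannot be updated or deleted.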

3.3

XML STORAGE IN DB2 FOR LINUX, UNIX, AND WINDOWS

This and the following sections describe storage objects, such as tables and table spaces, for XML data in DB2 for Linux, UNIX, and Windows. DB2 for z/OS uses similar but slightly different concepts, which are discussed in sections 3.11 through 3.12.

3.3.1

Storage Objects for XML Data

Whenever you deﬁne a table, DB2 creates one or multiple storage objects in a table space. For example, a relational table structure is stored in a DAT (data) object. Any kind of index is stored as an INX object. If your table contains a LOB column, DB2 creates a separate LOB object. And, if your table contains one or multiple XML columns, there is an XDA (XML data area) object. For SMS (system-managed space) table spaces, these objects appear as separate ﬁles in the ﬁle system. For DMS (database-managed space) table spaces, which are the default and recommended, these objects are not visible but nevertheless exist in the DMS containers.


WHAT IS A TABLE SPACE? A table space is a storage structure that can contain relational tables and indexes as well as large objects (LOBs) and XML data. Table spaces enable you to specify where your data is physically stored. They also allow you to assign different types of data to different buffer pools in main memory, or to back up and restore speciﬁc parts of your database.

Let’s look at this CREATE TABLE command as an example. Note that no XML Schema is required to define a table with a column of type XML; DB2’s XML storage is independent of any particular XML Schema.

CREATE TABLE customer (id INTEGER, info XML)

The storage objects that DB2 creates and maintains for this table are illustrated in Figure 3.7. The table with two columns is maintained in a DAT object. The XML column in this table does not contain the actual XML documents that are inserted, but just logical pointers to them. The reason is that XML documents can easily be too big to ﬁt into a relational row on a single page. This approach is similar to the storage of large objects (LOBs) in DB2. The main difference between XML and LOBs is that XML is buffered in the buffer pool whereas LOBs are not. By default, XML documents are stored in the XDA object. If a table has multiple XML columns, all of them share the same XDA object. Whenever a document tree does not ﬁt on a single page, DB2 automatically and transparently breaks the tree into multiple subtrees, which are called regions. Each region is then stored on a separate XDA page so a single document can span many pages. Documents that ﬁt on a single page consist of a single region. If documents are much smaller than the page size, multiple regions (documents) can be stored on a single page so that no space is wasted. DB2 allows you to store XML documents up to 2GB in size, which is large enough for just about every application. One regions index is created automatically by DB2 for each table that contains one or more XML columns. In the catalog view syscat.indexes, every regions index is identiﬁed by the value XRGN in the column INDEXTYPE. It is not a user-deﬁned index and you cannot drop it. The regions index contains one entry for each region of a document. If a document consists of multiple regions, then these regions are represented by consecutive regions index entries. An XML document pointer in the XML column in the DAT object points to a regions index entry that in turn points to the “ﬁrst” region of the corresponding document. This is the region that contains the root node of the document. 
A short range scan on the regions index then provides pointers to the remaining regions of the document. If a node A in a region has a child node B that is the topmost node of another region, node A contains information that points back into the regions index (not shown in Figure 3.7). It points to the regions index entry that leads to the region with node B. Also not shown in Figure 3.7 is that DB2 maintains a path index for every XML column. It contains one entry per unique path in the XML data and is therefore very small. More details on the path index can be found in Chapter 13, Defining and Using XML Indexes.

[Figure: a table space holding the DAT object (table with columns ID (INT) and INFO (XML); ID values 1001, 1000, 1003, 1005), the INX object with the regions index pages, and the XDA object with its pages]

Figure 3.7

Storage objects involved with an XML column
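If you want to confirm that DB2 has created a regions index for a table, you can query the catalog view syscat.indexes for the INDEXTYPE value XRGN mentioned above. The following is a minimal sketch; it assumes the customer table from the earlier CREATE TABLE example, whose unquoted name is stored in uppercase in the catalog.

```sql
-- List the system-generated regions index of the CUSTOMER table.
SELECT indschema, indname, indextype
FROM syscat.indexes
WHERE tabname = 'CUSTOMER'
  AND indextype = 'XRGN';
```

You would expect one row per table that contains one or more XML columns, regardless of how many XML columns the table has.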

Storing large documents as regions across pages has several advantages. First and foremost, DB2’s proven infrastructure for managing pages works for XML data just like for relational data. This includes table spaces, buffer pools, page cleaning, backup and restore, recovery, HADR, and so on. If a document is large and spans many XDA pages and a query touches only part of the document, DB2 does not necessarily need to bring all pages of the document into the buffer pool. DB2 always strives to split a document into the smallest possible number of regions. The regions for one document are in most cases stored on physically consecutive pages. The way XML documents are broken into regions is completely transparent to the application and to the DBA. You should never attempt to design XML documents with the goal of optimizing any aspect of how DB2 stores the documents. You should model your XML data at the logical level to reﬂect your business data and focus on the characteristics and requirements of your application, not on how DB2 processes XML. Most applications are best served with large numbers of small documents, where each XML document represents a separate business object.

3.3.2

Defining Columns, Tables, and Table Spaces for XML Data

In DB2 for Linux, UNIX, and Windows, database-managed table spaces (DMS) provide higher performance than system-managed table spaces (SMS) for relational data, and even more so for XML read and write access. Since DB2 9, newly created table spaces are DMS by default. It is also recommended to use DMS table spaces with automatic storage so that they grow as needed without manual intervention.

A key aspect of physical database design is the page size of a table space. Measurements have shown that the lower the number of regions (splits) per XML document the better the performance, especially for XML insert and full-document retrieval. If a document does not fit on a single page, the number of splits per document depends on the page size (4KB, 8KB, 16KB, or 32KB). The larger the page size of the table space the lower the number of regions per document. For example, let’s say a given document gets split into forty regions across forty 4KB pages. Then it might be possible to store the same document on only twenty 8KB pages, or ten 16KB, or five 32KB pages. If the XML documents are significantly smaller than the selected page size, no space is wasted because multiple small documents can be stored on a single page. The impact of the page size on the number of regions per document is illustrated in Figure 3.8. Since each region requires one regions index entry, a larger page size that allows for fewer regions per document also leads to a smaller regions index.

[Figure: the same document split into regions on 4K pages, 8K pages, …, and 32K pages]

Figure 3.8

The number of regions per document depends on the page size
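The arithmetic behind the forty/twenty/ten/five example can be sketched as a simple division. The query below is only a back-of-the-envelope model, not how DB2 computes regions: it assumes a document that needs roughly 160KB in DB2's internal format (a hypothetical figure chosen to match the example) and ignores page headers and other per-page overhead.

```sql
-- Rough estimate: regions per document = document size / page size, rounded up.
-- The 160KB document size is an assumption for illustration only.
SELECT pagesize_kb,
       INTEGER(CEILING(160.0 / pagesize_kb)) AS est_regions
FROM (VALUES (4), (8), (16), (32)) AS p(pagesize_kb);
```

Under these assumptions the estimates are 40, 20, 10, and 5 regions, and the regions index holds one entry for each of them.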

NOTE

Most XML applications perform best using 16KB or 32KB pages.

16KB pages can provide good performance if most documents are quite small (for example, less than 4KB) so that several documents fit on a page. Larger documents are better served by 32KB pages. If you prefer to use a single page size for XML and relational data, or for data and indexes, and you find that 32KB pages are too large for efficient access to relational data or indexes, then 16KB pages can be a good compromise.

Let’s look at some examples. Figure 3.9 shows how to define two table spaces, one with 4KB pages and one with 32KB pages. These table spaces are used in the subsequent CREATE TABLE statements and figures.

CREATE BUFFERPOOL bpsmall PAGESIZE 4K;
CREATE BUFFERPOOL bplarge PAGESIZE 32K;
CREATE TABLESPACE tbspace4k PAGESIZE 4K MANAGED BY AUTOMATIC STORAGE BUFFERPOOL bpsmall;
CREATE TABLESPACE tbspace32k PAGESIZE 32K MANAGED BY AUTOMATIC STORAGE BUFFERPOOL bplarge;

Figure 3.9

Creating table spaces with different page sizes
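After creating table spaces like these, it can be useful to verify which page size and buffer pool each one actually uses. The following sketch joins the catalog views syscat.tablespaces and syscat.bufferpools; it assumes the table space names from Figure 3.9, which are stored in uppercase because they were not quoted.

```sql
-- Show page size (in bytes) and buffer pool for the two table spaces.
SELECT t.tbspace, t.pagesize, b.bpname
FROM syscat.tablespaces t
JOIN syscat.bufferpools b
  ON b.bufferpoolid = t.bufferpoolid
WHERE t.tbspace IN ('TBSPACE4K', 'TBSPACE32K');
```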

The CREATE TABLE statement shown in Figure 3.10 defines a table with an integer column and an XML column using the table space with 32KB pages. It places XML data and relational data into the same table space (see Figure 3.7). Consequently, they use the same page size and are buffered in the same buffer pool. This default layout provides good performance for most applications.

CREATE TABLE customer(id INTEGER, info XML) IN tbspace32k;

Figure 3.10

Creating a table with an XML column in a named table space

If you have done a performance analysis and find that you need a large page size for XML data but a small page size for relational data or indexes, you can use separate table spaces to achieve this. When you define a table, you can direct “long” data (LOB and XML data) into a separate table space with a different page size. The corresponding table definition and storage objects are shown in Figure 3.11 and Figure 3.12, respectively. In this example, relational data is stored in a table space tbspace4k with page size 4KB and XML data is stored in a table space tbspace32k with page size 32KB. If the table also contained a LOB column, the LOB data would be stored in a separate LOB object in the table space tbspace32k. Pages of the LOB object are not buffered in the buffer pool, whereas pages of the DAT, XDA, and INX objects are buffered.

CREATE TABLE customer(id INTEGER, info XML) IN tbspace4k LONG IN tbspace32k;

Figure 3.11

Storing XML and LOBs in a separate table space

[Figure: table space tbspace4k holds the DAT object (columns ID (INT) and INFO (XML); ID values 1001, 1000, 1003, 1005) and the INX object with the regions index pages; table space tbspace32k holds the XDA object with its pages]

Figure 3.12

Storage objects in separate table spaces

If you had another table space named tbspace4kINX you could also direct the regions index as well as any user-deﬁned indexes into their own table space. This layout is shown in Figure 3.13 and Figure 3.14.


CREATE TABLE customer(id INTEGER, info XML) IN tbspace4k INDEX IN tbspace4kINX LONG IN tbspace32k;

Figure 3.13

Deﬁning separate storage for indexes and XML data
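To check where DB2 has placed the different storage objects of a table, you can read the table space assignments from the catalog view syscat.tables. This is a minimal sketch that assumes the customer table from Figure 3.13.

```sql
-- Table spaces for relational data, indexes, and long (LOB and XML) data.
SELECT tabname, tbspace, index_tbspace, long_tbspace
FROM syscat.tables
WHERE tabname = 'CUSTOMER';
```

For a table that was created without INDEX IN or LONG IN clauses, the corresponding columns are typically NULL, which indicates that everything resides in the table space named in TBSPACE.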

[Figure: table space tbspace4k holds the DAT object (columns ID (INT) and INFO (XML); ID values 1001, 1000, 1003, 1005); table space tbspace4kINX holds the INX object with the regions index pages; table space tbspace32k holds the XDA object with its pages]

Figure 3.14

Separate table spaces for relational data, XML, and indexes

In general, the fewer distinct page sizes and buffer pools you create the easier it is to tune and maintain your database. Therefore we recommend that you use different page sizes for XML and relational data only if you have evidence that it improves the performance of your workload and if you need this performance gain to meet the performance requirements of your application. Otherwise there is beneﬁt in keeping it simple. Dedicated measurements in a prototype and test workload can help you make such decisions. Since DB2 9, new table spaces are by default large table spaces, in which the number of rows per page is no longer limited to 255. Hence, you don’t need to choose a small page size for relational data to ensure that pages are ﬁlled up and space isn’t wasted.

3.3.3

Dropping XML Columns

In DB2 9.1 and DB2 9.5 for Linux, UNIX, and Windows you cannot drop XML columns from a table. To remove an XML column, create a new table without the XML column and use a “load from cursor” to move data from the old table to the new table. Then drop the old table and rename the new table so that it assumes the name of the old table. Alternatively, you can export data from a table and then recreate and reload the table. DB2 9.7 for Linux, UNIX, and Windows allows you to drop XML columns from a table with the ALTER TABLE statement. If a table contains multiple XML columns you can only drop all XML columns at the same time.
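The two approaches described above can be sketched as follows. The statements are a mix of SQL and DB2 command line processor commands (DECLARE CURSOR, LOAD, REORG); the table name customer_new is hypothetical, and the sketch assumes the two-column customer table used earlier in this chapter. Depending on your logging configuration, LOAD may require additional options, and a REORG is typically needed after dropping a column before the table is fully accessible again.

```sql
-- DB2 9.1 and 9.5: copy the table without its XML column.
-- customer_new is a hypothetical name for the replacement table.
CREATE TABLE customer_new (id INTEGER) IN tbspace32k;

DECLARE c1 CURSOR FOR SELECT id FROM customer;
LOAD FROM c1 OF CURSOR INSERT INTO customer_new;

DROP TABLE customer;
RENAME TABLE customer_new TO customer;

-- DB2 9.7: drop the XML column directly, then reorganize the table.
ALTER TABLE customer DROP COLUMN info;
REORG TABLE customer;
```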

3.3.4

Improved XML Storage Format in DB2 9.7

DB2 9.7 uses a more optimized tree format for XML storage than prior releases. This improved format is completely transparent to all database operations such as queries, inserts, updates, indexing, and schema validation. The improved XML format is used only in tables that are created in DB2 9.7 or higher. When you migrate a table with XML data from DB2 9 or 9.5 to DB2 9.7, this XML data remains in its previous format and is not changed. Documents that you newly insert or update in such a migrated table continue to be in the format of the previous DB2 release. The previous and the improved storage format are not mixed within the XDA object of a table.

The new storage format has the following benefits:
• It is more compact and can reduce the space consumption of your XML data.
• It allows compression of XML data in the XDA object (see section 3.5).
• It allows you to use the function ADMIN_EST_INLINE_LENGTH to estimate the inline length that would allow an XML document to be inlined (see section 3.4).
• It enables faster redistribution of XML data in a partitioned database; that is, you can use the NOT ROLLFORWARD RECOVERABLE option in the REDISTRIBUTE command to redistribute data in bulk and avoid logging.

If you have migrated a table with XML data from DB2 9 or 9.5 to DB2 9.7 and want to bring the XML data into the new format, you need to create a new table and copy the data from the old to the new table. You can use “load from cursor” for moving data from one table to another efficiently. Then you can drop the old table and rename the new table to the old table name.

Starting with DB2 9.7, copying and renaming a table can be done more elegantly and with minimal downtime by using the procedure SYSPROC.ADMIN_MOVE_TABLE. This procedure performs an online table move, which means that table data is copied to a table object with the same name, but not necessarily the same columns and storage characteristics. When the copying is complete, the source table is briefly taken offline and its name is assigned to the new copy of the table. All indexes of the table are also copied. During the copy phase, any updates, inserts, or deletes on the source table are collected in a staging table and finally applied to the new table. An online table move with XML data requires that the table has at least one unique index and does not participate in foreign key constraints.
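A call to the procedure for an in-place move might look like the following sketch. The schema name MYSCHEMA is hypothetical, and the parameter order shown here reflects our understanding of the DB2 9.7 signature; please verify it against the documentation for your release before relying on it. Empty strings leave the corresponding property of the table unchanged.

```sql
-- Online move of the CUSTOMER table into a new copy of itself (a sketch).
CALL SYSPROC.ADMIN_MOVE_TABLE(
  'MYSCHEMA',   -- schema of the source table (hypothetical name)
  'CUSTOMER',   -- name of the source table
  '',           -- table space for data: unchanged
  '',           -- table space for indexes: unchanged
  '',           -- table space for long data (LOB and XML): unchanged
  '',           -- MDC columns: unchanged
  '',           -- partitioning key columns: unchanged
  '',           -- data partitions: unchanged
  '',           -- column definitions: unchanged
  '',           -- options: none
  'MOVE');      -- perform all phases of the move in one call
```

Because the target copy is created under DB2 9.7, a move of this kind is one way to bring migrated XML data into the improved storage format without a separate drop and rename step.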

3.4

USING XML BASE TABLE ROW STORAGE (INLINING)

From DB2 9.5 for Linux, UNIX, and Windows onwards, XML documents that are small enough to fit on a single page can be stored on the same page as the relational row that they belong to. This capability is called base table row storage, or inlining. It means that the tree structure of an XML document is no longer stored on a separate XDA page, but next to the relational data inside the DAT object in the table space (Figure 3.16). XML inlining is currently not available in DB2 for z/OS.

Inlining needs to be explicitly enabled as a column option because it may or may not provide performance benefits. Before we discuss the performance trade-offs, Figure 3.15 shows how to create a table with inlined XML storage. You can add a column option INLINE LENGTH to the definition of an XML column. In this example, any XML document that can be stored within 30,000 bytes is inlined. Documents that require more than 30,000 bytes are stored in the regular way (on separate XDA pages). The inlining of some or all documents is handled by the DB2 engine and completely transparent to the application.

DB2’s decision about whether a given document is within the inline length is based on the size of the document in DB2’s internal tree format, after XML parsing. The decision is not based on the length of the textual (serialized) representation of the XML document. Inlined documents can be compressed, but the inlining decision is based on their space requirement prior to compression.

CREATE TABLE customer(id INTEGER, info XML INLINE LENGTH 30000) IN tbspace32k;

Figure 3.15

Table deﬁnition with inlined XML storage

The maximum allowed value for the inline length depends on the page size of the table space. As a rule of thumb, the inline length has to be less than the page size minus the total length of the other columns in the table and the overhead for the page header, and so on. For example, the maximum possible inline length in the example in Figure 3.15, where the table also contains an integer column and uses 32KB pages, is 32667 bytes. If an XML document is updated it might become larger or smaller as a result of the update, which affects inlining. The update may cause a previously inlined document to be moved from the DAT object to the XDA object, or vice versa.

Figure 3.16 illustrates the storage objects in the table space when XML inlining is used. Three of the four documents meet the inline length and are now stored as part of the relational rows on pages in the DAT object. They do not have regions index entries. The document that belongs to the second row (id = 1000) is too large to be inlined. It is stored in the XDA object and spans three pages, which are linked from the row in the DAT object via the regions index. Note that inlining makes the DAT object larger, with larger and fewer rows per page. The XDA object has become smaller and the regions index has fewer entries than without inlining.

[Figure: table space tbspace32k with inlining; the rows with ID values 1001, 1003, and 1005 hold their XML documents inline on pages of the DAT object, while the row with ID 1000 points through the regions index in the INX object to pages in the XDA object]

Figure 3.16

Storage objects with XML inlining

The CREATE TABLE statement in Figure 3.17 creates the customer table in table space tbspace4k, allows documents up to 3500 bytes to be inlined, and automatically directs larger documents to the table space tbspace32k. In this case the inlining takes precedence over the LONG IN clause. If a document is small enough to be inlined it will be part of the base table row and stored on DAT pages in tbspace4k. Otherwise it is stored on XDA pages in tbspace32k.

CREATE TABLE customer(id INTEGER, info XML INLINE LENGTH 3500) IN tbspace4k LONG IN tbspace32k;

Figure 3.17

Another table deﬁnition with inlined XML storage

The inline length of an XML column can be changed with an ALTER TABLE statement, as shown in Figure 3.18. This allows you to increase the inline length of an XML column, or to enable inlining for an XML column that wasn’t previously deﬁned with inlining.


ALTER TABLE customer ALTER COLUMN info SET INLINE LENGTH 3600;

Figure 3.18

Changing the inline length of an XML column
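To confirm which inline length is currently in effect for a column, you can read it from the catalog view syscat.columns. This sketch assumes the customer table and its info column from Figure 3.17 and Figure 3.18.

```sql
-- Show the inline length (in bytes) defined for the XML column.
SELECT colname, typename, inline_length
FROM syscat.columns
WHERE tabname = 'CUSTOMER'
  AND colname = 'INFO';
```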

The ALTER TABLE operation does not affect existing documents in the table, only documents that are inserted, loaded, or updated after the ALTER TABLE statement has been issued. If you want existing documents to obey the newly set inline length, you need to update them with themselves, as shown in Figure 3.19. Be aware that a bulk update of many XML documents can require a lot of log space. You might have to perform a series of smaller updates and commit frequently to avoid running out of log space. After you use an UPDATE statement to move XML data from the XDA object to the DAT object, you might want to reorganize the table to reclaim the freed-up space in the XDA object (see section 3.7). However, reorganization by itself does not move XML data from the XDA object to the DAT object.

UPDATE customer SET info = info;

Figure 3.19

Updating existing documents to apply inlining
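One way to perform "a series of smaller updates" is to restrict each UPDATE to a range of key values and commit after each range. The following is a minimal sketch; the id ranges are hypothetical and need to be adapted to the key distribution and the available log space in your system.

```sql
-- Apply the new inline length in batches to limit log space per transaction.
UPDATE customer SET info = info WHERE id BETWEEN 1000 AND 1999;
COMMIT;

UPDATE customer SET info = info WHERE id BETWEEN 2000 AND 2999;
COMMIT;

-- ...continue with further ranges until all rows have been processed.
```

Smaller ranges reduce the log space needed per transaction at the cost of more statements and commits.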

After you have specified an inline length for an XML column, you can only increase the inline length, not reduce it. The only way to “undo” the inlining of XML documents is to copy the documents into a new table without inlining, drop the old table, and rename the new table to the old table name. Starting with DB2 9.7 you can also do this copying with the procedure SYSPROC.ADMIN_MOVE_TABLE.

3.4.1

Monitoring and Conﬁguring XML Inlining

After you have set the inline length for an XML column, any newly inserted or updated document is inlined if DB2’s internal tree representation of the document ﬁts within the speciﬁed inline length. The size of an XML document in DB2’s internal tree format depends on the actual document characteristics, such as the length of element names, the length of element values, the presence of namespaces, and other factors. In particular, the space required to store a document in an XML column might be less than or greater than the size of the document in its textual representation. In DB2 9.5 and higher, the space requirement of most XML documents is between 70% and 150% of the space that they occupy in the ﬁle system. Therefore predicting whether a particular document will or will not be inlined can be difﬁcult. Similarly, choosing an inline length that allows inlined storage of all or most documents can also be difﬁcult. To address this problem, DB2 9.7 for Linux, UNIX, and Windows introduced the scalar functions ADMIN_IS_INLINED and ADMIN_EST_INLINE_LENGTH.


The function ADMIN_IS_INLINED takes an XML column name as input, and returns
• 1 if the document in the current row of the XML column is inlined.
• 0 if the document in the current row of the XML column is not inlined.
• NULL if the XML column of the current row is NULL.

The query in Figure 3.20 shows how the function ADMIN_IS_INLINED can be used to examine a table with inlining, like the one defined previously in Figure 3.17. The query reveals for every document in the table whether or not it is inlined. The output indicates that the documents with the relational id values 1000 and 1002 are inlined while the other documents are not inlined.

SELECT id, ADMIN_IS_INLINED(info) AS inlined FROM customer;

ID               INLINED
---------------- ----------------
1000             1
1001             0
1002             1
1003             0
1004             0
1005             0

6 record(s) selected.

Figure 3.20

Determining which documents are inlined

Since the query in Figure 3.20 can produce a lot of output when applied to a large table, you may want to add a WHERE clause to retrieve the inlining status only for a subset of documents. Figure 3.21 uses the ADMIN_IS_INLINED function to compute the number of documents that are inlined as well as the number of those that are not. The subselect in Figure 3.21 uses the clause FETCH FIRST 1000 ROWS ONLY to obtain inlining information based on at most 1,000 documents. This can be useful if the input table is large and you want to use the ﬁrst 1,000 documents as a representative sample rather than scanning the entire table. Alternatively, you could use the keywords TABLESAMPLE BERNOULLI(n) in the FROM clause of the subselect to sample n% of all rows in the table.


SELECT COUNT(*) AS doc_count,
       CASE WHEN inlined = 1 THEN 'Yes' ELSE 'No' END AS inlined
FROM (SELECT ADMIN_IS_INLINED(info) AS inlined
      FROM customer
      FETCH FIRST 1000 ROWS ONLY)
GROUP BY inlined;

DOC_COUNT        INLINED
---------------- ----------------
2                Yes
4                No

2 record(s) selected.

Figure 3.21

Obtaining the number of inlined documents
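The sampling alternative mentioned for Figure 3.21 could be written as in the following sketch, which replaces FETCH FIRST 1000 ROWS ONLY with TABLESAMPLE BERNOULLI. The sampling rate of 10 percent and the names is_inl and s are choices made for this illustration only.

```sql
-- Count inlined and non-inlined documents based on a 10% row sample.
SELECT COUNT(*) AS doc_count,
       CASE WHEN is_inl = 1 THEN 'Yes' ELSE 'No' END AS inlined
FROM (SELECT ADMIN_IS_INLINED(info) AS is_inl
      FROM customer TABLESAMPLE BERNOULLI(10)) AS s
GROUP BY is_inl;
```

Because Bernoulli sampling picks rows at random, the counts vary between executions and should be read as proportions rather than totals. Rows whose XML column is NULL form a separate group that is also labeled 'No' in this sketch.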

The result in Figure 3.21 shows that only two out of six examined documents are inlined. This raises the question of how much you would need to increase the inline length so that most or all of the documents can be inlined. Similarly, you might have a table with an XML column for which inlining is not yet enabled. You might wonder which inline length to use so that most or all of the documents in that column get inlined. The function ADMIN_EST_INLINE_LENGTH is designed to answer these questions.

The function ADMIN_EST_INLINE_LENGTH takes an XML column name as input, and returns
• The lowest inline length (in bytes) that would allow the XML document in the current row to be inlined. This is an estimated value.
• -1, if the document in the current row of the XML column is too large to be inlined for the given page size.
• -2, if the required inline length cannot be estimated for the document in the current row. This is the case for any documents that have been inserted and stored prior to DB2 9.7 because DB2 9.7 uses a more optimized XML storage format (see section 3.3.4).
• NULL, if the XML column of the current row is NULL.

Figure 3.22 shows sample output of the function ADMIN_EST_INLINE_LENGTH. The values returned depend on the actual XML data in the table. In this example, the output shows that the first document (relational id = 1000) is already inlined and its actual size in DB2’s internal format is 770 bytes. The second document (id = 1001) is not inlined, but it can be inlined if the inline length is increased to 2345 or larger. The document with id = 1005 cannot be inlined because it is too large to fit on a single page together with the other columns in the table.


SELECT id,
       ADMIN_IS_INLINED(info) AS inlined,
       ADMIN_EST_INLINE_LENGTH(info) AS inline_length
FROM customer;

ID               INLINED          INLINE_LENGTH
---------------- ---------------- ---------------
1000             1                770
1001             0                2345
1002             1                796
1003             0                1489
1004             0                1910
1005             0                -1

6 record(s) selected.

Figure 3.22

Examining the required inlined length for speciﬁc XML documents

For a proposed inline length, such as 1500 bytes, the query in Figure 3.23 tells you how many documents in the column would be inlined if this inline length was used.

SELECT COUNT(*) AS doc_count
FROM customer
WHERE ADMIN_EST_INLINE_LENGTH(info) BETWEEN 0 AND 1500;

DOC_COUNT
----------------
3

1 record(s) selected.

Figure 3.23

Estimating the effectiveness of a proposed inline length

Figure 3.24 gives an example of a more comprehensive report on the distribution of document sizes in a table. It shows that two documents require no more than 1000 bytes each, four documents can be stored in at most 2000 bytes each, ﬁve ﬁt into 3000 bytes each, no potentially “inlinable” document is larger than 3000 bytes, and one document is too big to be inlined.
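A report of this kind can be produced in several ways. The following is our own minimal sketch of one possibility, not a reproduction of the query in Figure 3.24; the bucket boundaries of 1000, 2000, and 3000 bytes and all column names are chosen for illustration.

```sql
-- Cumulative distribution of estimated inline lengths (a sketch).
SELECT SUM(CASE WHEN len BETWEEN 0 AND 1000 THEN 1 ELSE 0 END) AS up_to_1000,
       SUM(CASE WHEN len BETWEEN 0 AND 2000 THEN 1 ELSE 0 END) AS up_to_2000,
       SUM(CASE WHEN len BETWEEN 0 AND 3000 THEN 1 ELSE 0 END) AS up_to_3000,
       SUM(CASE WHEN len > 3000 THEN 1 ELSE 0 END)             AS over_3000,
       SUM(CASE WHEN len = -1 THEN 1 ELSE 0 END)               AS too_large
FROM (SELECT ADMIN_EST_INLINE_LENGTH(info) AS len
      FROM customer) AS t;
```

Applied to the six estimates listed in Figure 3.22 (770, 2345, 796, 1489, 1910, and -1), these sums come to 2, 4, 5, 0, and 1, which agrees with the distribution described above.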


SELECT SUM(a) AS ")

>

>

less-than symbol ( SELECT XMLCAST(XMLQUERY('$BOOKINFO/bookstore/book/title') AS VARCHAR(35)) as title FROM shelf;

TITLE
------------------------------------
Helen's story about foxes & rabbits

1 record(s) selected.

Figure 4.23

Retrieving the title as SQL type VARCHAR

4.7

UNDERSTANDING XML WHITESPACE AND DOCUMENT STORAGE

Most XML documents contain whitespace, and its purpose is typically to improve readability. According to the XML standard, whitespace is any of the following characters and their respective Unicode code points.
• space character (0x20)
• CR, carriage return (0x0D)
• LF, line feed (0x0A)
• tab (0x09)

The XML standard mandates that XML parsers must remove or replace any CR characters (0x0D) that appear in an XML document. Any two-character sequence CR LF is replaced by a single LF, and any CR character that is not followed by LF is also converted to a single LF.

Whitespace can occur at various places in an XML document. For example, the simple document in Figure 4.24 contains whitespace in the following locations:
• Between the element name “a” and the attribute “x”
• On both sides of the “=” character that belongs to the attribute “x”
• Within the double quotes that enclose the value of the attribute “x”
• Between the start tag of element “a” and the start tag of element “b”
• Trailing whitespace within the start and end tag of element “b” and within the end tag of element “a”
• Between the start and end tag of element “b”
• Between the end tag of element “b” and the start tag of element “c”
• Inside the text value of element “c”
• Between the end tag of element “c” and the end tag of element “a”


Inserting and Retrieving XML Data

Figure 4.24

A sample document with whitespace

The location of the whitespace matters. Depending on where a whitespace character occurs it is considered one of four types of whitespace:
• Insignificant whitespace (trailing spaces in element or attribute names, spaces around the equality [=] symbol of an attribute, and others)
• Significant whitespace (within attribute and element values)
• Boundary whitespace (between one tag and the next, if no other characters occur there)
• Known whitespace (a single whitespace that precedes an attribute name)

Figure 4.25 shows the same XML document as in Figure 4.24 and identifies the four types of whitespace. Note that the whitespace between the start and end tag of element “b” is considered boundary whitespace and not significant whitespace, because there are no other non-whitespace characters in the text value of element “b”. The whitespace in the text value of element “c” is significant, because there is another non-whitespace character (“2”) adjacent to this whitespace.

[Figure: the sample document from Figure 4.24 with its whitespace labeled as significant, known, insignificant, or boundary]

Figure 4.25

Different types of whitespace

XML parsers always remove all insigniﬁcant whitespace, which is not speciﬁc to DB2 but required by the XML standard. The XML standard provides no option to preserve insigniﬁcant whitespace during XML parsing. On the other hand, signiﬁcant whitespace is always preserved and there is no option to strip signiﬁcant whitespace. Known whitespace is a single space (U+0020) that separates an attribute name from a preceding element name or attribute. Known whitespace is removed during XML parsing and not stored with the document. But, it gets reinjected during serialization when you retrieve the XML data in text format. Boundary whitespace can be preserved or removed (stripped). Figure 4.26 shows two versions of the sample document from Figure 4.25. In the ﬁrst version, all insigniﬁcant and boundary whitespace has been stripped from the document. In the second version, insigniﬁcant whitespace has been stripped but boundary whitespace has been preserved. In DB2, the default behavior is to strip boundary whitespace, but you can choose to preserve boundary whitespace, if desired.

4.7

Understanding XML Whitespace and Document Storage

91

Figure 4.26 Sample document with and without boundary whitespace preserved (first version: boundary whitespace stripped; second version: boundary whitespace preserved)

NOTE You can preserve boundary whitespace only if you insert or update documents without validation against an XML Schema. Validation always forces boundary whitespace to be stripped.

4.7.1 Preserving XML Whitespace

DB2’s default behavior to strip boundary whitespace is desirable because it saves space on disk and in memory. Additionally, whitespace is typically not meaningful for applications that consume XML data. Hence, this default is likely the right choice for your application. However, if you encounter a case where boundary whitespace has to be preserved, DB2 supports three ways to enable whitespace preservation. Ordered by their precedence, they are:

• The special attribute xml:space inside XML documents
• The explicit strip/preserve whitespace option in the XMLPARSE function
• Changing the DB2 default behavior from “strip” to “preserve” with the CURRENT IMPLICIT XMLPARSE OPTION (see section 4.7.2)

The XML standard defines the optional attribute xml:space that controls the stripping or preservation of whitespace. It can have the values preserve or default, where default means that whitespace is stripped. This attribute can be included in any element in an XML document. It affects the entire subtree under this element, unless it is overridden by other xml:space attributes at a deeper level of the document. If the xml:space attribute appears only in the root element of a document, then it affects all boundary whitespace in the entire document. Any xml:space attributes override any whitespace settings in the XMLPARSE function or the CURRENT IMPLICIT XMLPARSE OPTION.

The drawback of xml:space attributes is that they often do not occur in XML documents and it can be time consuming to add them to every document before insertion into DB2. Also, when an xml:space attribute is in place, its effect can only be changed by removing or modifying the attribute in each document. Due to this lack of flexibility it is recommended not to use xml:space attributes. Instead, use the explicit whitespace option in the XMLPARSE function or the CURRENT IMPLICIT XMLPARSE OPTION, which we explain later.

92

Chapter 4

Inserting and Retrieving XML Data

Let’s look at the four INSERT statements in Table 4.3 through Table 4.6. They all insert a document with whitespace such as indentation and line breaks. The right column in each table shows the document and its whitespace after it has been retrieved from DB2. Run these INSERT statements in the CLP with the -t and the -q options (db2 -t -q). The -t option sets the semicolon as the default statement terminator. The -q option ensures that the CLP, as an application program for DB2, does not remove new line characters or other whitespace when sending statements to the DB2 server.

The INSERT statement in Table 4.3 does not specify any whitespace option, which implies that all boundary whitespace is stripped. Since boundary whitespace includes line breaks, the document after retrieval is a continuous string without line breaks, spilling over multiple lines as needed. Note that significant whitespace in the title element has been preserved; that is, the spaces between the words This, is, a, space, and story.

Table 4.3 Inserting XML without Preserving Whitespace

INSERT statement:
INSERT INTO shelf VALUES (10, ' 1851586666 This is a space story ')

Document after retrieval from DB2:
1851586666This is a space story

The document that is inserted in Table 4.4 carries an xml:space attribute with the value preserve, which means that all boundary whitespace in this document is preserved. Hence, when you retrieve the document from DB2 all line breaks and indentation match the original document.

Table 4.4 Inserting an XML Document with xml:space Attribute

INSERT statement:
INSERT INTO shelf VALUES (11, ' 1851586666 This is a space story ')

Document after retrieval from DB2:
1851586666 This is a space story

The INSERT statement in Table 4.5 wraps the XMLPARSE function with the explicit PRESERVE WHITESPACE clause around the document, which also preserves all boundary whitespace.

Table 4.5 Inserting an XML Document with the XMLPARSE Function

INSERT statement:
INSERT INTO shelf VALUES (12,XMLPARSE(DOCUMENT ' 1851586666 This is a space story ' PRESERVE WHITESPACE))

Document after retrieval from DB2:
1851586666 This is a space story

The INSERT statement in Table 4.6 uses the XMLPARSE function with the STRIP WHITESPACE option, and the document also carries the xml:space attribute in the book element. The effect is that all boundary whitespace is stripped, except within the book element and its child elements. The line breaks and indentation within the book element have been preserved according to the xml:space attribute.

Table 4.6 Interaction between the XMLPARSE Function and xml:space Attribute

INSERT statement:
INSERT INTO shelf VALUES (13,XMLPARSE(DOCUMENT ' 1851586666 This is a space story ' STRIP WHITESPACE))

Document after retrieval from DB2:
1851586666 This is a space story
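The interaction between the strip default and the xml:space attribute can also be tried outside of DB2. The following Python sketch is not DB2's parser. It only mimics the rules described above, under the assumption that boundary whitespace corresponds to text nodes that consist solely of whitespace. The function name strip_boundary_whitespace and the element names in the sample are hypothetical.

```python
from xml.dom import minidom

XML_NS = "http://www.w3.org/XML/1998/namespace"

def strip_boundary_whitespace(xml_text):
    """Drop whitespace-only text nodes unless xml:space="preserve" applies."""
    doc = minidom.parseString(xml_text)

    def visit(node, preserve):
        # An xml:space attribute overrides the inherited setting for its subtree.
        if node.nodeType == node.ELEMENT_NODE and node.hasAttributeNS(XML_NS, "space"):
            preserve = node.getAttributeNS(XML_NS, "space") == "preserve"
        for child in list(node.childNodes):
            if child.nodeType == child.TEXT_NODE:
                # Whitespace next to other characters is significant and is kept.
                if not preserve and child.data.strip(" \t\r\n") == "":
                    node.removeChild(child)
            else:
                visit(child, preserve)

    visit(doc.documentElement, False)
    return doc.documentElement.toxml()

# Hypothetical sample: only the book subtree asks for preservation.
sample = (
    "<shelf>\n"
    '  <book xml:space="preserve">\n'
    "    <title>This is a space story</title>\n"
    "  </book>\n"
    "</shelf>"
)
print(strip_boundary_whitespace(sample))
```

In this sketch the line breaks around the book element disappear, while the indentation inside the book element survives, which mirrors the behavior shown in Table 4.6.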

4.7.2 Changing the Whitespace Default from “Strip” to “Preserve”

If you always need to preserve boundary whitespace you might find it tedious to ensure that all applications always use the XMLPARSE function with the PRESERVE WHITESPACE option. In this case it is easier to change DB2’s default behavior from STRIP WHITESPACE to PRESERVE WHITESPACE and avoid using the XMLPARSE function. In DB2 for Linux, UNIX, and Windows, the default behavior is controlled by a DB2 special register called CURRENT IMPLICIT XMLPARSE OPTION. It enables you to specify the whitespace handling per session (connection). You can change the default in several ways:

• Use the following statement from an application or the DB2 CLP:
SET CURRENT IMPLICIT XMLPARSE OPTION = 'PRESERVE WHITESPACE'

• For CLI applications, add the following entry to the db2cli.ini file:
CurrentImplicitXMLParseOption = 'PRESERVE WHITESPACE'


You can edit this file manually, or issue the UPDATE CLI CONFIGURATION command:
UPDATE CLI CONFIGURATION FOR SECTION USING CurrentImplicitXMLParseOption '"PRESERVE WHITESPACE"'

• In CLI applications you can also use the function SQLSetConnectAttr() to set the connection attribute SQL_ATTR_CURRENT_IMPLICIT_XMLPARSE_OPTION. It can be set before or after establishing a connection.

Remember that the XMLPARSE function can always be used explicitly to override the default.

4.7.3 Storing XML Documents for Compliance

Many applications have the requirement that once they store an XML document they can get “the same” document back. The key question is how the application defines “the same.” In many cases “the same” means that all element and attribute tags, all element and attribute values, all comments, processing instructions and namespaces, and all significant whitespace have to be returned in the same order and representation as in the original document. This notion of “the same” is sometimes also called Document Object Model fidelity. It means that the structure and data content of your XML documents are always preserved and reproducible, including digital signatures. DB2’s pureXML storage provides this fidelity.

Some applications may take their definition of “the same” one step further. They might require that any XML document that they retrieve from a database is 100% byte-for-byte identical to the one that was inserted, including all insignificant whitespace. To ensure that the documents are byte-for-byte identical you must avoid XML parsing, because the output from an XML parser does not always contain all bytes that were in the original document. This behavior is independent of database storage and inherent in how XML parsing is defined by the XML standard. For example, XML parsers are required by the XML standard to remove insignificant whitespace and normalize line endings. Otherwise they are not compliant.

If you require exact byte-for-byte retention of XML documents, then an XML column, which stores XML in a parsed format, should not be your only storage choice for the documents. You should store a second copy of each document in a BLOB or VARCHAR FOR BIT DATA column in the same row. The parsed XML storage allows efficient querying while the binary copy is for auditing or compliance purposes. Note that character data types, such as CLOB or VARCHAR, do not guarantee that documents are stored without any byte modifications, because character data can be subject to code page conversion. Code page issues are explained in Chapter 20.
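The difference between the two notions of “the same” can be observed with any XML parser. The following Python sketch uses the standard library parser as a stand-in, not DB2, and the sample document is made up for illustration. It parses and re-serializes a document and then compares SHA-256 digests, which is one possible way an application could check a retained binary copy against what it reads back.

```python
import hashlib
from xml.dom import minidom

def sha256_hex(data):
    """Return the SHA-256 digest of a byte string as hex text."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical document with insignificant whitespace inside the tags
# and Windows-style line endings.
original = b'<doc  id = "7" >\r\n  <item>text</item >\r\n</doc >'

# Parsing followed by serialization: structure and values survive,
# but not every original byte does.
round_tripped = minidom.parseString(original).documentElement.toxml().encode("utf-8")

same_bytes = sha256_hex(original) == sha256_hex(round_tripped)
print("byte-for-byte identical:", same_bytes)
print(round_tripped)
```

The digests differ because the parser drops the extra spaces inside the tags and normalizes the line endings, even though the element, the attribute value, and the text content are unchanged.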

4.8 SUMMARY

The basic manipulation of XML documents in DB2 is easy. You can use the familiar SQL statements INSERT, SELECT, and DELETE to add, retrieve, and remove XML documents from an XML column in a DB2 table. UPDATE statements can replace or modify XML documents, which is further discussed in Chapter 12. In INSERT, SELECT, and UPDATE statements, applications can use parameter markers and host variables to exchange XML documents with the DB2 server. Code samples in various programming languages are provided in Chapter 21.

If you include an XML column name in the SELECT list of an SQL query, the column type in the result set is XML and the XML documents are implicitly serialized to their textual representation upon retrieval. Alternatively, the XMLSERIALIZE function allows you to perform explicit serialization. Explicit serialization means that the text form of the XML documents is returned in a non-XML data type of your choosing, such as BLOB, CLOB, or VARCHAR. The XMLSERIALIZE function can be used to force the generation of an XML declaration at the beginning of any document that you retrieve from DB2.

The XML standard defines several reserved characters as well as whitespace characters. Reserved characters, such as the less-than sign ("

OFF='1666' LEN='421' />"

Content of the delimited format flat file cust_exp.del

The file cust_exp.del.001.xml contains all the XML documents from the exported XML column concatenated together, as shown in Figure 5.4. The second of the six documents is highlighted in bold. As indicated in the DEL file, it begins at byte offset 281 and has a length of 283.

100

Chapter 5

Moving XML Data

You can actually count the characters in Figure 5.4 to verify that this is true. Also note that this concatenation of documents does not produce a single well-formed document because a single root element is missing.

< name>Kathy Smith5 Rosewood< /street>TorontoOntarioM6W 1E6416-555-1358Kathy Smith25 EastCreekMarkhamOn tarioN9C 3T6905-555-7258Jim Noodle25 EastCreekMarkha mOntarioN9C 3T6905-555-7258 Robert Shoemaker1596 BaselineAuroraOntarioN8X 7F8905-55 5-7258416-555-2937905-555-8743613-555-3278...

Figure 5.4 Content of the XML data file cust_exp.del.001.xml

5.1.2 Exporting XML Documents as Individual Files

In some situations exporting each XML document into a separate file can be desirable. To do this you need to specify the clause MODIFIED BY with the option xmlinsepfiles. This is shown in Figure 5.5.

EXPORT TO c:\mydata\cust_exp.del OF DEL
MODIFIED BY xmlinsepfiles
SELECT * FROM customer2;

Figure 5.5 Exporting XML documents as separate files

This EXPORT command produces n + 1 files where n is the number of XML documents in the exported XML column. In our example it produces the following seven files in the directory c:\mydata:

• cust_exp.del
• cust_exp.del.001.xml
• cust_exp.del.002.xml
• cust_exp.del.003.xml

5.1

Exporting XML Data in DB2 for Linux, UNIX, and Windows

101

• cust_exp.del.004.xml
• cust_exp.del.005.xml
• cust_exp.del.006.xml

The first file is the delimited format flat file that contains the relational data of the exported result set together with pointers to the exported XML documents. These pointers (XML Data Specifiers) look different now because each XML document is exported as a separate file in the file system (see Figure 5.6). Offset and length are no longer required, just the file name of each individual XML document. These file names are derived from the name of the delimited format flat file and extended with an increasing number and the extension .xml. The file numbers start with three digits and additional digits are used as needed when large numbers of documents are exported.

1000,"" />" />"

Figure 5.6 Content of the delimited format flat file cust_exp.del

Remember that the examples in this chapter use the table customer2, which has an INTEGER column and an XML column. The table customer, which is readily available in the DB2 sample database, has an INTEGER column and two XML columns, info and history. Since the history column is initially empty (NULL), exporting all columns from the customer table leads to odd-numbered file names: cust_exp.del.001.xml, cust_exp.del.003.xml, cust_exp.del.005.xml, and so on. The even-numbered file names would be used for the documents in the history column, but it is NULL and so these file names are not used.

The xmlinsepfiles option used in Figure 5.5 is just one of many possible options that can be specified in the MODIFIED BY clause of the EXPORT command. Table 5.1 summarizes other options relevant to XML data.

Table 5.1 XML Relevant Modifiers for the EXPORT Command

Modified by: xmlinsepfiles
Description: This option writes each XML document to a separate file. Without this option, all documents are by default concatenated into a single file.

Modified by: xmlnodeclaration
Description: This option produces XML documents without an XML declaration. Without this option the default behavior is that each exported XML document carries an XML declaration with an encoding attribute.

Modified by: xmlchar
Description: This option writes the exported XML documents in the character codepage. The character codepage is the same as the application codepage unless the codepage option of the EXPORT command is specified. Without the xmlchar option, XML documents are by default written out in Unicode. Chapter 20 provides a deeper discussion of code pages and XML document encodings.

Modified by: xmlgraphic
Description: This option writes the exported XML documents in the UTF-16 code page regardless of the application code page or the codepage modifier.

5.1.3 Exporting XML Documents as Individual Files with Non-Default Names

If you want the exported XML documents to have file names that are not based on the file name of the delimited format flat file, use the XMLFILE clause of the EXPORT command to specify a different file name prefix. The command in Figure 5.7 exports the table customer2 and writes all XML documents to separate files whose names start with custdoc.

EXPORT TO c:\mydata\cust_exp.del OF DEL
XMLFILE custdoc
MODIFIED BY xmlinsepfiles
SELECT * FROM customer2;

Figure 5.7 Exporting XML documents to files with custom file names

This command produces the following files:

• cust_exp.del
• custdoc.001.xml
• custdoc.002.xml
• custdoc.003.xml
• custdoc.004.xml
• custdoc.005.xml
• custdoc.006.xml

The XMLFILE clause can also be used without the xmlinsepfiles option; in that case, all documents are combined into a single file whose name starts with custdoc.

5.1.4 Exporting XML Documents to One or Multiple Dedicated Directories

The EXPORT command allows you to write the exported XML documents to a dedicated directory that is different from the directory where the delimited format file is written. To achieve this, use the XML TO clause to specify an existing directory, as shown in Figure 5.8. This EXPORT command writes the delimited format flat file cust_exp.del to the directory /mydata, and the six XML documents in six separate files to the directory /mydata/customer.

EXPORT TO /mydata/cust_exp.del OF DEL
XML TO /mydata/customer
MODIFIED BY xmlinsepfiles
SELECT * FROM customer2;

Figure 5.8 Exporting XML documents as individual files to a dedicated directory

If the XML TO clause specifies a list of multiple directories, as in Figure 5.9, the XML documents are distributed evenly among them in a round-robin fashion.

EXPORT TO /mydata/cust_exp.del OF DEL
XML TO /mydata/cust1, /mydata/cust2
XMLFILE custdoc
MODIFIED BY xmlinsepfiles
SELECT * FROM customer2;

Figure 5.9 Exporting XML documents as separate files to multiple directories

This EXPORT command produces the following files:

• /mydata/cust1/custdoc.001.xml
• /mydata/cust1/custdoc.003.xml
• /mydata/cust1/custdoc.005.xml
• /mydata/cust2/custdoc.002.xml
• /mydata/cust2/custdoc.004.xml
• /mydata/cust2/custdoc.006.xml

You can later invoke the IMPORT or LOAD utility with the same two paths, /mydata/cust1 and /mydata/cust2, to have DB2 read the same documents in the same round-robin fashion.

If you specify multiple target directories in the XML TO clause but omit the xmlinsepfiles option, as in Figure 5.10, then the EXPORT utility concatenates the exported XML documents into multiple large files, one per target directory.

EXPORT TO /mydata/cust_exp.del OF DEL
XML TO /mydata/cust1, /mydata/cust2
XMLFILE custdoc
SELECT * FROM customer2;

Figure 5.10 Exporting XML documents to multiple directories


This EXPORT command produces the following three files:

• The delimited format flat file cust_exp.del in the directory /mydata
• A file called custdoc.001.xml in the directory /mydata/cust1
• A file called custdoc.002.xml in the directory /mydata/cust2

The exported XML documents are evenly distributed across the two files custdoc.001.xml and custdoc.002.xml. The delimited format flat file cust_exp.del contains the rows shown in Figure 5.11. It reveals that the first, third, and fifth documents are stored in the file custdoc.001.xml, while the second, fourth, and sixth documents are stored in custdoc.002.xml. Each document is precisely identified by its offset and length.

1000,"" OFF='563' LEN='412' />" OFF='691' LEN='421' />"

Figure 5.11 Content of the delimited format flat file cust_exp.del
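The round-robin placement and file naming used with the xmlinsepfiles option in Figure 5.9 can be sketched in a few lines of Python. This is only an illustration of the pattern described in this section, not the EXPORT utility's actual code, and the function name exported_file_paths is hypothetical. It assumes one XML document per exported row and a single XML column.

```python
def exported_file_paths(num_docs, directories, prefix):
    """Predict the target path of each exported XML document.

    Documents are numbered from 1. Document n goes to the directories
    in turn, and its file name is the prefix followed by a number with
    at least three digits and the extension .xml.
    """
    paths = []
    for n in range(1, num_docs + 1):
        directory = directories[(n - 1) % len(directories)]
        paths.append(f"{directory}/{prefix}.{n:03d}.xml")
    return paths

for path in exported_file_paths(6, ["/mydata/cust1", "/mydata/cust2"], "custdoc"):
    print(path)
```

For six documents and the two directories from Figure 5.9, the sketch places the odd-numbered files in /mydata/cust1 and the even-numbered files in /mydata/cust2, in line with the file list given for that figure.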

5.1.5 Exporting Fragments of XML Documents

Up to now we have looked at exporting whole documents. It is also possible to export document fragments that may or may not be well-formed documents. To achieve this you can use the EXPORT command with any XQuery or SQL/XML query, such as the ones that we discuss in Chapters 6 through 9, which cover XML queries. Let’s consider the following examples.

The command in Figure 5.12 exports all phone elements from each of the six XML documents in the info column of the table customer2. It writes six rows to the output files, one for each XML document. Each row contains one or more phone elements, depending on the number of phone elements in the respective document. If a row contains a sequence of multiple phone elements without a common root element, then this value is not a well-formed XML document.

EXPORT TO /mydata/phones.del OF DEL
SELECT XMLQUERY('$INFO/customerinfo/phone') FROM customer2;

Figure 5.12 Exporting document fragments

The query in the EXPORT command can also be an XPath or XQuery expression, as shown in Figure 5.13. Similar to the previous example in Figure 5.12, this command also exports all phone elements from all six customer documents. However, it writes each phone element to a separate row in the output file, even if multiple phone elements come from the same XML document. This is because XQuery and SQL/XML queries that seem to be equivalent can produce result sets with different cardinalities. For details, please refer to Chapter 8 (see section 8.3.3, Result Set Cardinalities in XQuery and SQL/XML).

EXPORT TO /mydata/phones.del OF DEL
XQUERY db2-fn:xmlcolumn("CUSTOMER2.INFO")/customerinfo/phone;

Figure 5.13 Exporting document fragments as well-formed documents

5.1.6 Exporting XML Data with XML Schema Information

An XML column can contain XML documents that have been validated against one or multiple XML Schemas when they were inserted or loaded. When you export validated XML documents, the EXPORT utility can produce information that tells you for each document which XML Schema it belongs to. This is achieved with the XMLSAVESCHEMA option in the EXPORT command. For each exported XML document that was validated against an XML Schema, the fully qualified SQL identifier of that XML Schema is stored as an attribute (SCH) in the corresponding XML Data Specifier (XDS). The SQL identifier of the XML Schema is the name under which you registered the XML Schema in DB2. If the exported document was not validated against an XML Schema or the schema no longer exists in the database, the SCH attribute is not included in the corresponding XDS. Figure 5.14 shows the command to export documents with XML Schema information.

EXPORT TO /mydata/cust_exp.del OF DEL
MODIFIED BY xmlinsepfiles
XMLSAVESCHEMA
SELECT * FROM customer2;

Figure 5.14 Exporting documents specifying the XML Schema

The delimited format flat file produced might look like the one in Figure 5.15. In this example it shows that the first two documents were validated against the XML Schema with the SQL identifier DB2ADMIN.CUSTXSD. The third and the fifth documents were validated against schema DB2ADMIN.CUSTXSD2, while the fourth and the sixth documents are not associated with any XML Schema. This information reflects how documents were validated at insert time, if at all. If you load or import the exported XML documents and use this delimited format flat file as input, the documents can be validated against their respective XML Schemas, if those schemas exist in the database.

1000,"" SCH='DB2ADMIN.CUSTXSD2'/>" />"

Figure 5.15 Content of the delimited format flat file cust_exp.del
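Because each XML Data Specifier is itself a small XML element, an application can read the schema information back out of the delimited format flat file with ordinary tools. The following Python sketch assumes a two-column file (an integer key and one XDS) and the XDS attribute names FIL and SCH. The function name read_xds_rows and the file names in the sample are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

def read_xds_rows(del_text):
    """Return (key, file name, schema or None) for each row of a DEL file."""
    rows = []
    for key, xds in csv.reader(io.StringIO(del_text)):
        # Each XDS is an empty XML element; its attributes carry the pointers.
        attrs = ET.fromstring(xds).attrib
        rows.append((int(key), attrs.get("FIL"), attrs.get("SCH")))
    return rows

# Hypothetical input resembling an export with XMLSAVESCHEMA.
sample = (
    "1000,\"<XDS FIL='cust_exp.del.001.xml' SCH='DB2ADMIN.CUSTXSD' />\"\n"
    "1001,\"<XDS FIL='cust_exp.del.002.xml' />\"\n"
)
for row in read_xds_rows(sample):
    print(row)
```

A row whose XDS carries no SCH attribute yields None for the schema, which corresponds to a document that was not validated or whose schema no longer exists in the database.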

5.2 IMPORTING XML DATA IN DB2 FOR LINUX, UNIX, AND WINDOWS

In DB2 9.1 for Linux, UNIX, and Windows you can use the IMPORT utility to move XML data into an XML column. Since DB2 Version 9.5 you can also use the LOAD utility to load XML data. The choice between IMPORT and LOAD largely depends on operational considerations, which are similar for XML and relational data:

• The LOAD utility typically performs better than the IMPORT utility because:
  • It operates at the DB2 page level, whereas the IMPORT utility operates at the row level.
  • The data loaded by the LOAD utility is not logged in the transaction log.
  • The LOAD utility automatically parallelizes its workload.
• If you use the IMPORT utility, then the target table can be kept fully accessible to other applications for insert and query operations at all times. In particular, you can start an IMPORT operation while other queries on the table are in progress. The LOAD utility has an online mode that allows queries (but no writes) against the target table while the LOAD is in progress. However, queries that started prior to the LOAD must be quiesced before a LOAD or online LOAD can be started.
• If you have triggers on the target table, then these are fired if the IMPORT utility is used, but are not fired if the LOAD utility is used.
• Both the IMPORT and LOAD utilities can optionally perform XML Schema validation and preserve whitespace in the XML documents.

The IMPORT and LOAD utilities can be viewed as inverse operations to EXPORT. In particular, the IMPORT and LOAD utilities can directly consume the output produced by the EXPORT utility; that is, a delimited format flat file that contains pointers to the XML documents that reside in one or multiple separate files. If you want to IMPORT or LOAD data that wasn’t previously exported with the EXPORT command, you need to produce a delimited format file that looks as if it had been produced by the EXPORT utility.

5.2.1 IMPORT Command and Input Files

Assume you want to use the IMPORT command to add new rows to the table customer2, and that you have a directory c:\mydata in the file system that contains several files with XML documents that you want to import. This directory could contain thousands of files, but in this example let’s assume that you just have two XML files called data2.xml and data3.xml, each containing a single XML document. You can produce a delimited format flat file, such as the file data.del in Figure 5.16, which contains two columns. The first column holds INTEGER values for the first column of the target table, and the second column holds pointers to the XML documents that you want to import into the second column of the target table.

2000,"<XDS FIL='data2.xml' />"
2001,"<XDS FIL='data3.xml' />"

Figure 5.16 Content of the delimited format input file data.del
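When a directory holds many XML files, the delimited format input file is usually generated rather than typed. The following Python sketch shows one way to do this. It assumes that the XML Data Specifier takes the form <XDS FIL='...' /> and that the file names contain no quotes or other characters that would need escaping. The function names xds_row and build_del_lines are hypothetical.

```python
def xds_row(key, file_name):
    """Format one row: an integer key and a pointer to one XML file."""
    xds = f"<XDS FIL='{file_name}' />"
    return f'{key},"{xds}"'

def build_del_lines(file_names, first_key):
    """Assign consecutive keys, starting at first_key, to the given files."""
    return [xds_row(first_key + i, name) for i, name in enumerate(file_names)]

for line in build_del_lines(["data2.xml", "data3.xml"], 2000):
    print(line)
```

An application could pass a sorted listing of the directory to build_del_lines and write the resulting lines to data.del before running the IMPORT command.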

With this delimited format input file you can execute the IMPORT command shown in Figure 5.17. It assumes that the file data.del as well as the XML documents data2.xml and data3.xml are all located in the current directory. The keywords OF DEL indicate that the input file data.del is of type delimited format.

IMPORT FROM data.del OF DEL INSERT INTO customer2;

Figure 5.17 Importing XML documents

If the required files are not located in the local directory then you must provide appropriate paths. For example, if the file data.del is located in the directory c:\mydata, and the XML documents are in the directory c:\mydata\myxml, then the IMPORT command in Figure 5.18 obtains the files from the appropriate locations.

IMPORT FROM c:\mydata\data.del OF DEL
XML FROM c:\mydata\myxml
INSERT INTO customer2;

Figure 5.18 Importing XML documents from specific locations

NOTE Incorrect file paths in the IMPORT command are a very common mistake, so pay extra attention to them!

If you need to load XML data that was previously exported to multiple directories, specify the list of directories in the XML FROM clause. This clause corresponds to the XML TO clause of the EXPORT command.


If the two XML documents data2.xml and data3.xml happen to be concatenated as a single file (for example, docs.xml), then the delimited format input file needs to specify offset and length for each document, as in Figure 5.19. The first XML document starts at an offset of 0 bytes into the file and is 281 bytes long. The second XML document starts at offset 281 and is 283 bytes long, and so on for all XML documents that may be in the same file. Since it is tedious to determine the number of bytes of each document, such an input file with offsets and lengths is typically only used if it is available from a previous EXPORT operation or generated by an application.

2000,"<XDS FIL='docs.xml' OFF='0' LEN='281' />"
2001,"<XDS FIL='docs.xml' OFF='281' LEN='283' />"

Figure 5.19 Input file for multiple concatenated documents
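Generating offsets and lengths from an application is straightforward as long as they are counted in bytes rather than characters. The following Python sketch concatenates documents and produces the matching rows. It assumes UTF-8 encoded documents and the XDS attributes FIL, OFF, and LEN. The function names concatenate_with_xds and extract_document, and the tiny sample documents, are hypothetical.

```python
def concatenate_with_xds(documents, xml_file_name, first_key):
    """Return (rows for the DEL file, concatenated bytes for the XML file)."""
    blob = bytearray()
    lines = []
    for i, text in enumerate(documents):
        data = text.encode("utf-8")
        # The offset is the number of bytes written so far; LEN counts bytes too.
        xds = f"<XDS FIL='{xml_file_name}' OFF='{len(blob)}' LEN='{len(data)}' />"
        lines.append(f'{first_key + i},"{xds}"')
        blob += data
    return lines, bytes(blob)

def extract_document(blob, offset, length):
    """Cut one document back out of the concatenated file content."""
    return blob[offset:offset + length]

# The first sample document contains a two-byte character.
lines, blob = concatenate_with_xds(["<a>é</a>", "<b/>"], "docs.xml", 2000)
for line in lines:
    print(line)
```

Slicing the concatenated bytes with the recorded offset and length returns each original document, which is essentially what the utility has to do when it reads such an input file.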

As an aside, what happens if you have more than one XML column in the target table? To populate a table with two XML columns, the delimited format input file has to contain two XML Data Specifiers (XDS) per row, one for each XML column that you want to populate. Such an input file is shown in Figure 5.20.

2000,"",""
2001,"",""

Figure 5.20 Input file to populate an integer column and two XML columns

When you import, insert, or load XML data, boundary whitespace is by default automatically stripped from the XML documents (see section 4.7, Understanding XML Whitespace and Document Storage). If you want to preserve whitespace, specify the XMLPARSE PRESERVE WHITESPACE clause in the IMPORT command (see Figure 5.21).

IMPORT FROM c:\mydata\cust_exp.del OF DEL
XML FROM c:\mydatadata
XMLPARSE PRESERVE WHITESPACE
INSERT INTO customer2;

Figure 5.21 Importing XML data into a table and preserving whitespace

5.2.2 Import/Insert Performance Tips

Several performance guidelines are common to all methods of populating a table with XML data. If you have multiple user-defined XML indexes on a table, it is typically better to define them before populating the table rather than creating them afterwards. The reason is that during INSERT, LOAD, or IMPORT, each XML document is processed only once to generate index entries for all XML indexes. If multiple CREATE INDEX statements are issued afterwards, all documents in the XML column are traversed multiple times, once for each index.


Even if you have not defined any indexes on the target table, DB2’s pureXML storage mechanism transparently maintains regions and path indexes for efficient XML storage access (see Chapter 3, Designing and Managing XML Storage Objects). Take these indexes into account when determining buffer pool sizes.

Just as for relational data, you can issue the ALTER TABLE APPEND ON command, which enables append mode for the table. New data is appended to the end of the table instead of searching for free space on existing pages. This can improve the runtime performance of bulk inserts or imports.

You can avoid logging if you use the ALTER TABLE ACTIVATE NOT LOGGED INITIALLY command. However, be warned that if there is a statement failure, the table will be marked as inaccessible and must be dropped. This risk often prohibits using the NOT LOGGED INITIALLY (NLI) option for incremental bulk inserts in production systems. The option can be useful for the initial population of an empty table. Beware that NLI prevents concurrent inserts/imports into a target table and that parallelism can yield higher performance than NLI.

If you use the IMPORT command, a small value for the COMMITCOUNT parameter tends to hurt performance. Committing every 100 rows or more will perform better than committing every row. An IMPORT command with an explicit COMMITCOUNT parameter is shown in Figure 5.22.

IMPORT FROM c:\mydata\data.del OF DEL
XML FROM c:\mydata
COMMITCOUNT 100
INSERT INTO customer2;

Figure 5.22 IMPORT command with COMMITCOUNT parameter
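The effect of the commit interval is easy to see in application code that inserts rows itself. The following Python sketch uses an in-memory SQLite database purely as a stand-in for DB2, so it says nothing about DB2's actual timings. It only shows the batching pattern and how the number of commits shrinks as the interval grows. The function name insert_with_commitcount and the sample rows are hypothetical.

```python
import sqlite3

def insert_with_commitcount(conn, rows, commitcount):
    """Insert rows and commit after every `commitcount` rows.

    Returns the number of commits that were issued.
    """
    commits = 0
    pending = 0
    for row in rows:
        conn.execute("INSERT INTO customer2 (cid, info) VALUES (?, ?)", row)
        pending += 1
        if pending == commitcount:
            conn.commit()
            commits += 1
            pending = 0
    if pending:
        # Commit the final partial batch.
        conn.commit()
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer2 (cid INTEGER, info TEXT)")
rows = [(2000 + i, "<customerinfo/>") for i in range(250)]
print(insert_with_commitcount(conn, rows, 100), "commits for 250 rows")
```

With an interval of 100, the 250 sample rows need three commits instead of the 250 that a commit after every row would require.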

To achieve higher performance than provided by the IMPORT utility, consider using the LOAD utility instead, which automatically parallelizes its work.

5.3 LOADING XML DATA IN DB2 FOR LINUX, UNIX, AND WINDOWS

Since DB2 9.5 for Linux, UNIX, and Windows you can use the LOAD utility to move XML documents into a table. The key advantages of the LOAD utility are the same for XML as for relational data. For example, the data is not logged and parallelism is automatically used to increase performance. DB2 determines a default degree of parallelism based on the number of CPUs and table space containers. The syntax for handling XML data in the LOAD command is the same as the XML-specific syntax in the IMPORT command. For example, the only difference between the LOAD command in Figure 5.23 and the IMPORT command in Figure 5.18 is that the keyword IMPORT has been replaced by the keyword LOAD.


LOAD FROM c:\mydata\data.del OF DEL
XML FROM c:\mydata\myxml
INSERT INTO customer2;

Figure 5.23 Example of a LOAD command

The LOAD command has several optional parameters that can affect performance. DB2 automatically determines suitable values for these parameters, so you can usually obtain good load performance out-of-the-box without setting any parameters. If you want to try to improve load performance, consider the following parameters:

• DATA BUFFER: This parameter specifies the number of 4KB pages (regardless of the degree of parallelism) to use as buffered space for transferring data within the utility. The data buffers use the utility heap, whose size can be modified through the util_heap_sz database configuration parameter. Large degrees of parallelism require a larger util_heap_sz.
• CPU_PARALLELISM: This parameter specifies the number of threads that the LOAD utility uses for parsing, converting, and formatting records.
• DISK_PARALLELISM: This parameter specifies the number of threads that the LOAD utility uses for writing data to the table space.

After a LOAD operation, the loaded table might be in SET INTEGRITY PENDING state in either READ or NO ACCESS mode. This means that the table is only available for read or not available at all. You can check whether the loaded table is in SET INTEGRITY PENDING status (also known as CHECK PENDING status) by looking at the STATUS column of the catalog view SYSCAT.TABLES and checking for a STATUS value equal to "C" (see Figure 5.24). The value "C" means CHECK PENDING.

SELECT SUBSTR(tabschema,1,10) AS tabschema,
       SUBSTR(tabname,1,10) AS tabname,
       status
FROM syscat.tables
WHERE status = 'C';

TABSCHEMA  TABNAME    STATUS
---------- ---------- ------
DB2ADMIN   CUSTOMER   C

Figure 5.24 Listing tables that are in CHECK PENDING state

One of the most common reasons why a table is placed in CHECK PENDING state after a LOAD operation is that the table has check constraints or referential integrity constraints deﬁned on it. To take a table out of CHECK PENDING state, issue the SET INTEGRITY command:


SET INTEGRITY FOR db2admin.customer2 IMMEDIATE CHECKED

DB2 performs minimal logging for the LOAD utility, because the operations are performed at the DB2 page level and not the DB2 row level. If you have DB2 archive logging enabled (disabled by default) and use the LOAD command, then the table will be placed in BACKUP PENDING status after the load. After the load operation you have to take a backup of the table space containing the table before you issue the SET INTEGRITY command.

An alternative to taking the backup is to specify the COPY YES option in the LOAD command. This option instructs DB2 to perform a backup of the new data while it is being loaded, which avoids the BACKUP PENDING state. Another alternative is to specify the NONRECOVERABLE option in the LOAD command. This option means the table space is not put in BACKUP PENDING state following the LOAD operation and a copy of the loaded data does not have to be made during the load. However, it is not possible to recover the table by a subsequent roll forward action.

You can also move XML data from one table to another using the "load from cursor" option of the LOAD utility. This option allows you to move data between tables without having to unload the data first. In Figure 5.25 a cursor curs is declared. The subsequent LOAD command uses this cursor to move data from the table customer2 into table customer3. Loading XML data from a cursor is supported for tables in the same database but not for moving XML data from one database to another (error SQL1407N).

DECLARE curs CURSOR FOR SELECT cid, info FROM customer;
LOAD FROM curs OF CURSOR INSERT INTO customer3(cid,info);

Figure 5.25 Example of loading data from a cursor

5.4 UNLOADING XML DATA IN DB2 FOR Z/OS

You have two options for unloading data from DB2 for z/OS. You can either use the DSNTIAUL utility or the UNLOAD utility. An example of using the DSNTIAUL utility to unload data from a table called customer is shown in Figure 5.26. The execution of the DSNTIAUL utility in Figure 5.26 produces two output ﬁles, pointed to by SYSREC00 and SYSPUNCH. The SYSPUNCH sequential dataset contains the LOAD statement for you to be able to load the unloaded data into a new table. The SYSREC00 sequential dataset contains the unloaded data, including the XML data.


//DSNTIAUL EXEC PGM=IKJEFT01
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSREC00 DD DSN=USER123.DSN8UNLD.SYSREC00,VOL=SER=P8P007,
//         UNIT=SYSDA,SPACE=(32760,(1000,500)),DISP=(,CATLG)
//SYSPUNCH DD DSN=USER123.DSN8UNLD.SYSPUNCH,
//         UNIT=SYSDA,SPACE=(800,(15,15)),DISP=(,CATLG),
//         RECFM=FB,LRECL=120,BLKSIZE=1200,VOL=SER=P8P007
//SYSTSIN DD *
DSN SYSTEM(ISC9)
RUN PROGRAM(DSNTIAUL) PLAN(DSNTIB91) PARMS('SQL') LIB('ISC910P8.RUNLIB.LOAD')
END
//SYSIN DD *
SELECT * FROM CUSTOMER;

Figure 5.26 Unloading data using the DSNTIAUL utility

You can also use the UNLOAD utility to unload XML data. Remember that in DB2 for z/OS, the XML data of an XML column always resides in an XML table space, separate from the base table space. In the UNLOAD statement you just need to specify the base table space. You do not have to specify the XML table space. An example is shown in Figure 5.27, where the data is unloaded in delimited format. Once you have determined the table space and database for the table you want to unload, you can plug these values into the unload job as shown in Figure 5.27.

//UNLOAD EXEC DSNUPROC,PARM='ISC9,IANTEX',COND=(4,LT)
//SORTLIB DD DSN=SYS1.SORTLIB,DISP=SHR
//SORTOUT DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//DSNTRACE DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSREC DD DSN=USER123.UNLOAD.SYSREC,
//         DISP=(MOD,CATLG,CATLG),
//         UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSPUNCH DD DSN=USER123.UNLOAD.SYSPUNCH,
//         DISP=(MOD,CATLG,CATLG),
//         UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSIN DD *
UNLOAD TABLESPACE DSN00191.CUSTOMER
DELIMITED CHARDEL X'22' COLDEL X'2C' DECPT X'2E'
FROM TABLE CUSTOMER
(CID POSITION(*) INT,
INFO POSITION(*) XML)
UNICODE
/*

Figure 5.27 Unloading data using the UNLOAD utility


For maximum portability, you should specify UNICODE in the UNLOAD statement and use Unicode delimiter characters. If XML columns are not being unloaded in UTF-8 CCSID 1208, the unloaded column values are prefixed with a standard XML encoding declaration that specifies the encoding that is used.

If the table that you unload contains XML documents larger than 32KB, you need to use file reference variables (FRV) to unload the XML data to a separate partitioned data set (PDS) or hierarchical file system (HFS) file. Figure 5.28 shows unload to a PDS.

//SYSIN DD *
TEMPLATE XMLHERE
DSN 'USER123.&DB..&TS..UNLOAD' DSNTYPE(PDS) UNIT(SYSDA)
UNLOAD DATA
DELIMITED CHARDEL X'22' COLDEL X'2C' DECPT X'2E'
FROM TABLE CUSTOMER
(CID INT,
INFO VARCHAR(255) CLOBF XMLHERE)
UNICODE
/*

Figure 5.28 SYSIN cards for unloading XML documents larger than 32KB

Let's look at how the SYSIN cards in Figure 5.28 are constructed. The first two lines define a template with the name XMLHERE. The template declares the output naming pattern for the XML data files. The variables &DB and &TS take the value of the database and table space where the XML data is unloaded from. The parameter DSNTYPE specifies the type of volume for the unloaded data. If PDS is specified, then this limits the output dataset to a single volume. This is also the default if no DSNTYPE is specified. If the output should use multiple volumes, then you must specify HFS.

Next is the UNLOAD DATA statement. The line starting with DELIMITED defines how the data is to be delimited. The last line specifies that the XML documents that are unloaded from the XML column INFO are represented in the output data by file names of up to 255 characters. The type VARCHAR(255) defines the data type of the XML file names, not of the actual XML data. The keyword CLOBF tells UNLOAD to use File Reference Variables (FRV) and to store the XML documents as CLOB files. You can also specify BLOBF or DBCLOBF as possible output file formats. The template name XMLHERE tells UNLOAD to name the XML files according to the template that was defined in the first line. If you do not specify EBCDIC, ASCII, UNICODE, or CCSID, the encoding scheme of the source data is preserved.

If the output PDS that will contain the XML documents does not exist, the job will create it for you. The names of the output files are stored in the SYSREC data set as strings, as shown in Figure 5.29.

1000.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQCY)
1001.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQDR)
1002.USER123.DSN00201.XCUS0000.UNLOAD(B4C0WQEB)
...

Figure 5.29 Contents of SYSREC DS when unloading documents larger than 32KB


You can see that the value of the relational column cid is the first part of each record. Each of the output files pointed to by the remainder of the record contains an XML document. Note the random member name. If the dataset already contains members when the job is run, then the existing members are not deleted, but new members (again with random names) are added. But the dataset that SYSREC points to is overwritten with the new names.

The dataset pointed to by SYSPUNCH contains the statements that you need to put into a LOAD job, as shown in Figure 5.30. Such a LOAD job is discussed in section 5.5.

LOAD DATA INDDN SYSREC LOG NO RESUME YES
UNICODE CCSID(01208,01208,01208)
FORMAT DELIMITED COLDEL X'2C' CHARDEL X'22' DECPT X'2E'
SORTKEYS 3
INTO TABLE "USER123"."CUSTOMER"
("CID" POSITION(*) INTEGER,
"INFO" POSITION(*) VARCHAR CLOBF MIXED PRESERVE WHITESPACE)

Figure 5.30 Output SYSPUNCH DS when unloading records larger than 32KB

5.5 LOADING XML DATA IN DB2 FOR Z/OS

To load data into tables you use the LOAD utility, as shown in Figure 5.31. The data that was unloaded in Figure 5.27 is being loaded into a new table called customer2. This table has an INTEGER column and an XML column. Remember that only well-formed XML documents can be loaded into an XML column.

//LOAD01 EXEC DSNUPROC,PARM='ISC9,IANTEX',COND=(4,LT)
//SORTLIB DD DSN=SYS1.SORTLIB,DISP=SHR
//SORTOUT DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK01 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK02 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK03 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SORTWK04 DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//DSNTRACE DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//MYSYSREC DD DSN=USER123.UNLOAD.SYSREC,DISP=SHR
//SYSUT1 DD UNIT=SYSDA,SPACE=(4000,(50,50),,,ROUND)
//SYSERR DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSDISC DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSMAP DD UNIT=SYSDA,SPACE=(4000,(20,20),,,ROUND)
//SYSIN DD *
LOAD DATA INDDN (MYSYSREC) LOG NO RESUME YES
UNICODE CCSID(01208,01208,01208)
FORMAT DELIMITED COLDEL X'2C' CHARDEL X'22' DECPT X'2E'
SORTKEYS 3
INTO TABLE "USER123"."CUSTOMER2"
( "CID" POSITION(*) INTEGER
, "INFO" POSITION(*) XML PRESERVE WHITESPACE
)
/*

Figure 5.31 Example of a DB2 for z/OS LOAD job


Note:

• If you have unloaded the data previously, using the jobs shown in Figure 5.26 or Figure 5.27, then the SYSIN records are the contents of the SYSPUNCH DD card in these jobs.
• The PRESERVE WHITESPACE option has been specified for the XML column. It can be omitted, in which case the default behavior is not to preserve whitespace.
• If you omit the UNICODE CCSID line, then you get the following error: "RECORD (1) WILL BE DISCARDED DUE TO 'CID' CONVERSION ERROR". The Unicode input data for FORMAT DELIMITED must be UTF-8, which is CCSID 1208.
• The COLDEL parameter specifies the column delimiter that is used in the input file. The default value is a comma (,). For ASCII and UTF-8 data this is X'2C', and for EBCDIC data it is X'6B'. The CHARDEL parameter specifies the character string delimiter that is used in the input file. The default value is a double quotation mark ("). For ASCII and UTF-8 data this is X'22', and for EBCDIC data it is X'7F'. The DECPT parameter specifies the decimal point character that is used in the input file. The default value is a period (.), which is X'2E' in an ASCII or Unicode UTF-8 file.

When the XML data is loaded as a part of regular input records, specify XML as the input field type. The target column must be an XML column. The LOAD utility treats XML columns as variable-length data when loading XML directly from input records and expects a two-byte length field preceding the actual XML value.

The internal XML tables are loaded when the base table is loaded. You cannot specify the name of the internal XML table for load. You also cannot directly load the DocID column of the base table space or specify a default value for an XML column.

You can load XML documents from regular input records if the total input record length is less than 32KB. XML documents that don't fit into 32KB input records must be loaded from separate files.
To achieve this you need to replace the SYSIN cards in Figure 5.31 with those in Figure 5.30. The SYSREC input dataset is the dataset you specified in the UNLOAD job in Figure 5.27.

If you have documents larger than 32KB that come from a source other than a previous unload, you can load these into a table as follows. As an example let us use a document called DOC01, which is also the member name in a partitioned dataset called USER123.XMLLOAD. First you need to edit the dataset pointed to by SYSREC and add the relational value for the Cid column of the row, as shown next:

2000.USER123.XMLLOAD(DOC01)

You can now use exactly the same SYSIN cards as before to load this document into the table customer2.
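The delimiter byte values mentioned in the note above can be double-checked outside DB2. The following Python sketch is not part of any DB2 job; it assumes that Python's built-in cp037 codec is a reasonable stand-in for the EBCDIC code page in use (other EBCDIC code pages can differ for some characters).

```python
# Print the byte value of each default delimiter in UTF-8 and in EBCDIC.
# Assumption: code page 037 (Python codec "cp037") stands in for EBCDIC.
DELIMITERS = {"COLDEL": ",", "CHARDEL": '"', "DECPT": "."}


def delimiter_bytes(char):
    """Return (utf8_hex, ebcdic_hex) for a single delimiter character."""
    utf8_hex = char.encode("utf-8").hex().upper()
    ebcdic_hex = char.encode("cp037").hex().upper()
    return utf8_hex, ebcdic_hex


for keyword, char in DELIMITERS.items():
    utf8_hex, ebcdic_hex = delimiter_bytes(char)
    print(f"{keyword}: UTF-8 X'{utf8_hex}', EBCDIC X'{ebcdic_hex}'")
```

If your input file uses an encoding other than UTF-8, a check like this can help you pick matching COLDEL, CHARDEL, and DECPT values before you run the LOAD job.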


Note that DB2 for z/OS does not compress an XML table space during the LOAD process. If the XML table space is deﬁned with COMPRESS YES, then you have to run a REORG to compress the data.

5.6 VALIDATING XML DOCUMENTS DURING LOAD AND INSERT OPERATIONS

When you use the LOAD or IMPORT utilities in DB2 for Linux, UNIX, and Windows to move a large number of XML documents into a table, you can validate these documents against an XML Schema. Simply add the clause XMLVALIDATE USING SCHEMA to the LOAD or IMPORT command, as illustrated in Figure 5.32.

LOAD FROM c:\mydata\load_customer.txt OF DEL
  XML FROM c:\mydatadata
  XMLVALIDATE USING SCHEMA db2admin.custxsd
  INSERT INTO customer;

Figure 5.32 Performing XML Schema validation during LOAD

In DB2 for z/OS there is no XMLVALIDATE option for the LOAD utility but you can validate documents after loading them into a table. This and other validation topics are covered in Chapter 17, Validating XML Documents against XML Schemas.

5.7 SPLITTING LARGE XML DOCUMENTS INTO SMALLER DOCUMENTS

Most programmers find it convenient and efficient to work with an XML document granularity that matches the logical business objects of the application and the predominant granularity of access. For example, a single document per purchase order, per trade, per contract, per tax return, per customer, and so on is usually a good idea. Smaller documents can be manipulated more efficiently than larger ones. Also, indexed access and data retrieval are faster for smaller documents. However, for a bulk transfer of XML data outside the database, such as FTP, it is often not convenient to handle thousands or millions of separate documents. Therefore, it is common to receive large XML documents, often several hundred megabytes per file, which contain many repeating blocks that represent independent objects. Many external XML tools fail, or have severe problems, when you try to open such large XML documents, typically due to document object model (DOM) parsing and memory limitations. DB2 can ingest XML documents up to 2GB. Optionally, you can split them into smaller documents using the XMLTABLE function. The XMLTABLE function is discussed in detail in Chapter 7, Querying XML Data with SQL/XML. Here we show one simple example of how it can split up documents.


Assume you need to manage many XML documents with the following (simpliﬁed) structure: 1 Heather 12.34

You may receive many of these documents in one large file that has a root element accounts. The root element is required for the file to be a well-formed document. Otherwise it cannot be processed in DB2. The large file looks like this:

1 Heather 12.34 2 Helen 56.78 …

Your first step is to insert, import, or load this document into a staging table that has a column of type XML, such as this one:

CREATE TABLE staging(xcol XML)

When this table contains the large document in a single row, you can read the document from the staging table, split it into the individual account documents, and insert those into the following target table:

CREATE TABLE accounts(acc XML)

To split the large document, use one of the two INSERT statements in Figure 5.33. Both accomplish the same thing; that is, they produce one row (document) in the target table for each account element in the large input document. You must create an XML document node for each newly created account document, either with the SQL/XML function XMLDOCUMENT, or with the XQuery function document{}. The latter is only available in DB2 for Linux, UNIX, and Windows. The ﬁrst of the two statements in Figure 5.33 is suitable for DB2 for z/OS.


INSERT INTO accounts(acc)
  SELECT XMLDOCUMENT(x.val)
  FROM staging,
       XMLTABLE('$x/accounts/account' passing xcol as "x"
                COLUMNS val XML PATH '.') AS x;

INSERT INTO accounts(acc)
  SELECT x.val
  FROM staging,
       XMLTABLE('$XCOL/accounts/account'
                COLUMNS val XML PATH 'document{.}') AS x;

Figure 5.33 Splitting a large document
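The effect of this split can also be pictured outside DB2. The following Python sketch uses only the standard library and is not a substitute for the INSERT statements in Figure 5.33. The child element names id, name, and balance are assumptions made for illustration; only the element names accounts and account are given by the XPath in the figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: one large document whose root element "accounts"
# wraps many independent "account" elements (child names are assumed).
LARGE_DOC = """
<accounts>
  <account><id>1</id><name>Heather</name><balance>12.34</balance></account>
  <account><id>2</id><name>Helen</name><balance>56.78</balance></account>
</accounts>
"""


def split_accounts(large_xml):
    """Return one serialized XML document per account element."""
    root = ET.fromstring(large_xml)
    documents = []
    for account in root.findall("account"):
        account.tail = None  # drop the whitespace that followed the element
        documents.append(ET.tostring(account, encoding="unicode"))
    return documents


small_docs = split_accounts(LARGE_DOC)
for doc in small_docs:
    print(doc)
```

Each string in small_docs corresponds to one row that the XMLTABLE-based INSERT would produce in the target table. Unlike this sketch, DB2 performs the split inside the database and does not need to hold the result in application memory.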

After the insert operation, select the data from accounts to verify that the large input document has been split correctly (see Figure 5.34).

SELECT acc FROM accounts;

1 Heather 12.34
2 Helen 56.78

describe xquery db2-fn:xmlcolumn('CUSTOMER.INFO')

Column Information

Number of columns: 1

SQL type      Type length  Column name  Name length
------------  -----------  -----------  -----------
988 XML       0            INFO         4

db2 =>

Figure 6.19 Describing an XQuery

You can run the query in Figure 6.18 in the DB2 Command Line Processor (CLP) or any other interface, such as the Command Editor that's part of the DB2 Control Center, IBM Data Studio, or, for example, via JDBC from a Java application. When the XML type data is returned from the DB2 server to any such client it is automatically serialized; that is, converted from DB2's internal tree format to XML text. The CLP displays at most 4,000 bytes of XML text per row. Any XML column values shorter than this are padded with blanks. Any XML data beyond 4,000 bytes per row is truncated in the CLP display. To avoid truncation and to see the full XML output, you can use the DB2 EXPORT utility (see Chapter 5, Moving XML Data) or a tool such as IBM Data Studio.

The table and column name in the db2-fn:xmlcolumn() function must be enclosed in either single quotes or double quotes. They typically also need to be in uppercase. This is because DB2 table and column names default to uppercase, unless you use quotes in the CREATE TABLE statement to force a lowercase table or column name.

Now that you are familiar with the mechanics of running XPath in DB2, let's run the XPath expression previously shown in Figure 6.17. Simply append the path /customerinfo/phone to the db2-fn:xmlcolumn() function, as shown in Figure 6.20. The result is exactly the same as in Figure 6.17.

db2 => xquery db2-fn:xmlcolumn('CUSTOMER.INFO')/customerinfo/phone
416-555-3376

5 record(s) selected.
db2 =>

Figure 6.20 Executing the query from Figure 6.17 in the DB2 Command Line Processor


Remember that each step in a path expression produces a sequence of so-called context nodes that are input to the next step. In the same manner, the db2-fn:xmlcolumn() function produces a sequence of XML documents that are input to the first step of the XPath expression. Hence, the XPath /customerinfo/phone is evaluated once for each document in the table. The result items from all documents, in this case phone elements, are combined into a single sequence. Each item is returned to the client as a separate row.

DB2 also offers the function db2-fn:sqlquery(), which is similar to db2-fn:xmlcolumn(). While db2-fn:xmlcolumn() takes an XML column name as input and produces the sequence of all documents in that column as output, the function db2-fn:sqlquery() takes an SQL query as input and produces as output the sequence of documents that are returned by that SQL statement. This SQL query can be any query, even with joins and subselects and so on, as long as it returns a single column of type XML. Figure 6.21 is a simple example of a query that returns a sequence of documents that are a subset of the documents in the XML column info.

xquery db2-fn:sqlquery("SELECT info FROM customer WHERE id > 1003")

Figure 6.21 Producing a sequence of documents with an SQL query

The key difference between db2-fn:xmlcolumn() and db2-fn:sqlquery() is that db2-fn:xmlcolumn() takes all documents in an XML column as the input for your XPath expression, while db2-fn:sqlquery() allows you to use relational predicates and so on to pre-filter the set of documents that are input to the XPath query. The embedded SQL statement is parsed by DB2's SQL parser, which means that table and column names are automatically converted to uppercase. You can append any path expression to the db2-fn:sqlquery() function to further process the returned documents. In Figure 6.22, the XPath expression /customerinfo/phone is applied to the one XML document that is identified by the embedded SQL statement.

db2 => xquery db2-fn:sqlquery("select info from customer where id = 1003")/customerinfo/phone
905-555-7258
416-555-2937
905-555-8743

3 record(s) selected.
db2 =>

Figure 6.22 Using db2-fn:sqlquery in the DB2 Command Line Processor
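The difference between the two functions can be mimicked in a few lines of Python. This is only a sketch of the evaluation model described above, not of how DB2 executes the query; the table rows, ids, and phone numbers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows of a customer table: (id, XML document text).
ROWS = [
    (1002, "<customerinfo><phone>111-555-0001</phone></customerinfo>"),
    (1003, "<customerinfo><phone>111-555-0002</phone>"
           "<phone>111-555-0003</phone></customerinfo>"),
]


def phones(rows, keep=lambda row_id: True):
    """Apply the path /customerinfo/phone to every kept document and
    combine the matches from all documents into one flat sequence."""
    result = []
    for row_id, xml_text in rows:
        if not keep(row_id):  # relational pre-filter on the id column
            continue
        root = ET.fromstring(xml_text)  # root is the customerinfo element
        result.extend(phone.text for phone in root.findall("phone"))
    return result


# All documents in the column are input, as with db2-fn:xmlcolumn().
all_phones = phones(ROWS)
# Only pre-filtered documents are input, as with db2-fn:sqlquery().
some_phones = phones(ROWS, keep=lambda row_id: row_id == 1003)
print(all_phones)
print(some_phones)
```

In both calls the path is evaluated once per input document and the per-document results are concatenated, which is why each phone element comes back as a separate row in the CLP.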


You can run any XPath expression that you see in this chapter simply by appending it to the db2-fn:xmlcolumn() or db2-fn:sqlquery() functions and using the xquery keyword, as illustrated in the preceding figure. In the following sections we explain further features of the XPath language and provide more examples. All of them can be run in DB2 for Linux, UNIX, and Windows just like you see in Figure 6.20 and Figure 6.22.

6.6 WILDCARDS AND DOUBLE SLASHES

XPath allows the use of the * as a wildcard character to match any element name, and @* to match any attribute name. The XPath expression in Figure 6.23 uses the wildcard to return all elements that are immediate children of the assistant element. The assistant element occurs only in the second of the two documents and has two child elements, name and phone.

XPath:  /customerinfo/assistant/*
Output: Gopher Runner
        416-555-3426

Figure 6.23 Using a wildcard to select all child elements of assistant

The wildcard in the XPath expression in Figure 6.24 matches all elements that occur directly under customerinfo. These are the elements name, addr, phone and in the second document also assistant. The sequence of these elements is input to the last step of this XPath, /name. In other words, the XPath then tries to ﬁnd /customerinfo/name/name, /customerinfo/ addr/name, /customerinfo/phone/name, and /customerinfo/assistant/name. The ﬁrst three of these don’t exist and so only the assistant’s name is returned.

XPath:  /customerinfo/*/name
Output: Gopher Runner

Figure 6.24 Using a wildcard to match any child element of customerinfo
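The two wildcard expressions above can be tried outside DB2 with Python's standard library, which supports a small subset of XPath. The sample document below is a hypothetical reconstruction based on the element names used in this section, and the paths are written relative to the customerinfo root element because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; values other than the assistant's name and
# phone number are made up for illustration.
DOC = """
<customerinfo Cid="1004">
  <name>Matt Foreman</name>
  <addr country="Canada"><street>1596 Baseline</street><city>Toronto</city></addr>
  <phone type="work">905-555-4789</phone>
  <assistant>
    <name>Gopher Runner</name>
    <phone type="home">416-555-3426</phone>
  </assistant>
</customerinfo>
"""

root = ET.fromstring(DOC)  # root is the customerinfo element

# Counterpart of /customerinfo/assistant/* : every child of assistant.
assistant_children = [(e.tag, e.text) for e in root.findall("assistant/*")]

# Counterpart of /customerinfo/*/name : name elements one level below
# any child of customerinfo.
nested_names = [e.text for e in root.findall("*/name")]

print(assistant_children)
print(nested_names)
```

The customer's own name element is not in nested_names because it sits directly under customerinfo, one level above what the path */name addresses.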

The query in Figure 6.25 uses two wildcards, one to match any element at the second level of the document hierarchy and one to match any element at the third level. The ﬁrst wildcard matches name, addr, phone, and assistant, as in the previous example. The next wildcard then matches any child elements of these nodes. Only addr and assistant have child elements and all of those are returned. The last two elements in the result, name and phone, are children of assistant, which exists only for one of the two input documents. Customer phone elements are not included in the result, because they are at the second instead of the third level of the document. The XPath expression /*/*/* would return the same result from the sample data.

XPath:  /customerinfo/*/*
Output: 845 Kean Street
        Aurora
        Ontario
        N8X 7F8
        1596 Baseline
        Toronto
        Ontario
        M3Z 5H9
        Gopher Runner
        416-555-3426

Figure 6.25 Using wildcards to return any element on the third level of the document

While * matches any element name, @* matches any attribute. The XPath in Figure 6.26 is similar to the one in Figure 6.25, but it returns any attribute at the third level of the documents because it uses @* instead of * in the last step of the path expression. Additionally, the data() function is used to return just the value of each attribute node. The sample data contains two attributes on the third level of the document, /customerinfo/addr/@country and /customerinfo/phone/@type. The addr and phone elements are matched by the * in the second step of the XPath, and their attributes are matched by @* in the third step. Attributes of the assistant phone elements are not returned because they are at the fourth level.

XPath:  /customerinfo/*/data(@*)
Output: Canada
        work
        home
        cell
        Canada
        work
        home

Figure 6.26 Using wildcards to return any attribute on the third level of the document
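The level-by-level behavior of @* can be illustrated with a short Python sketch. ElementTree has no @* step or data() function, so the sketch walks the children of the root explicitly; this is a stand-in for the XPath, not an equivalent query engine. The sample document and its phone numbers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with attributes at several levels.
DOC = """
<customerinfo Cid="1004">
  <name>Matt Foreman</name>
  <addr country="Canada"><city>Toronto</city></addr>
  <phone type="work">905-555-4789</phone>
  <phone type="home">416-555-3376</phone>
  <assistant><phone type="home">416-555-3426</phone></assistant>
</customerinfo>
"""


def third_level_attribute_values(xml_text):
    """Collect attribute values of the direct children of the root,
    mirroring /customerinfo/*/data(@*)."""
    root = ET.fromstring(xml_text)
    values = []
    for child in root:  # second-level elements, matched by *
        values.extend(child.attrib.values())  # their attributes, matched by @*
    return values


attribute_values = third_level_attribute_values(DOC)
print(attribute_values)
```

The Cid attribute of customerinfo and the type attribute of the assistant's phone are both skipped, because neither belongs to a direct child of the root element.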

The examples clarify that a * is a wildcard for a tag name at a very speciﬁc level of the XML documents, and you need to use multiple wildcards to match arbitrary tags at multiple levels. Another XPath construct that makes queries more general is the double slash (//). You can use it to reach descendants at any level in a document tree. An example is shown in Figure 6.27. The difference between a single slash (/) and a double slash (//) is that a / navigates exactly one level further down in the document tree while a // navigates any number of levels down the tree. In other words, a / navigates to an immediate child node while a // navigates to all descendant nodes. Descendant nodes include child nodes, grandchild nodes, great-grandchild nodes, and so on.


The XPath expression in Figure 6.27 consists of two steps. The ﬁrst step navigates to the top-level element customerinfo. All customerinfo nodes are input (context) for the second step. The second step, //name, looks for name elements at any level in the document tree under a customerinfo node. It ﬁnds two name elements at the second level, /customerinfo/name, and one name element at the third level, /customerinfo/assistant/name.

XPath:  /customerinfo//name
Output: Robert Shoemaker
        Matt Foreman
        Gopher Runner

Figure 6.27 Selecting name elements at any level under customerinfo

Figure 6.27 shows some of the beneﬁts and some of the dangers of the //. A beneﬁt is that the // allows you to easily navigate to all occurrences of a certain element, even if that element occurs at multiple different levels of a document tree. Another beneﬁt can be that it allows you to ﬁnd a certain element in the documents even if you do not know its exact position and therefore are unable to write a fully qualiﬁed XPath. A danger of the // can be that it might select more data than you actually intended. If the goal of the query in Figure 6.27 was to retrieve customer names only, then the result leads you to believe that there are three customers and that Gopher Runner is one of them. This is incorrect because Gopher Runner is the assistant to Matt Foreman and not a customer himself. Another disadvantage of the // is that it doesn’t specify a direct path to the desired nodes. This causes an XPath processor, such as DB2, to search exhaustively through potentially large portions of a document. For example, the query in Figure 6.27 requires DB2 to navigate into the addr branch of each document and examine each child element of addr to determine whether its element name is name. A fully speciﬁed path without // avoids this overhead and yields better performance. The // can also be used at the beginning of a path expression, such as //name, which for the sample data returns the same result as the query in Figure 6.27. The XPath //* returns all elements from all input documents, because // navigates to any level of the document and * matches any element at each of those levels. Similarly //data(@*) returns all attribute values anywhere in the documents, and //text() returns all text nodes. Use such general expressions with caution.
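The trade-off between / and // can be seen in a small Python sketch that uses the standard library's limited XPath support. The sample document is hypothetical, and the element count is only a rough proxy for the extra navigation work; it says nothing about how DB2 actually evaluates the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with name elements at two different levels.
DOC = """
<customerinfo>
  <name>Matt Foreman</name>
  <addr><street>1596 Baseline</street><city>Toronto</city></addr>
  <assistant><name>Gopher Runner</name></assistant>
</customerinfo>
"""

root = ET.fromstring(DOC)  # root is the customerinfo element

# Counterpart of /customerinfo/name : immediate children only.
direct_names = [e.text for e in root.findall("name")]

# Counterpart of /customerinfo//name : descendants at any level.
all_names = [e.text for e in root.findall(".//name")]

# Number of elements a descendant search may have to look at.
elements_in_tree = sum(1 for _ in root.iter())

print(direct_names)
print(all_names)
print(elements_in_tree)
```

The descendant search also returns the assistant's name, which is exactly the kind of unintended match described above, and it has to consider every element in the tree to find it.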

6.7 XPATH PREDICATES

The preceding XPath examples always return all matching nodes from the input documents. In many cases it is desirable to use search conditions (predicates) to filter the data and only return selected items. In XPath, predicates are always enclosed in square brackets and can appear in any step of the path. In Figure 6.28, a predicate in square brackets is applied to the customerinfo element, which is the first step of the path. Roughly speaking, this query returns the name of the customer(s) whose Cid attribute is 1004. More precisely, the predicate checks for each customerinfo element in the input data, whether the element has an attribute by the name of Cid and whether the value of that attribute is 1004. If such a Cid attribute does not exist or if its value is not 1004, the respective customerinfo element is excluded from further consideration. Based on our input data, only the customerinfo element in the second document passes this test. This element is now the context for the next steps of the navigation, /name/text(), and the value Matt Foreman is returned.

XPath:  /customerinfo[@Cid=1004]/name/text()
Output: Matt Foreman

Figure 6.28 Numeric predicate in an XPath expression

Instead of the equality comparison you can also use less than (<), greater than (>), less than or equal (<=), greater than or equal (>=), and not equal (!=). More details on comparison operators are provided in section 6.8. In Figure 6.29, the predicate in square brackets is applied to the addr element to return the streets of those customers who live in Toronto. If an addr element has a child element city whose value is Toronto, the addr element is used as the context for the next navigation step, /street.

XPath:  /customerinfo/addr[city="Toronto"]/street
Output: 1596 Baseline

Figure 6.29 String predicate in an XPath expression

Remember that the value of an element is defined as the concatenation of all text nodes in the subtree underneath that element (see section 3.1, Understanding XML Document Trees). Since the city element has only a single text node, the predicates [city="Toronto"] and [city/text()="Toronto"] lead to the same result. Hence, in the vast majority of cases you do not need to use /text() in predicates. The relatively rare case in which it can be useful to use /text() in predicates is when the immediate children of an element are a mix of element and text nodes. Such elements are said to have mixed content (see section 3.1). If you want to return the city element instead of the street element, a possible XPath is /customerinfo/addr[city="Toronto"]/city. The city element is referenced once to evaluate the predicate and then a second time at the end of the path to return it.
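The filtering effect of these predicates can be reproduced with a short Python sketch. The two documents are hypothetical stand-ins for the sample data. ElementTree supports the child-value predicate [city='Toronto'] directly; the attribute test on the root element is written as a plain Python condition (a string comparison) because ElementTree cannot apply a predicate to the element it starts from.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the two sample documents.
DOCS = [
    ET.fromstring(
        '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
        "<addr><street>845 Kean Street</street><city>Aurora</city></addr>"
        "</customerinfo>"
    ),
    ET.fromstring(
        '<customerinfo Cid="1004"><name>Matt Foreman</name>'
        "<addr><street>1596 Baseline</street><city>Toronto</city></addr>"
        "</customerinfo>"
    ),
]

# Counterpart of /customerinfo/addr[city="Toronto"]/street, evaluated
# once per document and combined into a single sequence.
toronto_streets = [
    street.text
    for doc in DOCS
    for street in doc.findall("addr[city='Toronto']/street")
]

# Counterpart of /customerinfo[@Cid=1004]/name/text(), with the predicate
# expressed as a Python condition on the root element's attribute.
names_1004 = [doc.findtext("name") for doc in DOCS if doc.get("Cid") == "1004"]

print(toronto_streets)
print(names_1004)
```

In both cases the predicate only decides which elements survive as context; the step after the closing bracket then continues from the surviving element.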


NUMERIC VERSUS STRING COMPARISON

Note that the predicate [@Cid=1004] performs a numeric comparison while the predicate [@Cid="1004"], with double quotes around the literal value, performs a string comparison. The difference between numeric and string comparison can lead to different query results. For example, a string comparison would find that the string values "1E3" and "1000" are not equal. But, a numeric comparison would confirm that the numbers 1E3 and 1000 are equal because 1E3 is the exponential notation for 1000. Similarly, the string comparison "2" < "10" is false, but the numeric comparison 2 < 10 is true. Note also that the numeric comparison [@Cid=1004] fails with an error (SQL16061N) at runtime if a document is encountered where the value of the Cid attribute is not a number.
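The same contrast between string and numeric comparison exists in most programming languages. The following Python sketch is only an analogy: Python compares strings by code point and converts numbers with float(), which is not identical to the XPath comparison rules that DB2 applies, and the helper names are made up for this example.

```python
def string_equal(left, right):
    """Compare two values as strings, character by character."""
    return left == right


def numeric_equal(left, right):
    """Compare two values after converting both to numbers."""
    return float(left) == float(right)


def is_numeric(value):
    """Report whether a value can be converted to a number at all."""
    try:
        float(value)
    except ValueError:
        return False
    return True


print(string_equal("1E3", "1000"))   # compared as text
print(numeric_equal("1E3", "1000"))  # compared as numbers
print("2" < "10")                    # text ordering
print(2 < 10)                        # numeric ordering
print(is_numeric("abc"))             # conversion would fail
```

The failed conversion in the last line is the Python counterpart of the runtime error that a numeric predicate raises when it meets a non-numeric attribute value.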

A predicate expression within the square brackets can contain multiple steps to navigate to the element or attribute whose value you want to check. For example, say you want to return the name of all customers in Toronto. To develop this XPath expression from scratch, first start without the predicate and write down just the path to the element that you want to return:

/customerinfo/name

To restrict the result to customers in Toronto, a predicate on the city element is required. The city element is a child of the addr element, which in turn is a child of customerinfo, so this is where you need to apply the predicate:

/customerinfo[addr/city ="Toronto"]/name

The predicate [addr/city ="Toronto"] checks for each customerinfo element if it has a child element addr that has a child element city whose value is Toronto. The customerinfo nodes that fulfill this condition are then the input for the next step, /name. In other words, the XPath step right after the predicate is /name and it continues navigation based on the element before the predicate (customerinfo) and not based on any element inside the square brackets. This is illustrated in Figure 6.30, where this XPath expression is shown with two branches. The horizontal branch identifies the items that are to be returned (/customerinfo/name), and the branch in the dotted box is the predicate.

[Diagram: horizontal branch customerinfo, name; predicate branch in the dotted box addr, city = "Toronto"]

Figure 6.30 Visualization of an XPath with a predicate

One XPath can contain multiple predicates, as illustrated in Figure 6.31, which returns the street of the customer whose name is Matt Foreman and whose city is Toronto.

XPath: /customerinfo[name="Matt Foreman"]/addr[city="Toronto"]/street
Output: 1596 Baseline

Figure 6.31 XPath with two predicates

When writing such a query from scratch, proper placement of the predicates is sometimes not obvious if you are new to XPath. The recommendation is again to first write the XPath without any predicates and only navigate to the element that you want to return (street). This simpler XPath looks like this:

/customerinfo/addr/street

Now you can add filtering predicates for name and city. Since name is a child element of customerinfo, insert a pair of square brackets right after customerinfo for the predicate:

/customerinfo[name="Matt Foreman"]/addr/street

The city element is a child of addr, so the square brackets for the second predicate come right after addr in the path expression, and this completes the query in Figure 6.31:

/customerinfo[name="Matt Foreman"]/addr[city="Toronto"]/street

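Again, visualizing this query as a branching expression might be helpful (see Figure 6.32).

Figure 6.32 Visualization of an XPath with two predicates (diagram: the main branch runs from customerinfo to addr to street; the predicate name = "Matt Foreman" branches off customerinfo, and the predicate city = "Toronto" branches off addr)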

Note that a predicate expression in square brackets can contain a / or a // but typically does not start with one. Consider the following XPath expression as an example:

/customerinfo[/name="Matt Foreman"]/addr/street

This XPath returns the empty sequence because the predicate [/name="Matt Foreman"] does not use the current customerinfo element as context. That is, it does not look for name elements that are children of customerinfo. Instead, the / inside the square brackets causes it to restart navigation at the very top of each document, but there is no document in the sample data where the topmost element is name.

Figure 6.33 shows what can happen if you use // right at the beginning of a predicate expression in square brackets. The intention of this query was to return all cell phones by looking at type attributes anywhere under phone. However, the // inside the square brackets causes it to restart navigation at the very top of each document. Hence, the actual meaning of this query is: Retrieve all phone elements from a document if a type attribute with the value “cell” occurs anywhere in the document. In other words, return all phone elements if one of them is a cell phone.

XPath: /customerinfo/phone[//@type="cell"]
Output: 905-555-7258
        416-555-2937
        905-555-8743

Figure 6.33 Incorrect use of // in a predicate

If you know that type is an attribute of the phone element itself, you could simply remove the // from the beginning of the predicate expression. Otherwise you can use a dot to force the // to only search within the subtree (within the current context) of the respective phone element (see Figure 6.34). The current context is explained in more detail in section 6.10.

XPath: /customerinfo/phone[.//@type="cell"]
Output: 905-555-8743

Figure 6.34 Correct use of // in a predicate
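The difference between // and .// at the start of a predicate can be imitated in a few lines of Python. This is a rough sketch rather than an XPath engine: the document is a cut-down stand-in for one sample customer, the mapping of the work and home types to specific numbers is an assumption, and has_cell_type is a hypothetical helper that searches a given element and everything below it.

```python
import xml.etree.ElementTree as ET

# Stand-in for one sample document, reduced to its phone elements.
# The numbers and the "cell" type follow Figures 6.33 and 6.34; which
# number carries type "work" is an assumption made for this sketch.
doc = ET.fromstring("""
<customerinfo>
  <phone type="work">905-555-7258</phone>
  <phone type="home">416-555-2937</phone>
  <phone type="cell">905-555-8743</phone>
</customerinfo>
""")


def has_cell_type(start):
    """True if start or any element below it has type="cell"."""
    return any(el.get("type") == "cell" for el in start.iter())


phones = doc.findall("phone")

# phone[//@type="cell"]: the test restarts at the top of the document,
# so it gives the same answer for every phone element.
from_document_root = [p.text for p in phones if has_cell_type(doc)]

# phone[.//@type="cell"]: the test only looks at the current phone
# element and its own subtree.
from_current_phone = [p.text for p in phones if has_cell_type(p)]

print(from_document_root)
print(from_current_phone)
```

The first list contains every phone number because the condition does not depend on the phone element being tested; the second list contains only the phone whose own type is cell.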

Also note that the opening square bracket of a predicate can never follow immediately after a / or a //. For example, the XPath /customerinfo/[name="Matt Foreman"] would fail with an error (SQL16002N). A / starts a new step, which cannot begin with a predicate. A predicate always has to be preceded by a context node (such as an element name) to which it is applied.

And finally, look at Figure 6.35, which uses an equality comparison without square brackets. This is just a Boolean expression of the form A = B that returns either true or false. It is not a useful predicate to select specific parts of the customer data. In particular, this query does not return the customer whose name is Matt Foreman. The query examines a sequence of name elements and returns true if at least one of them is equal to Matt Foreman. This is called existential semantics and is explained in the next section.

XPath: /customerinfo/name="Matt Foreman"
Output: true

Figure 6.35 A Boolean expression, not a filtering predicate

6.8

EXISTENTIAL SEMANTICS

When you use XPath, existential semantics (also known as existential quantification) is applied automatically all the time. Roughly speaking, existential semantics means that the existence of at least one matching node is sufficient for a predicate to evaluate to true. Let’s look at the query in Figure 6.36 as an example. This query returns the name of those customers whose phone number is 416-555-2937. But, both of the input documents contain several occurrences of the phone element. Existential semantics means that the query in Figure 6.36 returns name elements that are children of customerinfo elements that contain at least one child element phone whose value is 416-555-2937. The existence of at least one matching phone element is sufficient to fulfill the predicate. Existential semantics is a useful concept for querying XML data, because it defines how to evaluate predicates on repeating elements (or more generally, on sequences of two or more items).

XPath: /customerinfo[phone="416-555-2937"]/name
Output: Robert Shoemaker

Figure 6.36 At least one phone element must match, not all of them

Figure 6.37 shows another example of existential semantics. It includes a predicate that contains nothing but the element name assistant. The predicate evaluates to true if this element exists at the indicated position in the document tree; that is, as a child of the customerinfo element. As a result, this query returns the name of those customers who have an assistant, no matter what the assistant name or phone number is. The mere existence of an assistant element is what this predicate is looking for. Such a predicate is called a structural predicate as opposed to a value predicate, which performs a value comparison.

XPath: /customerinfo[assistant]/name
Output: Matt Foreman

Figure 6.37 A structural predicate

Similarly, you can check for the existence of an attribute. The query in Figure 6.38 retrieves the names of all customers who have a country attribute in the addr element.

XPath: /customerinfo[addr/@country]/name
Output: Robert Shoemaker
        Matt Foreman

Figure 6.38 Return the name if a country attribute exists

Yet another example of existential semantics is illustrated in Figure 6.39 where the right side of the predicate is a sequence of two atomic values. This predicate is true if there is at least one value in this sequence that is equal to the value of the city element. If you are familiar with IN-list queries in SQL, this is how you can do the same in XPath.

XPath: /customerinfo[addr/city = ("Toronto","Aurora")]/name
Output: Robert Shoemaker
        Matt Foreman

Figure 6.39 Predicate is true if at least one of the values matches

What if a customer has several addresses so that addr/city evaluates to a sequence of multiple city elements? In this case, existential semantics defines that the predicate is true if at least one of these city elements is equal to at least one of the values on the right side.

Let’s look at the two sequences (1,2,3,4) and (7,8,2). The comparison (1,2,3,4) = (7,8,2) evaluates to true because there is at least one item in the first sequence that is equal to at least one item in the second sequence. This item is the number 2. What might seem counterintuitive at first is that the predicate (1,2,3,4) != (7,8,2) also evaluates to true! This is again due to existential semantics, because there is at least one item in the first sequence that is not equal to at least one item in the second sequence.

Figure 6.40 shows the corresponding behavior for the sample data. Remember that Robert Shoemaker lives in Aurora and Matt Foreman lives in Toronto (see Figure 6.7). The XPath in Figure 6.40 returns Robert Shoemaker’s name because his city (Aurora) is not equal to at least one item in the sequence on the right (Toronto). The same applies to Matt Foreman whose city (Toronto) is not equal to Aurora.

XPath: /customerinfo[addr/city != ("Toronto","Aurora")]/name
Output: Robert Shoemaker
        Matt Foreman

Figure 6.40 Predicate is true if at least one of the values does not match
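The behavior of = and != on sequences can be modeled with two small Python functions. This is a minimal sketch under the assumption that plain Python tuples stand in for XPath sequences; general_eq and general_ne are illustrative names, not DB2 or XPath functions.

```python
from itertools import product


def general_eq(left, right):
    """Model of XPath '=': true if some pair of items is equal."""
    return any(a == b for a, b in product(left, right))


def general_ne(left, right):
    """Model of XPath '!=': true if some pair of items is not equal."""
    return any(a != b for a, b in product(left, right))


print(general_eq((1, 2, 3, 4), (7, 8, 2)))  # the item 2 occurs on both sides
print(general_ne((1, 2, 3, 4), (7, 8, 2)))  # for example, 1 differs from 7

# One city per customer on the left, the two-item sequence on the right
for city in ("Aurora", "Toronto"):
    print(city, general_ne((city,), ("Toronto", "Aurora")))
```

Both functions quantify over pairs of items, which is why the two comparisons are not opposites of each other as soon as one side holds more than one item.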

The lesson here is that XPath’s existential semantics applies not only to equality predicates but also to range and inequality predicates, whose behavior is not immediately intuitive if the left side or the right side evaluates to a sequence of more than one item. In contrast, the predicate in Figure 6.41 involves sequences of exactly one item on either side of the != operator. The behavior is intuitive and only Robert Shoemaker’s name is returned because he is the only customer in our sample who does not live in Toronto.

XPath: /customerinfo[addr/city != "Toronto"]/name
Output: Robert Shoemaker

Figure 6.41 Not-equal predicate on single items

6.9

LOGICAL EXPRESSIONS WITH AND, OR, NOT()

Similarly to SQL, XPath allows you to build more complex predicates with and, or, and not(). While and and or are logical operators, not() is a function that reverses the Boolean value of its argument. XPath and XQuery are case-sensitive languages and all operators and functions have to be written in lowercase.

The query in Figure 6.42 uses the or operator to check whether there is an addr with a city element that has the value Toronto, or whether there is an addr with a city element whose value is Aurora. For the sample data, this returns the same result as in Figure 6.39. Note that when we say “if there is” or “if there exists” we are hinting at the fact that existential semantics is always at play.

XPath: /customerinfo[addr/city = "Toronto" or addr/city ="Aurora"]/name
Output: Robert Shoemaker
        Matt Foreman

Figure 6.42 Disjunction of predicates (or-’ing)

The and operator is used in Figure 6.43 to select the names of customers whose city is Aurora and whose country is Canada.

XPath: /customerinfo[addr/city = "Aurora" and addr/@country = "Canada"]/name
Output: Robert Shoemaker

Figure 6.43 Conjunction of predicates (and-’ing)

The predicate in Figure 6.43 checks whether there is an addr element with a city child that has the value Aurora, and whether there is also an addr element with a country attribute whose value is Canada. In this case, both conditions are fulfilled by one and the same addr element. In general, however, they could be fulfilled by two different addr elements; for example, if a customer had two addresses.

This leads to the next interesting example. You might write the query in Figure 6.44 to find a customer whose work phone number is 416-555-2937. Such a customer does not exist in our sample data, because 416-555-2937 is Robert Shoemaker’s home phone number, not his work phone number. The predicate restricts the value of the phone element to 416-555-2937, and the type attribute of the phone element to work. Still, the name Robert Shoemaker is returned. This is because existential semantics applies to both parts of the predicate. The first part of the predicate, phone = "416-555-2937", is true because there is a phone element whose value is 416-555-2937. The second part of the predicate, phone/@type = "work", is also true because there also is a phone element whose type is work. But, these two phone elements are not the same. The query result in Figure 6.44 is perfectly correct according to the existential semantics of XPath, but probably not what you wanted to achieve with this query.

XPath: /customerinfo[phone = "416-555-2937" and phone/@type = "work"]/name
Output: Robert Shoemaker

Figure 6.44 Two predicates matched by different phone elements!

To solve this issue you need to express the predicate such that both conditions are applied to the same phone element. One way of doing this is shown in Figure 6.45 where nested square brackets are used. The outer square brackets describe a predicate that is applied to the customerinfo elements. This predicate says that a customerinfo element should only be considered if a certain phone element exists among its children. The inner square brackets are used to further constrain these phone elements by applying a predicate to them. The inner predicate [text() = "416-555-2937" and @type = "work"] says that the text value of the phone element has to be 416-555-2937 and the type of the same phone element is work. Both parts of this inner predicate are always applied together to the same phone element. Since no such customer exists in our sample data, the correct result of the query is empty.

XPath: /customerinfo[phone[text() = "416-555-2937" and @type = "work"]]/name
Output: (empty)

Figure 6.45 Nested predicates
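The contrast between the predicate with two independent conditions and the nested predicate can be spelled out in Python. The sketch below assumes a reduced stand-in for Robert Shoemaker's document, with the work type assigned to 905-555-7258 for illustration, and simply restates the two predicates as explicit loops over the phone elements.

```python
import xml.etree.ElementTree as ET

# Reduced stand-in for Robert Shoemaker's document. The home number is taken
# from the text; attaching type "work" to 905-555-7258 is an assumption.
doc = ET.fromstring("""
<customerinfo>
  <name>Robert Shoemaker</name>
  <phone type="work">905-555-7258</phone>
  <phone type="home">416-555-2937</phone>
  <phone type="cell">905-555-8743</phone>
</customerinfo>
""")

phones = doc.findall("phone")
number = "416-555-2937"

# [phone = "416-555-2937" and phone/@type = "work"]
# Each condition may be satisfied by a different phone element.
separate_conditions = (
    any(p.text == number for p in phones)
    and any(p.get("type") == "work" for p in phones)
)

# [phone[text() = "416-555-2937" and @type = "work"]]
# Both conditions must hold for one and the same phone element.
same_element = any(
    p.text == number and p.get("type") == "work" for p in phones
)

print(separate_conditions, same_element)
```

The first expression uses two separate any() calls, one per condition, whereas the second uses a single any() whose body tests both conditions on the same element. That is the whole difference between the two XPath predicates.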

Figure 6.46 provides another example of the use of the or operator. It returns the names of the customers who have an assistant or a cell phone. Both of the customers are returned because one of them has a cell phone and the other has an assistant.

XPath: /customerinfo[assistant or phone/@type="cell"]/name
Output: Robert Shoemaker
        Matt Foreman

Figure 6.46 A structural predicate and a value predicate

The XPath expression in Figure 6.47 lists the names of those customers who don’t have an assistant. The not() function is used in the predicate to qualify the customerinfo elements that do not have a child element with the name assistant.

XPath: /customerinfo[not(assistant)]/name
Output: Robert Shoemaker

Figure 6.47 Checking for the non-existence of an element

Next, let’s look at the following pair of queries (see Figure 6.48 and Figure 6.49) to clarify the difference between using the not() function and the “not equal” comparison operator (!=). Due to existential semantics, the query in Figure 6.48 returns the names of both customers. This is because both of them have at least one phone number that is not equal to 416-555-2937. One such non-matching phone element is enough to fulﬁll the predicate, even if other phone elements exist that do match this number.

The query in Figure 6.49 returns a result that might be more desirable: the name of the customer who does not have any phone element with the value 416-555-2937. The equality predicate inside the not() function is subject to existential semantics; that is, at least one phone element with this specific number has to exist. The outcome of this test is then negated with the not() function. In other words, the two queries differ because

• The query in Figure 6.48 checks whether there is at least one phone that is not equal to 416-555-2937 (even if other phone elements are equal to this value).
• The query in Figure 6.49 checks whether there is not at least one phone that is equal to 416-555-2937 (that is, there is no phone that is equal to this value).
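The two bullet points translate almost word for word into Python. In this sketch a dictionary of phone lists stands in for the two sample documents; the numbers are the ones quoted in the figures of this chapter, while the dictionary itself and the variable names are illustrative only.

```python
# Phone numbers per customer, as quoted in the figures of this chapter.
phones_by_customer = {
    "Robert Shoemaker": ["905-555-7258", "416-555-2937", "905-555-8743"],
    "Matt Foreman": ["905-555-4789", "416-555-3376"],
}
number = "416-555-2937"

# [phone != "416-555-2937"]: at least one phone differs from the number.
some_phone_differs = [
    name
    for name, phones in phones_by_customer.items()
    if any(p != number for p in phones)
]

# [not(phone = "416-555-2937")]: no phone is equal to the number.
no_phone_matches = [
    name
    for name, phones in phones_by_customer.items()
    if not any(p == number for p in phones)
]

print(some_phone_differs)
print(no_phone_matches)
```

The position of the negation is what matters: inside the any() test it applies to each individual phone, outside it applies to the result of the whole existence check.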

XPath: /customerinfo[phone != "416-555-2937"]/name
Output: Robert Shoemaker
        Matt Foreman

Figure 6.48 Predicate is true if at least one phone element does not match

XPath: /customerinfo[not(phone = "416-555-2937")]/name
Output: Matt Foreman

Figure 6.49 Predicate is true if none of the phone elements match

6.10

THE CURRENT CONTEXT AND THE PARENT STEP

You probably know that in a file system the dot (.) denotes the current location in the file system, and two dots (..) refer to the parent directory. The same notation exists in XPath to refer to the current node when navigating a document tree, or to the parent of the current node. This is illustrated in Figure 6.50, which shows four versions of an XPath expression. All of them return the same result from our input data; that is, the name element of the customers who live in Aurora. For the discussion of these four XPath expressions you may want to refer to the document tree shown in section 3.1, Understanding XML Document Trees.

Also, remember that the node name right before the square brackets of a predicate determines the input to the predicate and to the step that immediately follows the predicate. For example, XPath (a) in Figure 6.50 first produces a sequence of customerinfo elements. For each of these customerinfo elements the predicate checks whether there is an addr element that has a child element city whose value is Aurora. If so, the respective customerinfo element is input to the final step, /name, which returns the child element name.

XPath (b) is different because the predicate is applied to addr, not to customerinfo. Hence, this XPath first produces a sequence of addr elements, which are input to the predicate. Any addr element that has a child element city with value Aurora is then input to the subsequent step after the predicate. Since we want to return name elements, we need to navigate from addr to name, which are siblings in our documents. Because an XML document tree has no direct links between siblings, we use the parent step (..) to go one level up in the tree to their common parent, and from there to name.

XPath (a): /customerinfo[addr/city = "Aurora"]/name
XPath (b): /customerinfo/addr[city = "Aurora"]/../name
XPath (c): /customerinfo/addr/city[. = "Aurora"]/../../name
XPath (d): /customerinfo/name[../addr/city = "Aurora"]
Output: Robert Shoemaker

Figure 6.50 Four different ways to write a predicate and return the name element
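Variants (b) and (c) can be tried out with Python's xml.etree.ElementTree module, which implements a small subset of XPath that includes the parent step and the [.='text'] predicate (the latter assumes Python 3.7 or later). The sketch below is not DB2: both customers are placed under an artificial customers wrapper element, the paths are written relative to that wrapper, and variants (a) and (d) are left out because ElementTree does not accept multi-step paths inside a predicate.

```python
import xml.etree.ElementTree as ET

# Both sample customers under an artificial <customers> wrapper element.
# The wrapper is an assumption of this sketch; in DB2 each customerinfo
# element is the top of its own document.
root = ET.fromstring("""
<customers>
  <customerinfo>
    <name>Robert Shoemaker</name>
    <addr><city>Aurora</city></addr>
  </customerinfo>
  <customerinfo>
    <name>Matt Foreman</name>
    <addr><city>Toronto</city></addr>
  </customerinfo>
</customers>
""")

# (b) predicate on addr, then one parent step back up to customerinfo
path_b = "customerinfo/addr[city='Aurora']/../name"

# (c) predicate on city, then two parent steps back up to customerinfo
path_c = "customerinfo/addr/city[.='Aurora']/../../name"

for path in (path_b, path_c):
    print(path, [n.text for n in root.findall(path)])
```

Each parent step undoes one level of downward navigation, which is why the path with the predicate on city needs two of them before it can continue to name.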

In XPath (c) the predicate in square brackets is applied to city, which means that this XPath first produces a sequence of city elements, which are used as input (as context nodes) to the predicate. The predicate [. = "Aurora"] uses the dot to refer to the current context, which in this case is always a city element. Any city element for which the predicate is true is then input (context) for the subsequent navigation after the predicate. If you want to return name elements, you need to navigate from city to name, which are in different branches of the document. Hence you need to navigate via the nearest common ancestor, which is customerinfo. Since city is a grandchild of customerinfo, you need to go two levels up in the tree (/../..) before you can reach the name element (/name).

XPath (d) is different from (a), (b), and (c) because there is no /name step after the predicate. Instead, XPath (d) first navigates from customerinfo to name to produce a sequence of name elements. The square brackets are applied to name, to filter the names that get returned. The predicate [../addr/city = "Aurora"] means that a name element is returned only if it has a parent that has a child element addr that has a child element city whose value is Aurora.

XPath (a) is the preferable path expression among the four options in Figure 6.50, because it avoids parent steps completely. Avoiding parent steps is good for performance and keeps queries easy to understand.

Figure 6.51 shows four more XPath expressions. All of them return empty results because their navigation doesn’t correspond to the structure of the sample data. The parent step in XPath (a) is incorrect for the sample data because it navigates from customerinfo to name with an intermediate parent step as if name were a sibling of customerinfo, which is not the case. XPath (b) tries to return name elements that are children of addr. But, no such name elements exist. Similarly, XPath (c) tries to return name elements that are children of the parent of city (that is, children of addr). Again, no such name elements exist. XPath (d) intends to return name elements that have a child element addr with a city whose value is Aurora. But, this predicate is always false for the sample data because addr is not a child of name.

XPath (a): /customerinfo[addr/city = "Aurora"]/../name
XPath (b): /customerinfo/addr[city = "Aurora"]/name
XPath (c): /customerinfo/addr/city[. = "Aurora"]/../name
XPath (d): /customerinfo/name[addr/city = "Aurora"]
Output: (empty)

Figure 6.51 Four different XPath expressions that don’t match the sample data

6.11

POSITIONAL PREDICATES

So far you have used value predicates and structural predicates. Value predicates compare an element or attribute to a literal value such as a string or a number. Structural predicates don’t look at values but at the structure of an XML document by checking for the existence of an element or attribute by name.

Positional predicates can be used to select nodes based on the order in which they appear in a document or, more generally, in a sequence. As shown in Figure 6.52, a positional predicate is simply an integer number in square brackets. Both documents in the sample data contain multiple phone elements, but this query only returns the first phone element from each document.

XPath: /customerinfo/phone[1]
Output: 905-555-7258
        905-555-4789

Figure 6.52 Positional predicate to select the first phone element

Similarly, the XPath in Figure 6.53 selects the third phone element under each customerinfo element. In the sample data, the customer Robert Shoemaker has three phone numbers but Matt Foreman has only two phones. Hence, the result only contains Robert’s third phone number and none of Matt’s phone numbers.

XPath: /customerinfo/phone[3]
Output: 905-555-8743

Figure 6.53 Positional predicate to select the third phone element

To obtain the last phone element from each document irrespective of the number of phone elements in any given document, use the function last() in the predicate. This function takes no arguments but serves as an index to the last item in a sequence (see Figure 6.54).

XPath: /customerinfo/phone[last()]
Output: 905-555-8743
        416-555-3376

Figure 6.54 Positional predicate to select the last phone element

Related to positional predicates is the function position(). It takes no arguments but returns the position of the context item in the sequence that is being processed. For example, the positional predicate [3] is the same as the predicate [position() = 3].
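Positional predicates of the forms [1], [3], and [last()] are also part of the XPath subset in Python's xml.etree.ElementTree module, so their effect can be checked quickly outside of DB2. As before, this is only a sketch: the customers wrapper element is artificial, the documents are reduced to their phone elements, and the position() function is not used because ElementTree does not support it.

```python
import xml.etree.ElementTree as ET

# Phone elements of the two sample customers under an artificial wrapper.
root = ET.fromstring("""
<customers>
  <customerinfo>
    <phone>905-555-7258</phone>
    <phone>416-555-2937</phone>
    <phone>905-555-8743</phone>
  </customerinfo>
  <customerinfo>
    <phone>905-555-4789</phone>
    <phone>416-555-3376</phone>
  </customerinfo>
</customers>
""")


def select_phones(predicate):
    """Return the phone numbers chosen by customerinfo/phone + predicate."""
    return [p.text for p in root.findall("customerinfo/phone" + predicate)]


for predicate in ("[1]", "[3]", "[last()]"):
    print(predicate, select_phones(predicate))
```

Note that the position is counted separately under each customerinfo element, so [3] yields nothing for a customer with only two phone elements.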

6.12

UNION AND CONSTRUCTION OF SEQUENCES

Most of the XPath examples so far have returned one type of element, such as phone numbers or names. Sometimes it is desirable to obtain multiple different elements or attributes from each document. This can be achieved with the union operator, which is either written as the union keyword or the pipe character: |. The XPath in Figure 6.55 uses the union operator in the last step of the XPath, to combine the street and city elements into a single sequence. The result contains four elements, street and city from each of the two customers in the sample data. You will later use SQL/XML to return the street and city in two separate columns, which can be a more desirable return format (see Chapter 7, Querying XML Data with SQL/XML).

XPath: /customerinfo/addr/(street|city)
Output: 845 Kean Street
        Aurora
        1596 Baseline
        Toronto

Figure 6.55 XPath with a union operator

The union of sequences is similar to the construction of sequences. The comma is a sequence constructor and in many cases it produces the same result as a union. For example, the XPath

/customerinfo/addr/(street,city)

returns the same result as the union in Figure 6.55. However, there are a couple of differences between union and construction of sequences. First, the comma operator allows you to construct sequences from atomic values. The | operator cannot take atomic values as input; it has to take sequences of element or attribute nodes as input. Second, the union removes duplicate nodes while the comma operator does not. The de-duplicating of the union is based on node identities, not on node values. This means that two elements are not necessarily considered duplicates just because they have the same element name and value. They are considered duplicates only if they are indeed the same element from the same document.
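The difference between identity-based and value-based duplicates can be made concrete with a short Python sketch. The functions comma and union are hypothetical models of the two operators, Python object identity stands in for node identity, and the sketch keeps nodes in order of first appearance, which is a simplification of what a real union returns.

```python
import xml.etree.ElementTree as ET

addr = ET.fromstring(
    "<addr><street>845 Kean Street</street><city>Aurora</city></addr>"
)
street, city = addr.find("street"), addr.find("city")

# A separate element that merely has the same name and value as city
city_copy = ET.fromstring("<city>Aurora</city>")


def comma(*sequences):
    """Model of sequence construction: concatenate and keep every item."""
    return [item for seq in sequences for item in seq]


def union(*sequences):
    """Model of union: drop repeated nodes, judged by identity, not value."""
    seen, result = set(), []
    for node in comma(*sequences):
        if id(node) not in seen:
            seen.add(id(node))
            result.append(node)
    return result


print(len(comma([street, city], [city])))       # city is kept twice
print(len(union([street, city], [city])))       # same node, so it is removed
print(len(union([street, city], [city_copy])))  # equal value, different node
```

The last call shows the point made above: an element with the same name and value is still a different node, so the union keeps it.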

In addition to the union operator there is also an intersect and an except operator. The intersect operator produces the nodes that occur in both sequences, and the except operator returns the nodes that are in the ﬁrst but not the second sequence.

6.13

XPATH FUNCTIONS

If you look back at Figure 6.1 at the beginning of this chapter, you see that XPath and XQuery share not only the same data model but also a common set of functions and operators. Throughout this chapter we have used some of these functions such as data(), string(), and not(). XPath and XQuery provide a large number of built-in functions. These include aggregate functions such as count() and sum(), string functions such as contains() and substring(), as well as numeric and other functions.

Figure 6.56, Figure 6.57, and Figure 6.58 provide examples of how to use functions in XPath expressions. The count() function returns the number of nodes produced by the expression that is provided as the function argument. Remember that Robert Shoemaker has three phone numbers and Matt Foreman has two. Other functions such as upper-case() and concat() behave in intuitive ways.

XPath: /customerinfo/count(phone)
Output: 3
        2

Figure 6.56 Return the number of phone elements per document

XPath: /customerinfo/upper-case(name)
Output: ROBERT SHOEMAKER
        MATT FOREMAN

Figure 6.57 Convert the customer names to upper case

XPath: /customerinfo/concat(name," - ", addr/city)
Output: Robert Shoemaker - Aurora
        Matt Foreman - Toronto

Figure 6.58 Concatenate the customer name and city

Section 8.7, XQuery Functions, contains a more extensive discussion of XPath and XQuery functions. Additionally, Appendix C provides pointers to the complete reference of all supported XPath and XQuery functions in DB2 for z/OS and DB2 for Linux, UNIX, and Windows.

6.14

GENERAL AND VALUE COMPARISONS

All the comparison operators that you have used so far (=, !=, <, >, <=, >=) are called general comparisons because they allow you to compare sequences of zero, one, or multiple items. This is based on existential semantics, as discussed in section 6.8. General comparisons provide a lot of flexibility and serve you well in the vast majority of cases.

There are also value comparison operators, such as eq (equal), lt (less than), le (less than or equal), gt (greater than), ge (greater or equal), and ne (not equal). Value comparisons are different from general comparisons because they can only compare single items. For example, /customerinfo/addr[city eq "Toronto"] is a valid value comparison as long as there is only one city element per addr. The query /customerinfo[phone eq "408-463-4963"] will fail at runtime because the sample data contains multiple phone elements per customerinfo. The DB2 error message is:

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "item()" is expected in the context.

The “( item(), item()+ )” is a regular expression that denotes a sequence of one item followed by one or more items. In total that’s two or more items. So this message is a very formal way of saying that there is a sequence of multiple items (that is, multiple phone elements) when only a single item was allowed. In many cases you can work around this error by writing the XPath expression as /customerinfo/phone[. eq "408-463-4963"] because the dot always refers to exactly one of the phone elements at a time. Another solution is to simply use a general comparison instead: /customerinfo[phone = "408-463-4963"].

Another issue with value comparisons is that they perform string comparisons by default. For example, the XPath /customerinfo/addr[pcode-zip lt 95123] will fail with the following message because it tries to use the lt operator with a numeric value (95123), instead of a string value (“95123”).

SQL16003N An expression of data type "xs:integer" cannot be used when the data type "xs:string" is expected in the context. SQLSTATE=10507

You can avoid this error by casting the pcode-zip element to xs:integer, such as [xs:integer(pcode-zip) lt 95123], or by using a general comparison instead.

Value comparisons have one property that general comparisons do not have, and that is transitivity. If x eq y and y eq z, then you can safely conclude that x eq z is also true. This is not possible with the existential semantics of general comparisons for sequences. For example, (1,2,3) = (3,4,5) and (3,4,5) = (5,6,7) are both true, but (1,2,3) = (5,6,7) is false because there is no item in (1,2,3) that is equal to any item in (5,6,7).
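Both the single-item restriction and the lack of transitivity can be checked with a small Python model. This is a sketch under simplifying assumptions: tuples stand in for XPath sequences, general_eq and value_eq are made-up names, a Python TypeError stands in for the DB2 error, and None stands in for the empty sequence that a value comparison yields when an operand is empty.

```python
def general_eq(left, right):
    """Model of '=': some item on the left equals some item on the right."""
    return any(a == b for a in left for b in right)


def value_eq(left, right):
    """Model of 'eq': each operand may hold at most one item."""
    if len(left) > 1 or len(right) > 1:
        raise TypeError("eq expects a single item on each side")
    if not left or not right:
        return None  # stand-in for the empty sequence
    return left[0] == right[0]


# General comparisons are not transitive for sequences.
x, y, z = (1, 2, 3), (3, 4, 5), (5, 6, 7)
print(general_eq(x, y), general_eq(y, z), general_eq(x, z))

# A value comparison rejects a sequence of several phone numbers.
try:
    value_eq(("905-555-7258", "416-555-2937"), ("408-463-4963",))
except TypeError as err:
    print("error:", err)

# Applied to one phone at a time, as with the dot in phone[. eq "..."]
print([value_eq((p,), ("408-463-4963",)) for p in ("905-555-7258",)])
```

The last line corresponds to the workaround with the dot: each call sees exactly one item, so the single-item restriction is never violated.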

In summary, the use of value comparisons opens up various opportunities for errors but in most cases provides little gain. Most applications do not require transitivity and are well-served with general comparisons. One potential beneﬁt of value comparisons is that you can force errors if you want to be alerted when data types or element occurrences are different than what you expect.

6.15

XPATH AXES AND UNABBREVIATED SYNTAX

We have introduced XPath through a series of practical examples. In a more formal introduction you might read about XPath axes. An axis is the direction of movement when navigating through a document. DB2 supports the child axis, the descendant axis, the attribute axis, the self axis, the parent axis, and the descendant-or-self axis. We have used all of these axes in the examples in the previous sections of this chapter. For example, the path /customerinfo/addr/@country uses the child axis to navigate from customerinfo to its child element addr, and the attribute axis to navigate from addr to its attribute country.

All XPath examples in this book use the so-called abbreviated XPath syntax, because it is simple, easy to understand, and recommended. XPath also offers an unabbreviated syntax, which means that the axes are spelled out explicitly in each step of an XPath. This is rarely used. For example:

Abbreviated: /customerinfo/addr/@country
Unabbreviated: /child::customerinfo/child::addr/attribute::country

Abbreviated: /customerinfo//phone
Unabbreviated: /child::customerinfo/descendant-or-self::node()/child::phone

In a nutshell, the unabbreviated XPath syntax is verbose, clumsy, and not used much in practice. We recommend that you do not use it. We have explained it here merely so that you recognize it if it ever crosses your path (no pun intended).

6.16

SUMMARY

XPath is the fundamental language for traversing XML documents, evaluating XML predicates, and retrieving XML values. A thorough understanding of XPath is a prerequisite for querying XML data in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Both SQL/XML and XQuery involve XPath.

Understanding XPath begins with understanding the XQuery and XPath data model. This data model is inherently different from the relational model. The better you understand the XQuery data model the easier it is for you to write XML queries. Every value in the XQuery and XPath data model is a sequence of zero, one, or multiple items. An item is either an atomic value or a node. Commonly used nodes include document nodes, element nodes, attribute nodes, and text nodes. Element nodes can include child nodes to form hierarchies of nodes, such as XML documents. Hence, a sequence of zero, one, or multiple XML documents is a value in the XQuery and XPath data model. A sequence of individual elements, a sequence of integer numbers, and so on are also values in the data model. Every XQuery or XPath query takes a value of this data model as input and produces another value of the data model as output.

Most commonly an XPath expression consists of one or multiple steps, separated by a slash (/), where each step is an element name or wildcard. This allows you to navigate into an XML document tree to select specific elements. If you want to select attribute nodes then the last step in a path must be an attribute name that’s preceded by the @ sign. Since an XML document can contain elements that occur multiple times, a single XPath expression may select multiple nodes. At each step an XPath can contain a predicate to restrict the search in the document. XPath predicates must be enclosed in square brackets.

The evaluation of XPath expressions and predicates is always based on existential semantics. Roughly speaking, existential semantics means that the existence of at least one matching item is sufficient for a predicate to evaluate to true. This is of particular importance when you query XML documents with repeating elements. Repeating XML elements and existential semantics are some of the most profound differences between the XML world and relational world. In the following chapters you learn how to use XPath in SQL/XML and XQuery.

CHAPTER 7

Querying XML Data with SQL/XML

The SQL language standard includes a variety of functions and features to process XML data. This functionality is commonly referred to as SQL/XML. The SQL/XML functions that allow you to embed XPath and XQuery expressions in SQL are of particular interest. These functions enable you to use familiar SQL statements enriched with XPath expressions to query XML data in a DB2 database. They also facilitate the simultaneous processing of XML and relational data in the same query. This marriage of two worlds, XML and relational, is extremely powerful and versatile.

Although SQL/XML allows the integration of SQL and XQuery, this chapter focuses on the integration of SQL and XPath, which is supported in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows. The discussion of SQL/XML in this chapter assumes that you have a good understanding of XPath (see Chapter 6, Querying XML Data: Introduction and XPath). The examples in this chapter also use the same two sample documents that were used throughout Chapter 6. Please refer to Figure 6.7 in section 6.3, Sample Data for XPath, SQL/XML, and XQuery. All examples are based on the following customer table:

    CREATE TABLE customer(id INTEGER, info XML)

We assume that this table contains two rows with values 1003 and 1004 in the id column, and the two documents from Figure 6.7 in the XML column info. The remainder of this chapter is structured as follows:

• An overview of SQL/XML is given in section 7.1.
• The core SQL/XML functionality for extracting selected information from XML documents and defining XML predicates is covered in sections 7.2, 7.3, and 7.4.
• Common mistakes with SQL/XML predicates are highlighted in section 7.5.
• Parameter markers, dynamically computed XPath, sorting of XML data, and handling of binary data are discussed in sections 7.6 through 7.9.

7.1 OVERVIEW OF SQL/XML

The term SQL/XML refers to the XML-specific features and functions in the SQL:2003 and SQL:2006 standards. SQL/XML defines the following:

• The XML data type, which is a regular SQL type just like INTEGER or CHAR. SQL/XML defines the semantics of this type, not its storage format.
• Functions that convert XML type values to and from non-XML data types, such as CHAR, VARCHAR, CLOB, and others. These functions are XMLSERIALIZE, XMLPARSE, and XMLCAST.
• The function XMLVALIDATE for XML Schema validation and the predicate IS VALIDATED, which checks the validation status of an XML document or fragment.
• XML publishing functions, also sometimes called constructor functions, such as XMLELEMENT, XMLATTRIBUTES, and XMLAGG, which allow you to construct new XML documents or fragments. The input data for such XML construction can come from relational columns, from XML columns, or both. This topic is covered in Chapter 10, Producing XML from Relational Data.
• Functions to embed XPath and XQuery in SQL statements. These functions are XMLQUERY, XMLTABLE, and the XMLEXISTS predicate.

All of these SQL/XML functions are supported in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. In this chapter we focus on the following:

• XMLQUERY: A scalar function that is typically used in the SELECT clause of an SQL query to extract XML fragments or values from an XML document.
• XMLTABLE: A table function that is used in the FROM clause of an SQL statement. It reads one or multiple values from an XML document and returns them as a set of rows.
• XMLEXISTS: A predicate that is commonly used in the WHERE clause of an SQL statement to express predicates over XML data.
• XMLCAST: A function that converts individual XML values to SQL data types.

Now, let’s turn to examples to see how these functions work.


7.2 RETRIEVING XML DOCUMENTS OR DOCUMENT FRAGMENTS WITH XMLQUERY

The simplest way of retrieving XML data with SQL is to include an XML column name in the SELECT list of an SQL query. For example, the SQL statement in Figure 7.1 returns a single column of type XML (info) and two rows, one row for each of our two sample documents in the customer table. Below the SQL statement in Figure 7.1 you see a corresponding XQuery that returns the same result.

    --SQL:
    SELECT info FROM customer;

    --XQuery:
    xquery db2-fn:xmlcolumn('CUSTOMER.INFO');

Figure 7.1 Retrieve all documents from the table

You can extend the SQL query in Figure 7.1 with other features of the SQL language, such as a WHERE clause to select only specific rows (documents) from the table. This is shown in Figure 7.2, together with an equivalent XQuery for comparison.

    --SQL:
    SELECT info FROM customer WHERE id = 1003;

    --XQuery:
    xquery db2-fn:sqlquery('SELECT info FROM customer WHERE id = 1003');

Figure 7.2 Retrieve selected documents from the table

In many situations it is desirable not to retrieve full documents from the database, but just specific XML elements, attributes, or fragments that are of interest. For example, if you only need to retrieve the customer names, you can use the XMLQUERY function in the SELECT clause to extract just that element (see Figure 7.3). The argument of the XMLQUERY function can be any XQuery or XPath expression. This expression needs to know which column to operate on, because a table could have multiple XML columns. The solution is to prefix the XPath with $INFO, a reference to the XML column in our sample table. This reference has to be in uppercase and must start with the $ sign (see section 7.2.1 for details).

The SQL/XML statement in Figure 7.3 uses SQL as the top-level language and has an embedded XPath expression. Below it you see a corresponding XQuery that executes the same XPath expression without the use of any SQL. The query result and the performance are the same. In particular, note that the return type of the XMLQUERY function is always XML. We will later discuss cases where SQL/XML can have advantages over XQuery and vice versa.

    --SQL/XML:
    SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer;

    --XQuery:
    xquery db2-fn:xmlcolumn('CUSTOMER.INFO')/customerinfo/name;

    --Output:
    <name>Robert Shoemaker</name>
    <name>Matt Foreman</name>

    2 record(s) selected.

Figure 7.3 Extracting one element from each document

The XMLQUERY function in Figure 7.3 is a scalar function, which means that it takes one value as input and produces one value as output. The XMLQUERY function is applied to one row at a time and so its input value is always the XML document of the current row. It does not normally process XML documents from multiple rows at the same time. Its output value is the result of the XPath expression applied to the current document. This result is always a sequence of zero, one, or more items. Such a sequence represents a single value (instance) of the XQuery Data Model.

7.2.1 Referencing XML Columns in SQL/XML Functions

Figure 7.3 shows only one of three ways in which the XML column can be referenced inside the XMLQUERY function. Here are all three ways in more detail:

• Direct reference of the XML column name as $INFO. This $INFO is an XQuery variable that is implicitly bound to an XML column of the same name. This is only supported in DB2 for Linux, UNIX, and Windows version 9.5 and higher. It only works if the XML column name is unique across all tables that are referenced in the FROM clause. For brevity we will use this notation in most of the examples in this chapter.

    SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer;

• Explicit assignment of the XML column name to an alias of your choice, which is then used as the context at the beginning of the XPath expression. This assignment is done in the passing clause of the XMLQUERY function. It also allows you to qualify the column name with its table name (passing customer.info AS "i") to avoid ambiguity. The variable name $i has to be unique within each SQL/XML function, not across all functions. You will later see that this passing clause also allows you to pass parameter markers or expressions into the embedded XQuery. This is supported since version 9 of DB2 for z/OS and DB2 for Linux, UNIX, and Windows.

    SELECT XMLQUERY('$i/customerinfo/name' passing info as "i") FROM customer;

    -- query with two tables, both have an XML column "info":
    SELECT XMLQUERY('$i/customerinfo/name' passing c1.info as "i"),
           XMLQUERY('$i/customerinfo/name' passing c2.info as "i")
    FROM customer c1, customer2 c2;

• No XQuery variable at the beginning of the XPath expression. Instead, the XML column name is identified in the passing clause without assignment to a variable. This is only supported in DB2 for z/OS.

    SELECT XMLQUERY('/customerinfo/name' passing info) FROM customer;

7.2.2 Retrieving Element Values Without XML Tags

There are several ways in which you can return the customer names without the element tags around them. One option is to use /text() in the XPath expression to only return the text node of the name element, as in Figure 7.4 (a). The column in the query result set is still of type XML. Alternatively, you can wrap the function XMLCAST() around the XMLQUERY function to convert the XML result to a non-XML type, as in Figure 7.4 (b). XMLCAST() automatically removes the tags from the returned elements. The output is the same as from Figure 7.4 (a), except that the return type is VARCHAR(25) instead of XML.

    --(a) SQL/XML:
    SELECT XMLQUERY('$INFO/customerinfo/name/text()') FROM customer;

    --(b) SQL/XML:
    SELECT XMLCAST( XMLQUERY('$INFO/customerinfo/name') AS VARCHAR(25))
    FROM customer;

    --Output:
    Robert Shoemaker
    Matt Foreman

    2 record(s) selected.

Figure 7.4 Returning element values without tags

A common requirement is to retrieve multiple values from a document, such as the customers’ street and city, and to return them in separate columns of the same result row. Separate columns can be produced by using multiple XMLQUERY functions in the SELECT clause (see Figure 7.5). The same can be achieved with the XMLTABLE function, which is discussed later. Figure 7.5 also shows that you can return a mix of relational columns and XML values.

    SELECT id,
           XMLQUERY('$INFO/customerinfo/addr/street/text()'),
           XMLQUERY('$INFO/customerinfo/addr/city/text()')
    FROM customer;

    1003   845 Kean Street   Aurora
    1004   1596 Baseline     Toronto

    2 record(s) selected.

Figure 7.5 Returning multiple element values in separate columns

7.2.3 Retrieving Repeating Elements with XMLQUERY

The SQL/XML query in Figure 7.6 uses the path expression /customerinfo/phone, which you know returns multiple elements from each of the two input documents. This SELECT statement produces one result row for each of the two input rows. Each result row contains the sequence of phone numbers from the corresponding input document. Each of these two sequences is returned as a string, which the consuming application then needs to break down. However, such a sequence of two or more phone elements is not a well-formed XML document, because a single common root element is missing. Hence, if your application uses an XML parser to process this non-well-formed query result, it will fail with an error.

    SELECT id, XMLQUERY('$INFO/customerinfo/phone') FROM customer;

    1003   905-555-7258416-555-2937905-555-8743
    1004   905-555-4789416-555-3376

    2 record(s) selected.

Figure 7.6 Returning a sequence of elements from each document
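To see why a consuming application can stumble over such a result, consider the following sketch. It uses Python's standard XML parser purely as a stand-in for whatever parser your application uses. The type attributes and the wrapper element name phones are assumptions made for illustration; only the phone numbers are taken from the sample output.

```python
import xml.etree.ElementTree as ET

# A sequence of sibling elements, as it may be returned for a repeating element.
# The type attribute values are hypothetical.
sequence = (
    '<phone type="work">905-555-4789</phone>'
    '<phone type="home">416-555-3376</phone>'
)

def is_well_formed(text):
    """Return True if text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(sequence))                             # False
print(is_well_formed("<phones>" + sequence + "</phones>"))  # True
```

The first check fails because the two phone elements have no common root. Wrapping the sequence in a single element makes it parseable again.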

Figure 7.7 shows the same query with /text(), and you see that the result values in each sequence are simply concatenated.

    SELECT id, XMLQUERY('$INFO/customerinfo/phone/text()') FROM customer;

    1003   905-555-7258416-555-2937905-555-8743
    1004   905-555-4789416-555-3376

    2 record(s) selected.

Figure 7.7 Returning a sequence of text nodes from each document

The conclusion is that the XMLQUERY function is usually a poor fit for returning repeating elements as separate values. Use the XMLTABLE function instead, which is explained in the next section.

7.3 RETRIEVING XML VALUES IN RELATIONAL FORMAT WITH XMLTABLE

The XMLTABLE function is very versatile and one of the most powerful SQL/XML functions. Let’s start with some simple examples of the XMLTABLE function and then get back to returning the repeating phone elements in a more suitable format.

7.3.1 Generating Rows and Columns from XML Data

The query in Figure 7.8 uses the XMLTABLE function in the FROM clause. The XMLTABLE function references the info column and is therefore implicitly joined with the table customer.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custID   INTEGER     PATH '@Cid',
             custname VARCHAR(20) PATH 'name',
             street   VARCHAR(20) PATH 'addr/street',
             city     VARCHAR(16) PATH 'addr/city') AS T;

    CUSTID CUSTNAME             STREET               CITY
    ------ -------------------- -------------------- ------------
    1003   Robert Shoemaker     845 Kean Street      Aurora
    1004   Matt Foreman         1596 Baseline        Toronto

    2 record(s) selected.

Figure 7.8 Using XMLTABLE to return XML values in relational columns

In DB2 for z/OS the XMLTABLE function must contain a PASSING clause to define the reference to the XML column, like this:

    XMLTABLE('$i/customerinfo' PASSING info AS "i"


The XMLTABLE function contains one row-generating XQuery expression and, in the COLUMNS clause, multiple column-generating expressions. The row-generating expression is the XPath $INFO/customerinfo. It is applied to each XML document in the XML column and, in general, can produce zero, one, or multiple rows per document. In our sample data it produces one customerinfo element (fragment) per document. The output of the XMLTABLE function contains one row for each of these customerinfo elements. The number of elements produced by the row-generating XQuery expression determines the number of rows produced by the XMLTABLE function.

The COLUMNS clause transforms XML data into relational format. Each of the entries in this clause defines a column with a column name and an SQL data type. In Figure 7.8, the returned rows have four columns named custID, custname, street, and city. The values for each column are extracted from the customerinfo fragments that are produced by the row-generating expression, and then cast to the SQL data types. For example, the path addr/city is applied to each customerinfo element to obtain the value for the column city.

The row-generating expression provides the context for the column-generating expressions. This means that the column-generating expressions are not absolute paths, but relative to the row-generating expression. You can typically append the column-generating expressions to the row-generating expression to get an intuitive idea of what a given XMLTABLE function returns in its columns.

The result set of the XMLTABLE query can be treated like any SQL table. You can query and manipulate it much like you use regular row sets or views. The column definitions in the COLUMNS clause can use any SQL data type, such as INTEGER, DECIMAL, CHAR, DATE, and so on. If an extracted XML value cannot be cast to the assigned SQL type, the query fails with an error message.
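The interplay of the row-generating and column-generating expressions can be sketched outside the database. The following Python code is only an emulation built on the standard library, not DB2's implementation. The function xml_table, the wrapper element document, and the simplified stand-in documents are hypothetical; the values are the ones shown in Figure 7.8.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two sample documents.
docs = [
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr><street>845 Kean Street</street><city>Aurora</city></addr></customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name>'
    '<addr><street>1596 Baseline</street><city>Toronto</city></addr></customerinfo>',
]

def xml_table(documents, row_path, columns):
    """Yield one tuple per node selected by row_path.

    Each column is a (path, cast) pair; the path is evaluated relative
    to the current row node, and a missing value becomes None.
    """
    for text in documents:
        # Wrap the root so that row_path can select the root element itself.
        doc_node = ET.Element("document")
        doc_node.append(ET.fromstring(text))
        for row_node in doc_node.findall(row_path):
            row = []
            for path, cast in columns:
                if path.startswith("@"):
                    raw = row_node.get(path[1:])   # attribute of the row node
                else:
                    raw = row_node.findtext(path)  # text of a descendant element
                row.append(None if raw is None else cast(raw))
            yield tuple(row)

columns = [("@Cid", int), ("name", str), ("addr/street", str), ("addr/city", str)]
for row in xml_table(docs, "customerinfo", columns):
    print(row)
```

The number of nodes selected by row_path decides how many tuples are produced, and every column path is resolved relative to the current row node, which mirrors the roles of the two kinds of expressions described above.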
DB2 for Linux, UNIX, and Windows also allows you to use the db2-fn:xmlcolumn() or db2-fn:sqlquery() functions in the row-generating expression of the XMLTABLE function (see Figure 7.9). In this case you omit the table name customer from the FROM clause. The query result is the same as in Figure 7.8. (This is not available in DB2 for z/OS.)

    SELECT T.*
    FROM XMLTABLE('db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo'
           COLUMNS
             custID   INTEGER     PATH '@Cid',
             custname VARCHAR(20) PATH 'name',
             street   VARCHAR(20) PATH 'addr/street',
             city     VARCHAR(16) PATH 'addr/city') AS T;

Figure 7.9 Alternative syntax in DB2 for Linux, UNIX, and Windows

7.3.2 Dealing with Missing Elements

XML data can contain optional elements that are not present in all documents. For example, in our sample data you can see that Robert Shoemaker does not have an assistant element. What happens if the optional element assistant is referenced in the row-generating expression or in a column-generating expression? Let’s look at these two cases separately.

In Figure 7.10 the optional assistant element is referenced in the row-generating expression of the XMLTABLE function. The query seeks to return the name and phone number of all assistants in our customer data. Since the XMLTABLE function returns exactly one row for each node that is produced by the row-generating expression, it does not return any rows for the documents that do not contain an assistant element. Therefore, the query in Figure 7.10 returns the name and phone number of Matt Foreman’s assistant, but no information from Robert Shoemaker’s XML document where no assistant element is present. We will revisit this situation at the end of section 7.3 in a more complex scenario.

    SELECT T.*
    FROM customer,
         XMLTABLE('$i/customerinfo/assistant' PASSING info AS "i"
           COLUMNS
             a_name  VARCHAR(20) PATH 'name',
             a_phone VARCHAR(20) PATH 'phone') AS T;

    A_NAME               A_PHONE
    -------------------- --------------------
    Gopher Runner        416-555-3426

    1 record(s) selected.

Figure 7.10 Optional element in the row-generating expression

In Figure 7.11 the optional assistant element is referenced in a column-generating expression of the XMLTABLE function. This query intends to return the customer name and the assistant name from each document. For each document where the assistant element does not exist, the column expression assistant/name produces an empty sequence, which is automatically converted to a NULL value.

    SELECT T.*
    FROM customer,
         XMLTABLE('$i/customerinfo' PASSING info AS "i"
           COLUMNS
             c_name VARCHAR(20) PATH 'name',
             a_name VARCHAR(20) PATH 'assistant/name') AS T;

    C_NAME               A_NAME
    -------------------- --------------------
    Robert Shoemaker     NULL
    Matt Foreman         Gopher Runner

    2 record(s) selected.

Figure 7.11 Optional element in a column-generating expression
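The difference between the two cases can be reproduced with a few lines of Python. This is a sketch with simplified stand-in documents, not DB2 behavior captured from a real system; None plays the role of the SQL NULL value.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins: only the second customer has an assistant element.
docs = [
    "<customerinfo><name>Robert Shoemaker</name></customerinfo>",
    "<customerinfo><name>Matt Foreman</name>"
    "<assistant><name>Gopher Runner</name></assistant></customerinfo>",
]
roots = [ET.fromstring(d) for d in docs]

# Case 1: the optional element is part of the row-generating path.
# A document without an assistant contributes no row at all.
rows_by_assistant = [
    a.findtext("name") for r in roots for a in r.findall("assistant")
]

# Case 2: the optional element is part of a column-generating path.
# Every customerinfo yields a row; a missing value becomes None.
rows_by_customer = [
    (r.findtext("name"), r.findtext("assistant/name")) for r in roots
]

print(rows_by_assistant)  # ['Gopher Runner']
print(rows_by_customer)   # [('Robert Shoemaker', None), ('Matt Foreman', 'Gopher Runner')]
```

Which of the two shapes you want depends on whether a customer without an assistant should still appear in the result.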

If you prefer to generate a default value for missing elements instead of NULL values, use the default clause to define a default value other than NULL. This is done in Figure 7.12.

    SELECT T.*
    FROM customer,
         XMLTABLE('$i/customerinfo' PASSING info AS "i"
           COLUMNS
             c_name VARCHAR(20) PATH 'name',
             a_name VARCHAR(20) default 'none' PATH 'assistant/name') AS T;

    C_NAME               A_NAME
    -------------------- --------------------
    Robert Shoemaker     none
    Matt Foreman         Gopher Runner

    2 record(s) selected.

Figure 7.12 Defining a default value for missing elements

7.3.3 Avoiding Type Errors

Be aware that every expression in the COLUMNS clause must return a value that can be cast to the specified data type. Otherwise the XMLTABLE execution fails. Consider the following cases:

• Incompatible data types. For example, the query in Figure 7.8 fails when it encounters an XML document where the Cid attribute has a non-numeric value, which cannot be cast to INTEGER.
• String length. If the XMLTABLE function defines a column of type CHAR(n) or VARCHAR(n), and the column-generating expression produces a string value that’s longer than n, then one of two things happens:
    • The value is truncated to n bytes, without warning or error. This truncation is mandated by the latest SQL/XML standard and implemented in DB2 for z/OS.
    • The query fails with error SQL16061N. This behavior was allowed by a previous version of the SQL/XML standard and is still effective in DB2 for Linux, UNIX, and Windows.
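The cases above can be sketched in Python to make the difference tangible. The functions cast_varchar and cast_integer are hypothetical helpers written for this illustration and are not DB2 APIs. Note also that the sketch counts characters, whereas the description above speaks of bytes.

```python
def cast_varchar(value, n, truncate):
    """Cast a string to a hypothetical VARCHAR(n).

    truncate=True  emulates silent truncation (the DB2 for z/OS behavior).
    truncate=False emulates a failing query (the behavior on Linux, UNIX,
    and Windows, where error SQL16061N is raised).
    """
    if len(value) <= n:
        return value
    if truncate:
        return value[:n]
    raise ValueError(f"value is longer than {n} characters")

def cast_integer(value):
    """Cast to INTEGER; a non-numeric value raises ValueError, like a failed query."""
    return int(value)

print(cast_varchar("Robert Shoemaker", 10, truncate=True))  # Robert Sho
print(cast_integer("1003"))                                 # 1003
```

Calling cast_varchar with truncate=False on the same input, or cast_integer with a non-numeric string, raises an exception. That corresponds to the whole query failing rather than a single value being skipped.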

The following examples show how such cases can be handled. In Figure 7.13, the definition of the custID column uses the XQuery if-then-else and castable expressions to check whether the Cid attribute can indeed be cast to INTEGER, and returns -1 if not. The value for the column custname is produced by the substring function so that only the first 20 characters of the actual name are used. The column-generating expression for the city uses if-then-else and the string-length function to test the length of the city value and returns an error flag if it is too long. Such techniques can be useful if strict data types are not enforced with XML Schema validation.

    SELECT T.*
    FROM customer,
         XMLTABLE('$INFO/customerinfo'
           COLUMNS
             custID   INTEGER     PATH '(if (@Cid castable as xs:integer) then @Cid else -1)',
             custname VARCHAR(20) PATH 'name/substring(.,1,20)',
             street   VARCHAR(20) PATH 'addr/street',
             city     VARCHAR(16) PATH 'addr/city/(if (string-length(.)

    xquery for $i in (1,5,3) return {$i};
    1
    5
    3
    3 record(s) selected.

    db2 => xquery let $j := (1,5,3) return {$j};
    1 5 3
    1 record(s) selected.

Figure 8.5 The difference between for and let

8.2.3 Understanding the where and order by Clauses

Figure 8.6 shows two more versions of the previous query with the for clause. The first version has an additional where clause to restrict the result set to values greater than 2. The second query in Figure 8.6 adds an order by clause to return the result items in ascending order. Both the where and the order by clause use the variable $i that is introduced in the for clause.

    db2 => xquery for $i in (1,5,3) where $i > 2 return {$i};
    5
    3
    2 record(s) selected.

    db2 => xquery for $i in (1,5,3) where $i > 2 order by $i return {$i};
    3
    5
    2 record(s) selected.

Figure 8.6 The effect of the where and order by clauses
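If you are more familiar with a general-purpose programming language, a rough analogy may help. The following Python sketch mirrors the for, let, where, and order by clauses with list operations. It is an analogy only: Python lists and tuples are not XQuery sequences, and all variable names are chosen for this illustration.

```python
items = (1, 5, 3)

# for: iterate over the sequence and produce one result per item
for_result = [i for i in items]

# let: bind the whole sequence to one variable and produce a single result
j = items
let_result = [j]

# for + where: keep only the items that satisfy the condition
where_result = [i for i in items if i > 2]

# for + where + order by: return the surviving items in ascending order
ordered_result = sorted(i for i in items if i > 2)

print(for_result)      # [1, 5, 3]
print(let_result)      # [(1, 5, 3)]
print(where_result)    # [5, 3]
print(ordered_result)  # [3, 5]
```

The lengths of the four lists correspond to the record counts reported for the for and let queries and for the two queries in Figure 8.6: three, one, two, and two.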

8.2.4 FLWOR Expressions with Multiple for and let Clauses

An XQuery FLWOR expression can contain multiple for or let clauses. Figure 8.7 shows two nested for clauses that act similarly to nested loops in a programming language. The outer for clause iterates over the sequence (1,5,3) and the inner for iterates over the sequence ("a","b"). For each iteration of the outer for clause, the inner for clause iterates over all the items in its sequence. This generates the full Cartesian product between the input sequences. An analogy in the SQL world is a SELECT statement with two tables in the FROM clause and no join predicate.

    db2 => xquery for $i in (1,5,3) for $j in ("a","b") return {$i,$j};
    1 a
    1 b
    5 a
    5 b
    3 a
    3 b

    6 record(s) selected.

Figure 8.7 Two nested for clauses produce a Cartesian product

The XQuery in Figure 8.8 also contains two nested for clauses. Their input sequences contain a common item, the atomic value 5, which is identified by a join predicate in the where clause. This is analogous to an SQL join. The difference is that SQL operates on sets of relational rows while XQuery operates on sequences of items. In these examples the items are just atomic values to allow for an easy introduction of the language. In the following sections we return to the customer sample data where the items are XML nodes, including elements, attributes, and full documents.

    db2 => xquery for $i in (1,5,3) for $j in (7,5) where $i = $j return {$i,$j};
    5 5
    1 record(s) selected.

Figure 8.8 Two nested for clauses with a join predicate
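The nested-loop behavior of Figures 8.7 and 8.8 has a direct counterpart in most programming languages. The Python sketch below is an analogy under that assumption, with tuples standing in for the constructed result elements.

```python
outer = (1, 5, 3)

# Two nested for clauses without a predicate: the full Cartesian product.
product = [(i, j) for i in outer for j in ("a", "b")]

# Two nested for clauses with a join predicate in the where clause.
joined = [(i, j) for i in outer for j in (7, 5) if i == j]

print(product)  # [(1, 'a'), (1, 'b'), (5, 'a'), (5, 'b'), (3, 'a'), (3, 'b')]
print(joined)   # [(5, 5)]
```

The order of the product follows the outer sequence first and the inner sequence second, just as in the output of Figure 8.7.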

Since the XQuery let clause does not iterate, it does not contribute to the generation of a Cartesian product of sequences. For example, the query in Figure 8.9 contains a for clause and two let clauses. Each iteration of the for clause leads to one item in the query result. The return clause constructs result elements. The value of each result element is the sequence of the values of the variables $i, $j, and $k.

    db2 => xquery for $i in (1,5,3) let $j := ("a","b") let $k := $i *2 return {$i,$j,$k};
    1 a b 2
    5 a b 10
    3 a b 6
    3 record(s) selected.

Figure 8.9 A FLWOR expression with for and let clauses
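In the same spirit as a loop with local variables, the query in Figure 8.9 can be mimicked in Python. This is an analogy only; tuples stand in for the constructed result elements.

```python
rows = []
for i in (1, 5, 3):       # for: iterates, one result per item
    j = ("a", "b")        # let: binds the whole sequence, no iteration
    k = i * 2             # let: binds a computed value
    rows.append((i, *j, k))

print(rows)  # [(1, 'a', 'b', 2), (5, 'a', 'b', 10), (3, 'a', 'b', 6)]
```

Only the for loop multiplies the number of results. The two assignments merely add values to each result.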

All variable names in XQuery have to be preceded by the dollar sign ($). The XQuery standard allows one or multiple spaces between the dollar sign and the beginning of the actual variable name, so that both $var and $ var are valid variable names. However, for readability and to avoid confusion it’s best to not use spaces. The same applies to hyphens. Note that $a-b and $ a-b are valid variable names that happen to contain a hyphen. But a - b is interpreted as an arithmetic operation because there are spaces between the hyphen and the characters a and b.

LEARNING XQUERY

When it comes to learning a new language there is no better way than learning by doing. We suggest that you download and install the latest version of DB2 Express-C, which is free, so that you can run the XQuery examples in this section hands-on. The examples show that you can explore the behavior of XQuery even without any tables in the database. We encourage you to extend and modify these examples and to try other combinations of for, let, where, order by, and return clauses. You may find that XQuery becomes intuitive quite quickly.

8.3 COMPARING FLWOR EXPRESSIONS, XPATH EXPRESSIONS, AND SQL/XML

This section compares and examines XPath, FLWOR, and SQL/XML queries in several ways. We look at traversing XML documents to extract specific elements, coding and placing XML predicates, result set cardinalities, and the integration of FLWOR expressions in SQL statements. We discuss several examples of how “the same” query can be written in several different ways. By “the same” we mean that the same result is returned from the sample data. The examples are not exhaustive; that is, they do not show all possible ways in which a certain query can be written.

8.3.1 Traversing XML Documents

Figure 8.10 illustrates five different ways to retrieve the customer name elements. There is no significant performance difference between them, but for readability and maintainability it is a good idea to use as simple a syntax as possible to express a query. Hence, options (4) and (5) are good choices in Figure 8.10.

The first FLWOR expression in Figure 8.10 iterates over the customerinfo elements and binds them to the variable $c, one at a time. The return clause then uses $c as the context to navigate to the name element. The second FLWOR expression iterates directly over the name elements and binds them to the variable $n, one at a time. The return clause then only emits the values of $n. The navigation to the name element has shifted from the return clause to the for clause. The third FLWOR expression iterates over the customer documents; that is, over the document nodes that are at the top of each document tree. The return clause then navigates from these document nodes, represented by $i, to the customerinfo/name elements. You will see shortly that the decision of what to iterate over in the for clause makes a difference as soon as you add predicates to the query. The fourth expression is a simple XPath that returns the sequence of all name elements. The fifth query is an SQL/XML statement that uses the XMLQUERY function to extract the name elements.

    --(1)
    xquery for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
           return $c/name;

    --(2)
    xquery for $n in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name
           return $n;

    --(3)
    xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")
           return $i/customerinfo/name;

    --(4)
    xquery db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name;

    --(5)
    SELECT XMLQUERY('$INFO/customerinfo/name') FROM customer;

Figure 8.10 Five different ways to retrieve the customer name elements

8.3.2 Using XML Predicates

Figure 8.11 extends the sample queries of Figure 8.10 by adding a predicate to only return the name of the customer whose Cid attribute has the value 1003. All five queries return the same result.

Again, the first two FLWOR expressions in Figure 8.11 differ in whether the step to the name element happens in the for or the return clause. This difference affects the where clause, which uses the variable from the for clause. If the for clause assigns the variable $i to customerinfo elements, then the where clause can simply use the XPath $i/@Cid to access the Cid attribute. This is because Cid is an attribute of customerinfo. The second FLWOR expression, however, binds the variable $i to name elements. This forces the where clause to use a parent step to navigate from $i to the Cid attribute. This is an extra navigation step, which makes the second FLWOR expression slightly more expensive.

The third FLWOR expression shows that filtering predicates can not only be located in the where clause but also in the XPath expression of the for clause. In fact, the entire query can again be expressed as a single XPath, which is the fourth query. And finally, the fifth query is an SQL/XML statement, which uses the XMLEXISTS predicate to properly include the filtering condition.

    --(1)
    xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
           where $i/@Cid = 1003
           return $i/name;

    --(2)
    xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/name
           where $i/../@Cid = 1003
           return $i;

    --(3)
    xquery for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid = 1003]
           return $c/name;

    --(4)
    xquery db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid = 1003]/name;

    --(5)
    SELECT XMLQUERY('$INFO/customerinfo/name')
    FROM customer
    WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');

Figure 8.11 Five different ways to apply a predicate
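The effect of where a predicate is placed can also be tried out with Python's standard library. The sketch below is an approximation and makes several assumptions: the element named column is a hypothetical wrapper that stands in for the XML column, the documents are simplified, and the attribute is compared as a string because ElementTree does not perform the numeric comparison that XQuery does.

```python
import xml.etree.ElementTree as ET

docs = [
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name></customerinfo>',
    '<customerinfo Cid="1004"><name>Matt Foreman</name></customerinfo>',
]
# Hypothetical wrapper element that stands in for the XML column as a whole.
column = ET.Element("column")
for d in docs:
    column.append(ET.fromstring(d))

# Like query (1): iterate over customerinfo elements, filter on the attribute.
q1 = [c.findtext("name") for c in column.findall("customerinfo")
      if c.get("Cid") == "1003"]

# Like query (2): iterate over name elements, then step up to the parent.
# ElementTree has no parent pointers, so a lookup table is built first.
parent_of = {child: parent for parent in column.iter() for child in parent}
q2 = [n.text for n in column.findall("customerinfo/name")
      if parent_of[n].get("Cid") == "1003"]

# Like query (4): the predicate sits inside the path itself.
q4 = [n.text for n in column.findall("customerinfo[@Cid='1003']/name")]

print(q1, q2, q4)
```

All three lists contain the same single name. The second variant needs extra work to get from the name element back to its parent, which echoes the extra navigation step discussed above.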


The next example (Figure 8.12) shows four different queries that return phone elements whose attribute type has the value cell. The first FLWOR expression uses two nested for clauses. The outer for clause iterates over the customerinfo elements and assigns them to the variable $c. The inner for clause uses the path $c/phone to iterate over the phone elements of the current customer. For each such phone element, the where clause checks whether the type attribute has the value cell. If so, the return clause returns that phone element.

The second FLWOR expression shows that the same query result can be achieved without nested for clauses. It uses only a single for clause to iterate directly over the phone elements. The predicate could be applied in the where clause, but this query adds the predicate to the return clause. You will see later that predicates in the return clause can lead to different query results if element construction is involved. The third query is a simple XPath without any FLWOR clauses. The last query is an SQL/XML statement that uses the XMLTABLE function to produce one result row per cell phone, just like the other queries.

    --(1)
    xquery for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
           for $p in $c/phone
           where $p/@type = "cell"
           return $p;

    --(2)
    xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone
           return $i[@type = "cell"];

    --(3)
    xquery db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone[@type="cell"];

    --(4)
    SELECT T.phone
    FROM customer,
         XMLTABLE('$INFO/customerinfo/phone[@type="cell"]'
           COLUMNS phone XML PATH '.') as T;

Figure 8.12 Four different queries that return the same phone elements

NOTE
An advantage of SQL/XML queries is that they can contain parameter markers and host variables in their predicates, as discussed in section 7.6. This is not possible when you use XQuery without SQL.

8.3.3 Result Set Cardinalities in XQuery and SQL/XML

Let’s look at result set cardinalities using the three queries in Figure 8.13 as examples. Each of the three queries returns all five customer phone numbers, three from one of our sample documents and two from the other. The first query is an XPath expression that produces a sequence of five text nodes, and each item in that sequence is returned as a separate result row. The second query uses the XMLQUERY function and returns the same five phone numbers in two result rows. The reason is that XMLQUERY is a scalar function in an SQL statement, and scalar functions produce one value for each input row. In our example there are two input rows (documents) and for each of them XMLQUERY produces one sequence of phone numbers. You can turn the items in these sequences into separate rows only if you use a table function (as opposed to a scalar function), which generates a set of rows. This is what the XMLTABLE function does.

xquery
db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo/phone/text();

905-555-7258
416-555-2937
905-555-8743
905-555-4789
416-555-3376

5 record(s) selected.

SELECT XMLQUERY('$INFO/customerinfo/phone/text()')
FROM customer;

905-555-7258416-555-2937905-555-8743
905-555-4789416-555-3376

2 record(s) selected.

SELECT T.phone
FROM customer,
     XMLTABLE('$INFO/customerinfo/phone'
       COLUMNS phone VARCHAR(20) PATH '.') as T;

905-555-7258
416-555-2937
905-555-8743
905-555-4789
416-555-3376

5 record(s) selected.

Figure 8.13 Three different queries that return the same five phone numbers
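The difference in row counts can be pictured with a small Python sketch. It is only an analogy for the scalar versus table function behavior described above, not DB2 code; the phone numbers are the ones shown in Figure 8.13, grouped by document as in the XMLQUERY result.

```python
# Phone numbers per document, grouped as in the XMLQUERY output of Figure 8.13.
documents = [
    ["905-555-7258", "416-555-2937", "905-555-8743"],
    ["905-555-4789", "416-555-3376"],
]

# XPath or XMLTABLE style: every item becomes its own result row.
one_row_per_item = [phone for doc in documents for phone in doc]

# XMLQUERY style: a scalar function yields one value (a whole sequence) per input row.
one_row_per_document = ["".join(doc) for doc in documents]

print(len(one_row_per_item))      # 5
print(len(one_row_per_document))  # 2
```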

A key difference between XPath or XQuery expressions on the one hand and SQL/XML statements on the other is that XPath and XQuery expressions always return a single column of type XML. XQuery cannot return multiple columns in a result set or data types other than XML. SQL/XML statements can read values from XML documents and return them as relational result sets that have multiple columns and traditional SQL data types (see section 7.3, Retrieving XML Values in Relational Format with XMLTABLE).

NOTE The examples in this section have shown that many simple queries do not require XQuery FLWOR expressions but can be written much more simply as plain XPath expressions. Indeed, many applications are well served by combining XPath and SQL and do not necessarily require the extra power of XQuery. However, XQuery has very valuable features that XPath alone does not provide. For example, construction of XML data and joins across multiple XML documents are not possible with XPath alone. Section 8.4 and Chapter 9, Querying XML Data: Advanced Queries and Troubleshooting, provide examples.

8.3.4 Using FLWOR Expressions in SQL/XML

Note that SQL/XML and XQuery are not mutually exclusive. Chapter 7 focused on examples that combine XPath and SQL, which is supported both in DB2 for Linux, UNIX, and Windows and DB2 for z/OS. In DB2 for Linux, UNIX, and Windows, the same SQL/XML functions can also take more complex XQuery expressions as input, such as FLWOR expressions. Figure 8.14 shows an example. It returns the name of the customer whose Cid attribute has the value 1003. Remember that the XMLEXISTS predicate is truly an existence check. If the XQuery or XPath expression in the XMLEXISTS returns an empty sequence, then XMLEXISTS evaluates to FALSE and the current row is eliminated.

SELECT XMLQUERY('for $i in $INFO/customerinfo/name
                 return $i/text()')
FROM customer
WHERE XMLEXISTS('let $i := $INFO/customerinfo
                 where $i/@Cid = 1003
                 return $i');

Figure 8.14 Return the name of the customer whose Cid is 1003

If the same result can be achieved with simple XPath then for simplicity it is recommended to avoid FLWOR expressions in SQL/XML functions. For example, the query in Figure 8.15 is simpler than the query in Figure 8.14 and returns an identical result set.

SELECT XMLQUERY('$INFO/customerinfo/name/text()')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[@Cid = 1003]');

Figure 8.15 A simpler query to return the same result as Figure 8.14
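The existence-check behavior of XMLEXISTS, and the one-value-per-row behavior of a scalar function such as XMLQUERY, can be sketched in Python as follows. This is an analogy rather than DB2 code; the rows and both function names are hypothetical, with only the pairing of Cid 1003 and Robert Shoemaker taken from the sample data.

```python
# Hypothetical rows standing in for customer documents: (Cid, name).
customers = [(1002, "Customer A"), (1003, "Robert Shoemaker"), (1004, "Customer C")]

def select_with_predicate_in_query(rows, cid):
    """XMLQUERY-like: one result per row; non-matching rows yield an empty sequence."""
    return [[name] if row_cid == cid else [] for row_cid, name in rows]

def select_with_exists_filter(rows, cid):
    """XMLEXISTS-like: a row survives only if the check finds something."""
    return [[name] for row_cid, name in rows if row_cid == cid]

print(select_with_predicate_in_query(customers, 1003))  # [[], ['Robert Shoemaker'], []]
print(select_with_exists_filter(customers, 1003))       # [['Robert Shoemaker']]
```

Only the second form reduces the number of rows, which is why row filtering belongs in the WHERE clause.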

Figure 8.16 provides an example of how you should not integrate XQuery in SQL. The problem with this query is that the predicate on the Cid attribute is included in the FLWOR expression in the SELECT clause of the SQL statement. In this location, the predicate does not eliminate any rows from the customer table. To work as expected, the predicate needs to be in the WHERE clause of the SQL statement, using XMLEXISTS. This issue has been discussed in section 7.5, Common Mistakes with SQL/XML Predicates.

SELECT XMLQUERY('for $i in $INFO/customerinfo/name
                 where $i/@Cid = 1003
                 return $i/text()')
FROM customer;

Figure 8.16 Do not place row-filtering predicates in the SELECT clause!

8.4 CONSTRUCTING XML DATA

Constructing XML data in XQuery is easy. You can simply type regular XML tags as part of your XQuery. This method is called direct XML construction. For example, an XML element or document just by itself is already a valid XQuery expression. Figure 8.17 is a simple example where the XQuery consists of nothing but a direct element constructor. The name of the constructed element is title and its value is the literal string Hello. The result of the XQuery is the constructed element itself. This cannot be done with XPath alone.

db2 => xquery <title>Hello</title>;

<title>Hello</title>

1 record(s) selected.

db2 =>

Figure 8.17 Constructing the element title with the value "Hello"

8.4.1 Constructing Elements with Computed Values

It is often desirable to generate XML elements whose values are dynamically computed during query execution. Constructed elements can have computed values if they contain XQuery variables or other dynamic expressions. Such expressions must be enclosed in curly brackets and are often used in the return clause of a FLWOR expression. For example, the query in Figure 8.18 retrieves the name and city values, and returns this information in a newly constructed XML document.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
where $i/@Cid = 1003
return {$i/name/text()} {$i/addr/city/text()} ;

Robert ShoemakerAurora

1 record(s) selected.

Figure 8.18 Construction of an XML document with dynamic values
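As a rough analogy outside DB2, the same idea of building a new document whose element values are computed from an existing one can be sketched in Python with xml.etree.ElementTree. The source document is an abbreviated, assumed version of the sample customer, and the element names customer, fullname and town are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated, assumed version of the stored customer document.
source = ET.fromstring(
    '<customerinfo Cid="1003"><name>Robert Shoemaker</name>'
    '<addr><city>Aurora</city></addr></customerinfo>'
)

# Build a new element whose children take their values from the source document.
# The element names "customer", "fullname" and "town" are hypothetical.
result = ET.Element("customer")
ET.SubElement(result, "fullname").text = source.findtext("name")
ET.SubElement(result, "town").text = source.findtext("addr/city")

constructed = ET.tostring(result, encoding="unicode")
print(constructed)
# <customer><fullname>Robert Shoemaker</fullname><town>Aurora</town></customer>
```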

Several things are noteworthy about Figure 8.18. The returned XML data uses XML element names that do not exist in the XML documents that are stored in the table. In other words, the query reads one XML format but returns another. In effect, the query transforms the data. Although XQuery is not always a substitute for XSLT (Extensible Stylesheet Language Transformations), it can carry out many transformations easily and efficiently. In contrast to Figure 8.17, the values of the constructed elements in Figure 8.18 are not provided as literal strings but computed by XPath expressions. These XPath expressions must be enclosed in curly brackets to indicate that they are to be evaluated and not used as literal string values. If you forget the curly brackets, the query result contains the actual path expressions, which is not useful:

$i/name/text()

10 return $i;

SQL16061N The value "Unshipped" cannot be constructed as, or cast (using an implicit or explicit cast) to the data type "xs:double". Error QName=err:FORG0001. SQLSTATE=10608

Figure 8.33 Cannot compare xs:string to xs:double!

What if you have some documents where the Status attribute contains numeric values and some documents where it contains alphanumeric string values? In that case you might still want to use the query in Figure 8.33 to find all orders whose Status has a numeric value greater than 10. You can use the XQuery expression castable together with the if-then-else expression to apply the numeric predicate only if the Status attribute of a given document is a valid integer number. For all other documents the value false is produced to exclude them from the result set.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/(
        if (@Status castable as xs:integer)
        then (@Status > 10)
        else false )
return $i;

Figure 8.34 XQuery with the expression castable
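The guard in Figure 8.34 follows a common pattern: test whether a value can be converted before applying a numeric predicate. A minimal Python sketch of that pattern is shown below, as an analogy only. The function name and the list of Status values are hypothetical, and Python's int conversion is only roughly comparable to castable as xs:integer.

```python
def status_greater_than(status, threshold):
    """Apply a numeric predicate only if the value can be read as an integer."""
    try:
        number = int(status)   # rough analogue of "castable as xs:integer"
    except ValueError:
        return False           # non-numeric values are excluded from the result
    return number > threshold

statuses = ["Unshipped", "12", "7", "Shipped", "42"]   # hypothetical Status values
selected = [s for s in statuses if status_greater_than(s, 10)]
print(selected)   # ['12', '42']
```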

The SQL/XML statement in Figure 8.35 intends to read all purchase orders where the first item in the order is less expensive than the second item. Clearly, the purchase order in Figure 8.28 should be in the result set because the price of its first item is 9.99 while the price of the second item is 49.99. But, contrary to what you might expect, the predicate in Figure 8.35 does not select the purchase order in Figure 8.28. Let’s examine why that is. First of all, note that the predicate [item[1]/price < item[2]/price] does not include any literal value that could provide an indication of the data type of the comparison. Hence, according to the XQuery standard, DB2 simply performs a string comparison, and the string “9.99” is greater than the string “49.99”. In summary, the query in Figure 8.35 runs, but does not work the way you want.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[
                   item[1]/price < item[2]/price]');

Figure 8.35 String comparison between two elements in a document

The solution is to cast either the left side of the predicate, or the right side, or both to xs:double, as shown in Figure 8.36. If at least one of the two operands is cast to a specific data type, then this determines the data type of the comparison operation and DB2 tries to cast the other operand to the same data type. Consequently, the query in Figure 8.36 performs a numeric comparison of the two price elements and therefore includes the purchase order in Figure 8.28 in the result set, as expected.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[
                   item[1]/xs:double(price) < item[2]/price]');

Figure 8.36 Numeric comparison between two elements in a document
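The effect of comparing the two prices as strings rather than as numbers is easy to reproduce outside DB2. The following Python lines are an analogy only. Note one difference: in Python both operands must be converted explicitly, whereas in the query casting one operand is enough.

```python
first_price, second_price = "9.99", "49.99"   # the two prices discussed above

# Compared as strings, "9" sorts after "4", so the first price is not "less".
as_strings = first_price < second_price

# Compared as numbers, 9.99 is less than 49.99.
as_numbers = float(first_price) < float(second_price)

print(as_strings, as_numbers)   # False True
```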

Note that the casting functions, which are actually called type constructors, can cast at most one item at a time. The following expression would fail because one purchase order contains multiple item elements, and a sequence of two or more items cannot be cast to a double value.

xs:double($i/PurchaseOrder/item/price)

To cast all items in the sequence, use the type constructor at the end of the XPath expression, as in either of the following:

$i/PurchaseOrder/item/xs:double(price)
$i/PurchaseOrder/item/price/xs:double(.)
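Python behaves in a comparable way, which makes for a convenient illustration: a conversion function accepts one value, not a list, so it has to be applied once per item. This is an analogy to the type constructor rule above, not DB2 code.

```python
prices = ["9.99", "49.99"]   # two price values from one purchase order

try:
    float(prices)            # like xs:double(...) applied to a whole sequence
    whole_sequence_ok = True
except TypeError:
    whole_sequence_ok = False

# Applying the conversion as the last step, once per item, works.
per_item = [float(p) for p in prices]

print(whole_sequence_ok, per_item)   # False [9.99, 49.99]
```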

8.6 ARITHMETIC EXPRESSIONS

XQuery provides arithmetic operators for addition (+), subtraction (-), multiplication (*), division (div), integer division (idiv), and modulus (mod). A subtraction operator must be preceded by whitespace if it could otherwise be interpreted as part of a variable or tag name. For example, price-discount will be interpreted as a single name, but price -discount and price - discount will be interpreted as arithmetic expressions between two separate items. Arithmetic operators can be used with elements, attributes, or a mix of both.

Figure 8.37 provides two examples, one in SQL/XML and one in XQuery notation. Both multiply the quantity and the price of each item in the purchase order that has PoNum=5000. Note that the for clause of the XQuery iterates over item elements and computes the value of each item separately.

SELECT T.id, T.itemvalue
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         id        VARCHAR(15)  PATH 'partid',
         itemvalue DECIMAL(9,2) PATH 'quantity * price') as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum= 5000]');

ID              ITEMVALUE
--------------- -----------
100-100-01            29.97
100-103-01           249.95

2 record(s) selected.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
          /PurchaseOrder[@PoNum= 5000]/item
let $q := $i/quantity
let $p := $i/price
return {$q * $p};

29.97
249.95

2 record(s) selected.

Figure 8.37 SQL/XML and XQuery with arithmetic expression

The first step in evaluating an arithmetic expression is to evaluate its operands. If one of the operands is an empty sequence, the result of the arithmetic expression is also an empty sequence. If one of the operands is a sequence of more than one item, a type error is raised. This happens in Figure 8.38. This query iterates over purchase orders, not over items. Since a purchase order typically has multiple items, the let clauses bind a sequence of multiple quantity elements to $q and a sequence of multiple price elements to $p. This leads to an error in the multiplication in the return clause.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
          /PurchaseOrder[@PoNum= 5000]
let $q := $i/item/quantity
let $p := $i/item/price
return {$q * $p};

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "item()" is expected.

Figure 8.38 Operands in an arithmetic expression must be zero or one item
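The operand rules can be modeled in a few lines of Python, with lists standing in for XQuery sequences. This is a sketch of the rules as described above, not of DB2's implementation, and the function name multiply is hypothetical. The quantities and prices are those of the two items in purchase order 5000.

```python
from decimal import Decimal

def multiply(left, right):
    """Multiply two 'sequences' (lists) under the operand rules described above."""
    if len(left) > 1 or len(right) > 1:
        raise TypeError("operand is a sequence of more than one item")
    if not left or not right:
        return []                      # an empty operand gives an empty result
    return [left[0] * right[0]]

quantities = [Decimal("3"), Decimal("5")]       # items of purchase order 5000
prices = [Decimal("9.99"), Decimal("49.99")]

# Per item, as in Figure 8.37: each operand is a single value.
per_item = [multiply([q], [p])[0] for q, p in zip(quantities, prices)]

# Per order, as in Figure 8.38: both operands hold two items, so this fails.
try:
    multiply(quantities, prices)
    per_order_failed = False
except TypeError:
    per_order_failed = True

print(per_item, per_order_failed)
```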

An error is also raised if one of the operands cannot be cast to xs:double. For example, if a quantity element contains the string value “five” then the arithmetic expression fails at runtime.

XQuery provides a division operator (div) and an integer division operator (idiv). The latter casts its result to type xs:integer. For example, the expression 5 div 2 returns the value 2.5, whereas the expression 5 idiv 2 produces the value 2. The cast to xs:integer discards the fractional part, so for positive results the idiv operator always rounds down to the next lower integer value. For testing purposes you can run XQuery expressions with cast and arithmetic operations in the DB2 Command Line Processor, such as in Figure 8.39.

xquery xs:integer(3.9);

3

1 record(s) selected.

xquery 10 + 100 idiv 9;

21

1 record(s) selected.
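As a cross-check outside DB2, the division rules can be mimicked in Python. The helper names xq_div and xq_idiv are hypothetical. The last line covers a case the text does not discuss: the XQuery specification defines idiv as truncating toward zero, so a negative quotient is not rounded down the way Python's // operator rounds it.

```python
import math

def xq_div(a, b):
    """Analogue of the XQuery div operator."""
    return a / b

def xq_idiv(a, b):
    """Analogue of idiv: divide, then drop the fractional part."""
    return math.trunc(a / b)

print(xq_div(5, 2), xq_idiv(5, 2))   # 2.5 2
print(10 + xq_idiv(100, 9))          # 21
print(math.trunc(3.9))               # 3
print(xq_idiv(-5, 2), -5 // 2)       # -2 -3
```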

Figure 8.39 Testing XQuery expressions in the CLP

8.7 XQUERY FUNCTIONS

The XQuery language provides a large number of built-in functions. These include aggregate functions such as count and sum, string functions such as contains and starts-with, functions to manipulate date and timestamp values, numeric functions, and others. A complete discussion of all functions is beyond the scope of this book. Appendix C, Further References, contains pointers to the complete reference of all supported XPath and XQuery functions in DB2 for z/OS and DB2 for Linux, UNIX, and Windows.

In this section we list only a subset of the available XQuery functions to highlight those that are most frequently used and have been found useful in DB2 pureXML production applications. We provide some examples and encourage you to try more functions and queries hands-on with the DB2 sample database. In general, all functions can be applied to elements as well as to attributes. We categorize the discussion of XQuery functions as follows:

• String functions (section 8.7.1)
• Number and aggregation functions (section 8.7.2)
• Sequence functions (section 8.7.3)
• Node and namespace functions (section 8.7.4)
• Date and time functions (section 8.7.5)
• Boolean functions (section 8.7.6)

All XQuery functions belong to a default namespace that is always implicitly bound to the namespace prefix fn. Since it is a default namespace, the prefix can be omitted. For example, concat and fn:concat refer to the same concatenation function.

8.7.1 String Functions

Some of the most commonly used string functions are listed in Table 8.1.

Table 8.1 Commonly Used String Functions

String Functions    Description
concat              The function fn:concat returns a string that is the concatenation of two or more atomic values.
string-join         The function fn:string-join takes as input a sequence of string values and a separator character. It returns a single string in which the input strings are concatenated but separated by the separator character.
contains            The function fn:contains returns true if a string contains a given substring.
matches             The function fn:matches returns true if a string matches a given regular expression.
starts-with         The function fn:starts-with returns true if a string begins with a given substring.
ends-with           The function fn:ends-with returns true if a string ends with a given substring.
lower-case          The function fn:lower-case converts a string to lowercase.
upper-case          The function fn:upper-case converts a string to uppercase.
translate           The function fn:translate replaces selected characters in a string with replacement characters.
string              The function fn:string returns the string representation of a value.
string-length       The function fn:string-length returns the length of a string.
substring           The function fn:substring returns a substring of a string, based on a start position and a length. It is similar to the substr function in SQL.
substring-after     The function fn:substring-after returns the tail of the input string after the first occurrence of a given search string.
substring-before    The function fn:substring-before returns the beginning of the input string up to (but excluding) the first occurrence of a given search string.
tokenize            The function fn:tokenize breaks a string into a sequence of substrings.
normalize-space     The function fn:normalize-space strips leading and trailing whitespace characters from a string and replaces each internal sequence of whitespace characters with a single space character.

A simple example of the concat function is shown in Figure 8.40. Here, the concat function has four arguments. The first and third arguments are literal string values, while the second and fourth arguments are expressions based on the variable $i that is bound in the for clause.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
where $i/../@PoNum=5000
return concat("Order ",$i/../@PoNum," - Item ",$i/partid);

Order 5000 - Item 100-100-01
Order 5000 - Item 100-103-01

2 record(s) selected.

Figure 8.40 Concatenation of string literals and expressions

Figure 8.41 demonstrates three string functions. The query uses the concat function to concatenate the values of the attributes PoNum and Status into a single string. In the second column it utilizes the string-join function to produce a list of partid values that are separated by the semicolon. Note that the arguments of the concat function are single values while the first argument of the string-join function evaluates to a sequence of multiple elements. The contains function in the WHERE clause restricts the result set to purchase orders that have at least one item whose name contains the word “Super”.

SELECT XMLQUERY('$PORDER/PurchaseOrder/concat(@PoNum,@Status)') AS idstatus,
       XMLQUERY('string-join($PORDER/PurchaseOrder/item/partid,";")') AS items
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder/item[contains(name,"Super")]');

IDSTATUS           ITEMS
------------------ ------------------------------------------
5000Unshipped      100-100-01;100-103-01
5001Shipped        100-101-01;100-103-01;100-201-01
5004Shipped        100-100-01;100-103-01

3 record(s) selected.

Figure 8.41 Query with three XQuery string functions
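For readers who know Python, the three functions in Figure 8.41 have close counterparts there. The sketch below is an analogy, not DB2 code; it uses the attribute values and items of purchase order 5000 as they appear in this chapter's query results.

```python
# Values of purchase order 5000 as shown in this chapter's query results.
po_num, status = "5000", "Unshipped"
items = [
    ("100-100-01", "Snow Shovel, Basic 22 inch"),
    ("100-103-01", "Snow Shovel, Super Deluxe 26 inch"),
]

id_status = po_num + status                             # concat(@PoNum, @Status)
item_list = ";".join(partid for partid, _ in items)     # string-join(item/partid, ";")
has_super = any("Super" in name for _, name in items)   # item[contains(name, "Super")]

print(id_status, item_list, has_super)
# 5000Unshipped 100-100-01;100-103-01 True
```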

XQuery functions can be nested. The query in Figure 8.42 returns the name of an item from purchase order 5000, if the item name contains a comma and contains the word Basic after the comma. The function substring-after is the first argument of the contains function and produces the part of the name after the comma. Thus, the contains function is applied only to that second part of each item name.

SELECT XMLQUERY('$PORDER/PurchaseOrder/item[
                   contains(substring-after(name,","), "Basic")]/name')
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

Snow Shovel, Basic 22 inch

1 record(s) selected.

Figure 8.42 Query with nested XQuery string functions
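The nesting in Figure 8.42 can be traced step by step with a small Python analogy. The helper substring_after is a hypothetical stand-in for fn:substring-after, written for this illustration; the item names are those of purchase order 5000.

```python
def substring_after(text, search):
    """Part of text after the first occurrence of search, or "" if search is absent."""
    _, found, tail = text.partition(search)
    return tail if found else ""

names = ["Snow Shovel, Basic 22 inch", "Snow Shovel, Super Deluxe 26 inch"]

# The inner call yields the text after the comma; the containment test sees only that part.
selected = [n for n in names if "Basic" in substring_after(n, ",")]

print(selected)   # ['Snow Shovel, Basic 22 inch']
```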

You can use the function tokenize to split a string into multiple smaller strings. For example, the query in Figure 8.43 splits the values of the partid elements based on the occurrences of the “-” character. The function returns the substrings as a sequence. Instead of using a single character to split the input string, you can also tokenize a string based on the occurrences of a substring or regular expression.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
   @PoNum=5000]/item/tokenize(partid,"-");

100
100
01
100
103
01

6 record(s) selected.

Figure 8.43 Splitting a string into a sequence of separate items

Although the query in Figure 8.43 returns the tokenized substrings in separate rows, it can be more useful to return them in separate columns instead, which happens in Figure 8.44. The query in Figure 8.44 uses the XMLTABLE function to generate one row per order item. Each generated row has an INTEGER column called OrderNo and an XML column called partid. The INTEGER column contains the purchase order number (PoNum), and the XML column contains the sequence of substrings produced by the tokenize function. In the SELECT clause, this XML column is not returned as-is, but used as input to each of three XMLQUERY functions. They use positional predicates [1], [2], and [3], respectively, to obtain the first, second, and third token of the sequence separately.

SELECT T.orderno,
       XMLCAST(XMLQUERY('$PARTID[1]') as CHAR(3)) as id1,
       XMLCAST(XMLQUERY('$PARTID[2]') as CHAR(3)) as id2,
       XMLCAST(XMLQUERY('$PARTID[3]') as CHAR(3)) as id3
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         OrderNo INTEGER PATH '../@PoNum',
         partid  XML     PATH 'tokenize(partid,"-")') as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

ORDERNO       ID1 ID2 ID3
------------- --- --- ---
         5000 100 100 01
         5000 100 103 01

2 record(s) selected.

Figure 8.44 Splitting a string into separate columns
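The two result shapes, one row per token versus one row per item with the tokens in columns, can be compared in a short Python analogy. It uses re.split because tokenize also accepts a regular expression as the separator; the partid values are those of purchase order 5000.

```python
import re

partids = ["100-100-01", "100-103-01"]   # partid values of purchase order 5000

# One row per token, as in Figure 8.43.
tokens = [t for p in partids for t in re.split("-", p)]

# One row per item with the tokens in separate columns, as in Figure 8.44.
columns = [tuple(re.split("-", p)) for p in partids]

print(tokens)    # ['100', '100', '01', '100', '103', '01']
print(columns)   # [('100', '100', '01'), ('100', '103', '01')]
```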

We encourage you to try other string functions on your own. For example, use the translate function to change the delimiter in the partid values from 100-103-01 to 100/103/01. Or, use the starts-with function to find all items whose name begins with the word “Snow”.
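If you want to check your expectations before running such queries, the intended results of both exercises can be worked out with Python string methods. This is an analogy, not the XQuery solution itself, and the second item name in the list is hypothetical.

```python
partid = "100-103-01"

# Character-for-character replacement, comparable to translate(partid, "-", "/").
with_slashes = partid.translate(str.maketrans("-", "/"))

# Prefix test, comparable to starts-with(name, "Snow"). The second name is hypothetical.
names = ["Snow Shovel, Basic 22 inch", "Garden Hose, 50 foot"]
snow_items = [n for n in names if n.startswith("Snow")]

print(with_slashes)   # 100/103/01
print(snow_items)     # ['Snow Shovel, Basic 22 inch']
```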

8.7.2 Number and Aggregation Functions

Let’s turn to numeric XQuery functions, some of which are shown in Table 8.2.

Table 8.2 Commonly Used Number and Aggregation Functions

Numeric and Aggregation Functions    Description
sum                                  The function fn:sum returns the sum of the values in a sequence.
avg                                  The function fn:avg returns the average of the values in a sequence.
max                                  The function fn:max returns the maximum of the values in a sequence.
min                                  The function fn:min returns the minimum of the values in a sequence.
abs                                  The function fn:abs returns the absolute value of a numeric value.
round                                The function fn:round returns the integer that is closest to the given numeric value.

Figure 8.45 shows two XQuery expressions with number and string functions. The first one returns the sum of item prices for each purchase order where the value of the Status attribute starts with “Ship”. For example, this includes orders where the status is Shipped or Shipping. A separate sum is computed for the items within each such purchase order. The second query computes the average item price across all orders that match the starts-with predicate. A single average value is computed for these orders, because the XPath expression that produces the sequence of purchase orders is the argument of the avg function.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
   starts-with(@Status,"Ship")]/sum(item/price);

73.97
33.97
59.98
33.97

4 record(s) selected.

xquery
avg( db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder[
   starts-with(@Status,"Ship")]/item/price );

18.3536363636364

1 record(s) selected.

Figure 8.45 The XQuery aggregation functions sum and avg
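Where the aggregation function sits determines how many values come back: applied per order it yields one value per order, applied to the whole sequence it yields one value overall. The Python sketch below illustrates this with hypothetical order keys and prices, not the book's sample data.

```python
from decimal import Decimal

# Hypothetical item prices of two shipped orders (not the book's sample data).
orders = {
    "A": [Decimal("10.00"), Decimal("20.00")],
    "B": [Decimal("30.00")],
}

# Aggregation applied per order: one sum for each order.
sum_per_order = {po: sum(prices) for po, prices in orders.items()}

# Aggregation applied to the items of all orders: a single average.
all_prices = [p for prices in orders.values() for p in prices]
overall_avg = sum(all_prices) / len(all_prices)

print(sum_per_order)   # {'A': Decimal('30.00'), 'B': Decimal('30.00')}
print(overall_avg)     # 20.00
```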

The same two queries can be coded in SQL/XML notation, as shown in Figure 8.46. They produce the same results as their counterparts in Figure 8.45. Note that the second SELECT statement in Figure 8.46 uses the SQL function AVG, not the XQuery function avg.

SELECT XMLQUERY('$PORDER/PurchaseOrder/sum(item/price)')
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[starts-with(@Status,"Ship")]');

SELECT AVG(T.itemprice)
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS itemprice DECIMAL(9,2) PATH 'price') AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[starts-with(@Status,"Ship")]');

Figure 8.46 The XQuery function sum and the SQL function AVG

In Figure 8.45 and Figure 8.46 you can replace the functions sum and avg with the function count to obtain the number of elements rather than the sum or average of their values. Try it out.

8.7.3 Sequence Functions

The count function is not a numeric function but a sequence function (see Table 8.3) because it counts the number of items in a sequence.

Table 8.3 Commonly Used Sequence Functions

Sequence Functions    Description
count                 The function fn:count returns the number of items in a sequence.
data                  The function fn:data returns the input sequence but replaces any nodes in the sequence with their values.
distinct-values       The function fn:distinct-values returns the distinct values in a sequence. It is similar to DISTINCT in SQL.
deep-equal            The function fn:deep-equal compares two documents or sequences and returns true if they meet the requirements for deep equality. Roughly speaking, two documents or sequences are deep equal if every aspect of their structure, values, and data type is equal.
empty                 The function fn:empty returns true if the argument is an empty sequence.
exactly-one           The function fn:exactly-one returns its argument if the argument contains exactly one item.
zero-or-one           The function fn:zero-or-one returns its argument if the argument contains one item or is an empty sequence.
one-or-more           The function fn:one-or-more returns its argument if the argument is a sequence of one or more items.
last                  The function fn:last takes no arguments and returns the number of items in the sequence that is currently being processed. It is usually used in a positional predicate to return the last item in a sequence.
position              The function fn:position returns the position of the context item in the sequence that is currently being processed.

Figure 8.47 shows three examples that use sequence functions. The goal is to find all the different values that Status attributes in purchase orders can have. The first XQuery in Figure 8.47 returns the value of the Status attribute from all purchase orders in the purchaseorder table. It uses the function data to obtain the attribute values instead of the attribute nodes. The second XQuery uses the distinct-values function to retrieve unique Status values only. The result shows that the sample data contains two different spellings of the value Unshipped, one with lowercase s and one with uppercase S. To address this, the third XQuery uses the string function upper-case to convert all Status values to uppercase. The SQL/XML statement in Figure 8.48 produces the same result by using DISTINCT and the SQL function UPPER.

xquery
db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
   /PurchaseOrder/data(@Status);

Unshipped
Shipped
Shipped
UnShipped
Shipped
Shipped

6 record(s) selected.

xquery
distinct-values(db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
   /PurchaseOrder/@Status);

Unshipped
Shipped
UnShipped

3 record(s) selected.

xquery
distinct-values(db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
   /PurchaseOrder/upper-case(@Status));

UNSHIPPED
SHIPPED

2 record(s) selected.

Figure 8.47 Using the XQuery sequence functions data() and distinct-values()
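The effect of normalizing case before removing duplicates can be reproduced with the six Status values shown in Figure 8.47. The Python lines below are an analogy only. dict.fromkeys is used here to keep the first-seen order, which happens to match the output above; the order in which distinct-values returns its results should not be relied on.

```python
# The six Status values returned by the first query in Figure 8.47.
statuses = ["Unshipped", "Shipped", "Shipped", "UnShipped", "Shipped", "Shipped"]

# Duplicates removed, but the two spellings of "Unshipped" remain distinct.
distinct = list(dict.fromkeys(statuses))

# Converting to uppercase first merges the two spellings.
distinct_upper = list(dict.fromkeys(s.upper() for s in statuses))

print(distinct)         # ['Unshipped', 'Shipped', 'UnShipped']
print(distinct_upper)   # ['UNSHIPPED', 'SHIPPED']
```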

SELECT DISTINCT(UPPER(T.stat))
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder'
       COLUMNS stat VARCHAR(15) PATH '@Status') AS T;

Figure 8.48 Using the SQL function DISTINCT

The SQL/XML statement in Figure 8.49 returns the first and the last item of purchase order 5000 in two separate columns of type XML. The function last(), with no argument, returns the number of items in the sequence and therefore points to the last item.

SELECT XMLQUERY('$PORDER/PurchaseOrder/item[1]'),
       XMLQUERY('$PORDER/PurchaseOrder/item[last()]')
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

Figure 8.49 Positional predicates to obtain the first and last items

8.7.4 Namespace and Node Functions

Some commonly used namespace and node functions are listed in Table 8.4. The namespace functions are discussed in Chapter 15, Managing XML Data with Namespaces.

Table 8.4 Commonly Used Namespace and Node Functions

Name and Node Functions     Description
name                        The function fn:name returns the name of a node, typically an element or attribute name. The returned name includes the namespace prefix of the node, if applicable.
local-name                  The function fn:local-name returns the name of a node, but does not include a namespace prefix.
namespace-uri               The function fn:namespace-uri returns the namespace URI of the given node.
namespace-uri-for-prefix    The function fn:namespace-uri-for-prefix returns the namespace URI that is associated with a namespace prefix for an element.
in-scope-prefixes           The function fn:in-scope-prefixes returns a list of prefixes for all in-scope namespaces of an element.

The functions name and local-name are very powerful because they allow access to element and attribute names. In contrast, all previous queries in this chapter used element and attribute names only to get to their values. As an example, the XMLTABLE function in Figure 8.50 iterates over all the child elements of the item elements of purchase order 5000. For each child element it returns the element’s name and value together with the PoNum of the purchase order. Note that the row-generating expression ends with a wildcard that selects all child elements under item. The expressions 'local-name(.)' and '.' in the column definitions use the dot to refer to whatever the current child element is.

SELECT T.OrderNo, T.node, T.value
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item/*'
       COLUMNS
         OrderNo INTEGER     PATH '../../@PoNum',
         node    VARCHAR(10) PATH 'local-name(.)',
         value   VARCHAR(40) PATH '.' ) AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

ORDERNO    NODE       VALUE
---------- ---------- --------------------------------------
      5000 partid     100-100-01
      5000 name       Snow Shovel, Basic 22 inch
      5000 quantity   3
      5000 price      9.99
      5000 partid     100-103-01
      5000 name       Snow Shovel, Super Deluxe 26 inch
      5000 quantity   5
      5000 price      49.99

8 record(s) selected.

Figure 8.50 Producing a list of element names and values
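The idea of treating element names as data can also be tried outside DB2. The following Python sketch, an analogy using xml.etree.ElementTree, walks the child elements of each item and reports name and value alongside the order number. The document is an abbreviated, assumed version of purchase order 5000 containing only its first item.

```python
import xml.etree.ElementTree as ET

# Abbreviated, assumed version of purchase order 5000 (first item only).
order = ET.fromstring(
    '<PurchaseOrder PoNum="5000"><item><partid>100-100-01</partid>'
    '<name>Snow Shovel, Basic 22 inch</name><quantity>3</quantity>'
    '<price>9.99</price></item></PurchaseOrder>'
)

# One row per child element of each item: (order number, element name, element value).
rows = [(int(order.get("PoNum")), child.tag, child.text)
        for item in order.findall("item") for child in item]

for row in rows:
    print(row)
```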

Similarly you can use the function local-name to produce a list of all tags that occur in a given document. This is shown in Figure 8.51. The row-generating expression of the XMLTABLE function is //(*, @*). To understand what this means, remember that //* selects all elements at all levels of the document, and //@* selects all attributes at all levels of the document. In the expression //(*, @*) the parentheses and the comma construct a sequence that combines all elements and all attributes at all levels. In short, the row-generating expression produces all elements and attributes of the document. The column seq indicates the order in which the nodes appear in the document, and the column node produces their names. The column type determines whether the node is an attribute, an element, or a leaf element. The if-then-else expression uses the node test self::attribute() which evaluates to true if the node is an attribute. The else branch contains another if-then-else expression to check whether the current node has any element children. If yes, it must be an element itself. Otherwise it’s considered a leaf-element.

SELECT T.*
FROM purchaseorder,
     XMLTABLE('$PORDER//(*, @*)'
       COLUMNS
         seq  FOR ORDINALITY,
         node VARCHAR(20) PATH 'local-name(.)',
         type VARCHAR(15) PATH 'if (self::attribute())
                                then "Attribute"
                                else (if (./*)
                                      then "Element"
                                      else "Leaf-Element")'
     ) AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5000]');

SEQ         NODE               TYPE
----------- ------------------ ------------------
          1 PurchaseOrder      Element
          2 PoNum              Attribute
          3 OrderDate          Attribute
          4 Status             Attribute
          5 item               Element
          6 partid             Leaf-Element
          7 name               Leaf-Element
          8 quantity           Leaf-Element
          9 price              Leaf-Element
         10 item               Element
         11 partid             Leaf-Element
         12 name               Leaf-Element
         13 quantity           Leaf-Element
         14 price              Leaf-Element

14 record(s) selected.

Figure 8.51 Producing a list of all element and attribute names

8.7.5 Date and Time Functions

Some noteworthy date and time functions are listed in Table 8.5.

Table 8.5 Commonly Used Date and Time Functions

Date and Time Functions    Description
adjust-date-to-timezone    The function fn:adjust-date-to-timezone adjusts an xs:date value to a specific time zone, or removes the timezone component from the value. Similar functions exist for xs:time and xs:dateTime values.
current-date, current-time, current-dateTime    These functions return the current date, time, or date and time in the UTC timezone (UTC = Coordinated Universal Time, which corresponds to Greenwich Mean Time).
current-local-date, current-local-time, current-local-dateTime    These functions return the current date, time, or date and time in the local time zone of the operating system, without time zone indicator. (DB2 for Linux, UNIX, and Windows, version 9.5 FP5, and 9.7 FP1.)
dateTime                   The function fn:dateTime constructs an xs:dateTime value from an xs:date value and an xs:time value.
day-from-date              The function fn:day-from-date returns the day component of an xs:date value. Similar functions exist to extract the month or year from an xs:date value, or to extract the hours, minutes, seconds, or timezone from xs:time or xs:dateTime values.

An example of an SQL/XML query that manipulates dates is shown in Figure 8.52. The goal of the query is to list the identifier, order date, year, and age of all orders that are older than 90 days. Let's look at the predicate in the WHERE clause first. The predicate selects all orders whose OrderDate attribute is less than the current date minus 90 days. The string literal P90D denotes a duration of 90 days. The P is the duration indicator, and 90D specifies the length of the duration. Similarly, the string P2DT5H45M could be used to denote a duration of 2 days, 5 hours, and 45 minutes. Any such duration string needs to be cast to the type xdt:dayTimeDuration to be interpreted as a duration and not as xs:string. This casting allows you to subtract the duration from the current date to produce a date in the past (90 days ago).

For each matching order, the XMLTABLE function in Figure 8.52 extracts the OrderDate, the year portion of the date, and the age of the order. The age is calculated by subtracting the current date from the order date. Subtraction of one date from another produces a duration. In this example, the returned durations are negative, because current-date() is always larger than any existing OrderDate. The query result shows, for example, that purchase order 5000 was placed 1069 days before the date returned by current-date().


SELECT poid, CURRENT DATE as today, T.odate, T.year, T.age
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder'
       COLUMNS
         odate DATE     PATH '@OrderDate',
         year  CHAR(4)  PATH 'year-from-date(@OrderDate)',
         age   CHAR(15) PATH 'xs:date(@OrderDate) - current-date()'
     ) as T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate <
                 current-date() - xdt:dayTimeDuration("P90D")]');

POID        TODAY      ODATE      YEAR AGE
----------- ---------- ---------- ---- ---------------
       5000 01/21/2009 02/18/2006 2006 -P1069D
       5001 01/21/2009 02/03/2005 2005 -P1449D
       5002 01/21/2009 02/29/2004 2004 -P1789D
       5003 01/21/2009 02/28/2005 2005 -P1424D
       5004 01/21/2009 11/18/2005 2005 -P1161D
       5006 01/21/2009 03/01/2006 2006 -P1058D

  6 record(s) selected.

Figure 8.52

Using date types and functions
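The duration notation used in Figure 8.52 can also be explored outside the database. The following is a minimal Python sketch, not DB2 code, that parses day-time duration strings such as P90D or P2DT5H45M into Python timedelta values. The function name parse_day_time_duration is hypothetical, and the sketch covers only days, hours, minutes, and seconds.

```python
import re
from datetime import timedelta

# Pattern for strings such as "P90D", "P2DT5H45M", or "-PT8H".
# Assumption: only the day-time subset (D, H, M, S) is handled here.
_DURATION = re.compile(
    r"(-)?P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?"
)


def parse_day_time_duration(text):
    """Turn a day-time duration string into a timedelta (hypothetical helper)."""
    match = _DURATION.fullmatch(text)
    # Reject strings that do not match or that carry no component at all.
    if match is None or not any(match.groups()[1:]):
        raise ValueError(f"not a day-time duration: {text!r}")
    sign, days, hours, minutes, seconds = match.groups()
    delta = timedelta(
        days=int(days or 0),
        hours=int(hours or 0),
        minutes=int(minutes or 0),
        seconds=float(seconds or 0),
    )
    return -delta if sign else delta


ninety_days = parse_day_time_duration("P90D")
mixed = parse_day_time_duration("P2DT5H45M")
eight_hours_back = parse_day_time_duration("-PT8H")
```

As in the query, the leading P marks the string as a duration, and the T separates the day part from the time part.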

Note that current-date() produces the current date in UTC. If you live in California, where the local time is eight hours behind UTC, then from 4 p.m. onwards current-date() gives you tomorrow's date. New functions to produce the local date and time are being added (refer to Table 8.5), but you can also use XQuery functions to adjust a date or a time to a given time zone, such as in the following query:

xquery adjust-date-to-timezone(current-date(), xdt:dayTimeDuration("-PT8H"));
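The date shift between local time and UTC can be reproduced with a few lines of Python. This is an illustrative sketch rather than DB2 code. It assumes a fixed offset of eight hours behind UTC and a hypothetical query time of 4 p.m. on January 21, 2009. Under that assumption the UTC date is already January 22, which would be consistent with the age of 1069 days shown for order 5000 in Figure 8.52 even though CURRENT DATE reports January 21.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed offset of eight hours behind UTC (assumption: no daylight saving time).
PACIFIC = timezone(timedelta(hours=-8))


def utc_date(local_moment):
    """Return the calendar date in UTC for a timezone-aware local timestamp."""
    return local_moment.astimezone(timezone.utc).date()


# Hypothetical moments at which a query might run on January 21, 2009.
morning = datetime(2009, 1, 21, 9, 0, tzinfo=PACIFIC)
afternoon = datetime(2009, 1, 21, 16, 0, tzinfo=PACIFIC)

# Order date of purchase order 5000, taken from Figure 8.52.
order_date = date(2006, 2, 18)

# Same subtraction as in the query: order date minus current (UTC) date.
age_in_the_morning = (order_date - utc_date(morning)).days
age_in_the_afternoon = (order_date - utc_date(afternoon)).days
```

With the morning timestamp the difference is one day smaller in magnitude than with the afternoon timestamp, because only the latter has crossed midnight in UTC.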

8.7.6

Boolean Functions

Finally, XQuery Boolean functions are listed in Table 8.6. An example of using the function fn:false is shown in Figure 8.34 in section 8.5 of this chapter. The use of the function fn:not() was discussed in the context of XPath in section 6.9. Please refer to these sections for examples.

Table 8.6

Commonly Used Boolean Functions

Boolean Functions | Description
not | The function fn:not returns false if the effective Boolean value of a sequence is true, and true if the effective Boolean value of a sequence is false.
false | The function fn:false returns the value false.
true | The function fn:true returns the value true.

8.8


EMBEDDING SQL IN XQUERY

In section 6.5, How to Execute XPath in DB2, we explained how the function db2-fn:sqlquery lets you embed SQL in XPath queries. The same works in XQuery FLWOR expressions and it allows you to include relational predicates in your XQuery. You can even pass parameters from the outer XQuery to the embedded SQL statement. Remember that the embedded SQL statement has to return a single column of type XML. For the following examples, note that the table purchaseorder has several relational columns that contain values extracted from the XML document in the same row.

CREATE TABLE purchaseorder(poid      BIGINT,
                           status    VARCHAR(10),
                           custid    BIGINT,
                           orderdate DATE,
                           porder    XML);

An interesting pair of queries is shown in Figure 8.53. The first query is an SQL/XML statement that uses the XMLQUERY function in the SELECT clause to compute the sum of the item prices of any selected order. The WHERE clause restricts the result set to those orders in the table where the relational column status has the value Unshipped, the column orderdate has the value 2006-02-18, and the order information in the XML column contains at least one item with a price greater than 40. For each of these orders, the query computes the sum of all item prices.

The second query is a FLWOR expression that produces the same result from our sample data. Its input is defined by the function db2-fn:sqlquery, which produces the sequence of XML documents that are selected by the embedded SQL statement. This allows you to use relational predicates in an XQuery. The XQuery iterates with the for clause over the PurchaseOrder elements of these input documents. For each such element it evaluates the XML predicate on price and returns the sum of item prices for any matching order.

SELECT XMLQUERY('$PORDER/PurchaseOrder/sum(item/price)')
FROM purchaseorder
WHERE status = 'Unshipped'
  AND orderdate = '2006-02-18'
  AND XMLEXISTS('$PORDER/PurchaseOrder/item[price > 40]');

xquery
for $i in db2-fn:sqlquery("SELECT porder
                           FROM purchaseorder
                           WHERE status = 'Unshipped'
                             AND orderdate = '2006-02-18'"
                         )/PurchaseOrder
where $i/item[price > 40]
return sum($i/item/price);

Figure 8.53

Two queries that produce the same result

There is typically no signiﬁcant performance difference between the two queries in Figure 8.53. Both can use an XML index on /PurchaseOrder/item/price and relational indexes on status and orderdate at the same time.


Let's extend the previous example slightly to illustrate parameter passing from XQuery to the enclosed SQL statement. Assume you want to return all orders that have the same shipping status and order date as the purchase order with number 5000. The XQuery in Figure 8.54 does that easily. It uses the for and where clauses to select purchase order 5000 and assign it to the variable $i. The return clause then produces the sequence of all orders where the relational columns status and orderdate have the same value as $i/@Status and $i/@OrderDate respectively.

The functions parameter(1) and parameter(2) can only be used in SQL statements inside the db2-fn:sqlquery function. They refer to the XQuery expressions that are provided as additional arguments to the db2-fn:sqlquery function, according to the order in which they appear. That is, $i/@Status is bound to parameter(1) and $i/@OrderDate to parameter(2). Effectively, this is a self-join on the purchaseorder table.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder
where $i/@PoNum = 5000
return db2-fn:sqlquery("SELECT porder
                        FROM purchaseorder
                        WHERE status = parameter(1)
                          AND orderdate = parameter(2)",
                       $i/@Status, $i/@OrderDate );

Figure 8.54

XQuery that contains an SQL statement with parameters

Figure 8.55 shows how you can code the same self-join in SQL/XML notation without any XQuery concepts beyond XPath. The FROM clause contains two references to the purchaseorder table, p1 and p2. The alias p1 is used in the XMLTABLE function to find purchase order 5000 and to extract Status and OrderDate from it. These generated relational columns are then joined with alias p2 in the WHERE clause to find all orders with the same status and date. The queries in Figure 8.54 and Figure 8.55 look very different from each other, but the DB2 query compiler generates the same execution plan for both.

SELECT p2.porder
FROM purchaseorder p1, purchaseorder p2,
     XMLTABLE('$po1/PurchaseOrder[@PoNum = 5000]'
       passing p1.porder as "po1"
       COLUMNS
         status    VARCHAR(10) PATH '@Status',
         orderdate DATE        PATH '@OrderDate'
     ) AS T
WHERE p2.status = T.status
  AND p2.orderdate = T.orderdate;

Figure 8.55

A different notation for the same self-join as in Figure 8.54

8.9


USING SQL FUNCTIONS AND USER-DEFINED FUNCTIONS IN XQUERY

There are many built-in SQL functions that are not part of the XQuery language. For example, functions such as sqrt (square root), rand (random number), or cos (cosine) are available as SQL functions in DB2 but they are not available as built-in XQuery functions. Additionally, you might have developed your own user-defined functions (UDFs), either in the SQL Procedural Language (SQL PL) or in an external programming language such as Java or C. It is possible to use such functions from the SQL world within XQuery expressions. The trick is to use the db2-fn:sqlquery function to embed SQL functions in XQuery.

Assume that you have a legacy application that processes partid values, which are product identifiers, in a different format. For example, a partid such as 100-103-01 needs to be converted to 01(100)103. This is achieved by the UDF in Figure 8.56. It breaks a given partid into its three pieces and assembles them in a different way to meet the requirements of the legacy system.

CREATE FUNCTION convert(partid VARCHAR(15))
RETURNS VARCHAR(15)
BEGIN ATOMIC
  DECLARE p1, p2, p3, new VARCHAR(10) DEFAULT '';
  SET p1 = substr(partid,1,3);
  SET p2 = substr(partid,5,3);
  SET p3 = substr(partid,9,2);
  SET new = p3||'('||p1||')'||p2;
  RETURN new;
END#

Figure 8.56

User-deﬁned function to convert product identiﬁers
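To check the substring positions used in Figure 8.56 without a database at hand, the same logic can be mirrored in a few lines of Python. This is only a sketch of the string manipulation, not a replacement for the SQL UDF. Note that SQL substr counts positions from 1 while Python slices count from 0.

```python
def convert(partid):
    """Mirror of the SQL UDF in Figure 8.56 (illustrative Python version)."""
    p1 = partid[0:3]   # substr(partid,1,3): first block
    p2 = partid[4:7]   # substr(partid,5,3): second block, after the first dash
    p3 = partid[8:10]  # substr(partid,9,2): third block, after the second dash
    return f"{p3}({p1}){p2}"


# The two partid values that occur in purchase order 5000.
converted = [convert(pid) for pid in ("100-100-01", "100-103-01")]
```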

The FLWOR expression in Figure 8.57 uses this UDF in its let clause to convert every partid in purchase order 5000 to the different format. The db2-fn:sqlquery function contains an SQL statement, which in this case is simply a VALUES clause. Since the result of the embedded SQL statement must be of type XML, the XMLTEXT function is used to turn the VARCHAR result value of the function convert into an XML text node. The convert function takes a single parameter, which has to be cast to the input type of the function, that is, VARCHAR(15). The expression $i/partid provides the actual value that is passed into the convert function.


xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
let $new := db2-fn:sqlquery("
              VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
              $i/partid)
where $i/../@PoNum = 5000
return {$i/partid/text()}{$new};

100-100-0101(100)100
100-103-0101(100)103

  2 record(s) selected.

Figure 8.57

Using an SQL UDF within an XQuery

You can use the db2-fn:sqlquery function wherever built-in XQuery functions are allowed. Figure 8.58 gives you a couple of ideas. The first FLWOR expression uses the db2-fn:sqlquery function in the construction of the element new. Note that it has to be in curly brackets so that it gets properly evaluated and not treated as a literal string. The second XQuery uses db2-fn:sqlquery in a path expression. The XPath in the return clause is $i/PurchaseOrder/item/partid except that the db2-fn:sqlquery function is applied to the last step, partid.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")/PurchaseOrder/item
where $i/../@PoNum = 5000
return {$i/partid/text()}
       { db2-fn:sqlquery("
           VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
           $i/partid) };

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/@PoNum = 5000
return $i/PurchaseOrder/item/db2-fn:sqlquery("
         VALUES(XMLTEXT(convert(CAST(parameter(1) AS VARCHAR(15)))))",
         partid);

Figure 8.58

Further examples of using the db2-fn:sqlquery function

8.10

SUMMARY

XQuery is a powerful query language for XML data. XPath is a subset of the XQuery language and used in every XQuery expression that accesses XML documents. Hence, XPath is a critical part of XQuery.


One of the most commonly used expressions in XQuery is the FLWOR expression, which is named after its keywords for, let, where, order by, and return. The for clause of a FLWOR expression lets you iterate over documents, elements, attributes, atomic values, or any sequence of items in the XQuery data model. In each iteration, a variable is assigned to the next item in the sequence for further manipulation. The let clause allows you to assign an entire sequence, such as an intermediate result, to a single variable. The where and order by clauses are used to filter and sort the result of the FLWOR expression. The result is then returned by the return clause, possibly with further manipulation. FLWOR expressions can express queries over sets of documents, perform joins across documents, and combine data from multiple XML documents or different parts of a single document into a query result.

Other important expressions in XQuery include constructor expressions, such as direct element and attribute constructors, which are used to create XML nodes and construct new XML documents within a query. Conditional expressions (if-then-else) allow for advanced logic. Additionally, XQuery supports cast expressions, arithmetic expressions, logical and comparison operators, and sequence and transform expressions. XQuery also offers a rich set of built-in functions, such as string functions, numeric functions, aggregation functions, and date and time functions.

Not every XML application requires XQuery. Many applications are well-served with the combined power of XPath and SQL. In fact, many queries in XQuery notation can also be expressed in SQL/XML with embedded XPath.


CHAPTER 9

Querying XML Data: Advanced Queries & Troubleshooting

In this chapter we discuss advanced XML query topics, common errors, and guidelines for avoiding performance pitfalls. The examples include both XQuery and SQL/XML queries. This chapter is organized along the following topics:

• Aggregation and grouping in XML queries (section 9.1)
• Joins between XML columns as well as joins between XML and relational data (section 9.2)
• XML queries with case-insensitive string predicates (section 9.3)
• Guidelines for avoiding common performance problems (section 9.4)
• Common errors in XML queries and how to resolve them (section 9.5)

9.1

AGGREGATION AND GROUPING OF XML DATA

The recommended and most efﬁcient way to perform grouping and aggregation of XML data is to use the XMLTABLE function to extract XML values to relational columns, and then to apply the SQL GROUP BY clause and SQL aggregation functions to these columns. The XQuery 1.0 language by itself, speciﬁcally the FLWOR expression, does not have a GROUP BY clause. This shortcoming makes grouping more difﬁcult in XQuery than SQL, although not entirely impossible. In the following we discuss grouping and aggregation queries that use the purchase order sample data as input. A sample document is shown in Figure 9.1.


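<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped">
  <item>
    <partid>100-100-01</partid>
    <name>Snow Shovel, Basic 22 inch</name>
    <quantity>3</quantity>
    <price>9.99</price>
  </item>
  <item>
    <partid>100-103-01</partid>
    <name>Snow Shovel, Super Deluxe 26 inch</name>
    <quantity>5</quantity>
    <price>49.99</price>
  </item>
</PurchaseOrder>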

Figure 9.1

Sample document in the purchaseorder table

9.1.1

Aggregation and Grouping Queries with XMLTABLE

As an example, let's determine the number of purchase orders per year since 2004. This is done in Figure 9.2. The XMLTABLE function together with the year-from-date function produces a relational column year of type CHAR(4). This year column is then used in both the SELECT clause and in the GROUP BY clause, as you normally would with relational columns. The relational COUNT() function produces the desired aggregation. The XMLEXISTS predicate in the WHERE clause ensures that the query only looks at orders that were placed in 2004 or later.

SELECT year, COUNT(*) AS num_orders
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder'
       COLUMNS
         year CHAR(4) PATH 'year-from-date(@OrderDate)') AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate >= xs:date("2004-01-01")]')
GROUP BY year;

YEAR NUM_ORDERS
---- -----------
2004           1
2005           3
2006           2

  3 record(s) selected.

Figure 9.2

Using SQL group by and aggregation on extracted XML values
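The extract-then-group pattern of Figure 9.2 can be sketched in Python to make the two steps explicit: first pull a scalar value (the year) out of each order, then group and count on that value. The order dates below are the six dates listed in Figure 8.52. Treating them as the complete contents of the table is an assumption made for this illustration, and the function name orders_per_year is hypothetical.

```python
from collections import Counter
from datetime import date

# Order dates as listed in Figure 8.52 (assumed to be all orders in the table).
order_dates = [
    date(2006, 2, 18),
    date(2005, 2, 3),
    date(2004, 2, 29),
    date(2005, 2, 28),
    date(2005, 11, 18),
    date(2006, 3, 1),
]


def orders_per_year(dates, since):
    """Filter first (like XMLEXISTS), extract the year, then group and count."""
    counts = Counter(d.year for d in dates if d >= since)
    return sorted(counts.items())


result = orders_per_year(order_dates, date(2004, 1, 1))
```

The filter plays the role of the XMLEXISTS predicate, the year extraction corresponds to the XMLTABLE column, and the Counter stands in for GROUP BY with COUNT(*).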


This pattern of writing XML queries has proven very useful. The XMLTABLE function raises selected values from the XML level to the SQL level, and then you can apply SQL functions and groupings to these values as you normally do in purely relational queries.

Let's apply this pattern to another business question: What is the total value of shipped and unshipped items that were ordered in 2006? The answer is computed by the query in Figure 9.3. To write this query, you might want to start with the WHERE clause to restrict the orders to 2006. The path expression in the XMLEXISTS predicate navigates to the OrderDate attribute and checks whether it is greater than or equal to the first day of 2006, and less than or equal to the last day of 2006. Note that both dots in the predicate refer to the OrderDate attribute, which is the current node in the navigation.

NOTE

In the XMLEXISTS predicate, don't use the year-from-date function to restrict the orders to 2006, because that function would prevent the use of an XML index that might exist on the OrderDate attribute.

While the WHERE clause takes care of the filtering, the XMLTABLE function extracts the data items needed to aggregate the value of shipped and unshipped items. For each item in an order it produces one row with the item price, quantity, and shipping status. This allows you to use SQL concepts to group by the status and to sum the item values. The value of an item in an order is the item price multiplied by its quantity.

SELECT orderstatus, SUM(itemprice * itemqty) AS value
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         orderstatus VARCHAR(10)  PATH 'upper-case(../@Status)',
         itemprice   DECIMAL(9,2) PATH 'price',
         itemqty     INTEGER      PATH 'quantity') AS T
WHERE XMLEXISTS('$PORDER/PurchaseOrder/@OrderDate[
                   . >= xs:date("2006-01-01") and
                   . <= xs:date("2006-12-31")]')
GROUP BY orderstatus;

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate > xs:date("2006-01-01")
                 and item/price >= 20 and item/price < 30 ]');

<PurchaseOrder PoNum="5000" OrderDate="2006-02-18" Status="Unshipped"><item><partid>100-100-01</partid><name>Snow Shovel, Basic 22 inch</name><quantity>3</quantity><price>9.99</price></item><item><partid>100-103-01</partid><name>Snow Shovel, Super Deluxe 26 inch</name><quantity>5</quantity><price>49.99</price></item></PurchaseOrder>

  1 record(s) selected.

Figure 9.30

Wrong way to write a between predicate

Both SQL/XML statements in Figure 9.31 write the "between" condition correctly and ensure that both range predicates are applied to the same item price. In the expression item/price[. >= 20 and . < 30], both dots refer to the same price element. Hence, this query selects orders that have at least one item with at least one price element whose value is indeed between 20 and 30. (No such order exists in the sample database.) Based on this notation, DB2 knows that both range predicates are always applied to the same XML node. This allows DB2 to evaluate both predicates with a single start-stop scan (start at 20, stop at 30) over an XML index defined on the price element.

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate > xs:date("2006-01-01")
                 and item/price[. >= 20 and . < 30]]');

SELECT porder
FROM purchaseorder
WHERE XMLEXISTS('$PORDER/PurchaseOrder[@OrderDate > xs:date("2006-01-01")]
                 /item/price[. >= 20 and . < 30]');

  0 record(s) selected.

Figure 9.31

Correct way to write a between predicate
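The difference between the two notations comes down to whether both comparisons have to hold for the same price node. The following Python sketch imitates that difference on a cut-down copy of the sample order, with prices 9.99 and 49.99. It uses only the standard library, the function names are hypothetical, and it models the comparison semantics only, not how DB2 evaluates the query.

```python
import xml.etree.ElementTree as ET

# Cut-down sample order: only partid and price are kept for this illustration.
ORDER = ET.fromstring("""
<PurchaseOrder>
  <item><partid>100-100-01</partid><price>9.99</price></item>
  <item><partid>100-103-01</partid><price>49.99</price></item>
</PurchaseOrder>
""")


def prices(order):
    """All item prices of one order as floats."""
    return [float(p.text) for p in order.findall("item/price")]


def between_wrong(order, low, high):
    # Like item/price >= 20 and item/price < 30:
    # each comparison may be satisfied by a different price.
    values = prices(order)
    return any(p >= low for p in values) and any(p < high for p in values)


def between_right(order, low, high):
    # Like item/price[. >= 20 and . < 30]:
    # both comparisons are applied to the same price.
    return any(low <= p < high for p in prices(order))


wrong_match = between_wrong(ORDER, 20, 30)
right_match = between_right(ORDER, 20, 30)
```

Here 49.99 satisfies the lower bound and 9.99 satisfies the upper bound, so the first form reports a match although no single price lies between 20 and 30.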

If each item element has at most one price element, then the expression item[price >= 20 and price < 30] also selects the correct query result. However, DB2 does not know that each item has at most one price and therefore cannot apply a single start-stop index scan. Instead, DB2 has to use two separate index scans plus an index ANDing operator to combine the results (see Table 9.1). This is less efficient. Therefore, it is always recommended to write "between" predicates with the "dot" (current context), as shown in Figure 9.31. Further details on XML index usage and execution plans are provided in Chapters 13 and 14.

Table 9.1

Optimal (left) and Suboptimal Execution Plan (right)

Predicate | Execution plan
price[. >= 20 and . < 30] | RETURN, NLJOIN over FETCH (RIDSCN, SORT, XISCAN) and XSCAN; a single index scan that starts at 20 and stops at 30
[price >= 20 and price < 30] | Two index scans, one for price >= 20 and one for price < 30, combined by index ANDing

9.4.3

Large Global Sequences

Figure 9.32 provides another example of how you should not write queries. The idea of this query comes from a real XML application, but is changed here to fit the purchase order data. The query starts with a let clause and assigns the sequence of all purchase order items in the table to the variable $allitems. This is the first of multiple problems in this query. Unless the table is tiny, the sequence in $allitems is typically very large. Using let to combine items from all (or many) documents in the entire table often results in suboptimal performance.

The next step of the query, for $pid…, iterates over the distinct partid values of all the item elements in the sequence $allitems. For each distinct partid it returns a constructed XML element prod_info that contains the partid (produced by $pid) as well as the name and the price of the item. Note how the name and the price are obtained for each distinct partid; that is, for each value of $pid. The variable $pid is used to probe back into the sequence $allitems to find all items with a matching partid. This probe happens in the predicate $allitems[partid = $pid]. The same is done for price.

This coding is not straightforward, needlessly complex, and bad for performance. In particular, the big sequence $allitems is a large temporary object and not indexed. Hence, the predicates in the return clause ([partid = $pid]) both require a sequential scan over all items in all purchase orders, for each $pid. An analogy in the relational world would be a query that copies all rows from a table to a temporary table, then performs a "select distinct" on that table to obtain a set of keys, and then a table scan on the temp table for each of these keys.

xquery
let $allitems := ( for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
                   return $i/PurchaseOrder/item )
for $pid in distinct-values($allitems/partid)
order by $pid
return {distinct-values($allitems[partid = $pid]/name)}
       {distinct-values($allitems[partid = $pid]/price)} ;

Figure 9.32

Expensive usage of large sequences
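The cost pattern of Figure 9.32 can be imitated in Python to see why it scales poorly: every distinct partid triggers two full scans of the unindexed sequence, one for the name and one for the price. The sketch below counts the items visited and, for contrast, removes duplicates in a single pass. The item tuples reuse values from the sample order, the repeated first item stands for a hypothetical second order, and all function names are illustrative.

```python
# (partid, name, price) tuples; the third entry repeats the first one to
# represent the same product appearing in a second, hypothetical order.
ITEMS = [
    ("100-100-01", "Snow Shovel, Basic 22 inch", 9.99),
    ("100-103-01", "Snow Shovel, Super Deluxe 26 inch", 49.99),
    ("100-100-01", "Snow Shovel, Basic 22 inch", 9.99),
]


def probe(items, pid, field):
    """Scan every item to collect the distinct values of one field for one partid."""
    values = sorted({item[field] for item in items if item[0] == pid})
    return values, len(items)  # second value: number of items visited


def rescan_per_key(items):
    """Imitates the large-sequence query: two full scans per distinct partid."""
    visited = 0
    rows = []
    for pid in sorted({item[0] for item in items}):
        names, n1 = probe(items, pid, 1)
        item_prices, n2 = probe(items, pid, 2)
        visited += n1 + n2
        rows.append((pid, names, item_prices))
    return rows, visited


def single_pass(items):
    """Imitates the rewrite: one pass that removes duplicate tuples."""
    return sorted(set(items))


rows, visited = rescan_per_key(ITEMS)
distinct_rows = single_pass(ITEMS)
```

With k distinct partid values and n items, the first approach visits 2 × k × n items, while the second touches each item once.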

The result of the query in Figure 9.32 is simply the partid, name, and price for all distinct items that occur in the purchase orders. The same result can be computed in a much easier way, as shown in Figure 9.33. This query simply generates one tuple for each item element and uses the SQL keyword DISTINCT to remove duplicates. In the original application, this rewrite improved performance by two orders of magnitude. The rewritten query is also easier to understand.

SELECT DISTINCT T.pid, T.name, T.price
FROM purchaseorder,
     XMLTABLE('$PORDER/PurchaseOrder/item'
       COLUMNS
         pid   VARCHAR(10)  PATH 'partid',
         name  VARCHAR(50)  PATH 'name',
         price DECIMAL(9,2) PATH 'price') as T;

Figure 9.33

Rewritten query avoids large intermediate sequences

9.4.4

Multilevel Nesting SQL and XQuery

A general guideline is to introduce only as much complexity in your queries as you really need. For example, it is certainly possible to have an XQuery with an embedded SQL statement that has an embedded XQuery, and so on. But experience shows that nesting the two languages more than one level deep is usually not needed to express the desired query logic. Therefore, we recommend using only one level of embedding XQuery into SQL or vice versa. As a result, queries are easier to understand and to maintain, and often also easier to optimize and execute for DB2.

Figure 9.34 shows an example of an XQuery with an embedded SQL statement, which in turn has embedded XQuery expressions in the XMLQUERY function and XMLEXISTS predicate. The embedded SQL statement produces the purchase order elements from all orders that belong to customer 1001 and whose PoNum attribute has the value 5002. For those orders, the XQuery checks whether the Status is Shipped and returns all order items in a newly constructed element POitems. Using XQuery within the SQL statement and around the SQL statement is needlessly complex.

xquery
for $i in db2-fn:sqlquery("
            SELECT XMLQUERY('$PORDER/PurchaseOrder')
            FROM purchaseorder
            WHERE custid = 1001
              AND XMLEXISTS('$PORDER/PurchaseOrder[@PoNum=5002]') ")
where $i[@Status="Shipped"]
return <POitems>{$i/item}</POitems>;

Figure 9.34

Unnecessary double-nesting of XQuery and SQL

To simplify the query in Figure 9.34, you can choose to either have all XML manipulation outside of the SQL query or all XML manipulation embedded within the SQL query. Both options are demonstrated in Figure 9.35. In the first query in Figure 9.35, all XML operations are pulled out of the SQL statement and into the surrounding XQuery. In the second query, all XML operations are pushed from the surrounding XQuery into the SQL statement.

xquery
for $i in db2-fn:sqlquery("SELECT porder
                           FROM purchaseorder
                           WHERE custid = 1001")
where $i/PurchaseOrder[@PoNum = 5002 and @Status="Shipped"]
return <POitems>{$i/PurchaseOrder/item}</POitems>;

SELECT XMLQUERY('<POitems>{$PORDER/PurchaseOrder/item}</POitems>')
FROM purchaseorder
WHERE custid = 1001
  AND XMLEXISTS('$PORDER/PurchaseOrder[@PoNum = 5002 and @Status="Shipped"]');

Figure 9.35

Two simpler versions of the query in Figure 9.34

9.5

COMMON ERRORS AND HOW TO AVOID THEM

This section lists some common error messages that you might encounter when you run XML queries. We discuss probable causes and ways to resolve the problems. DB2 has more than 250 XML-related error messages and we cannot discuss all of them here. Additionally, a speciﬁc error message might have multiple different causes and we cannot describe all of them in this section. Therefore we look at a few select queries, their errors, and how to ﬁx them.


Error messages related to XML processing have numbers in the 16000-range of messages and SQL Codes. That is, the SQL Codes related to XML processing errors are -16000, -16001, -16002, and so on. This is the same in DB2 for z/OS and DB2 for Linux, UNIX, and Windows. Additionally, in DB2 for Linux, UNIX, and Windows the error messages for these SQL Codes are numbered SQL16000N, SQL16001N, SQL16002N, and so on. Each error message raised by a faulty XML query also contains an error code, such as err:XPDY0002, which is the error code deﬁned by the W3C. These error codes are listed at http://www.w3.org/2005/xqt-errors/, and you can also search for them in the DB2 information center.

9.5.1 SQL16001N

Figure 9.36 and Figure 9.37 show queries that fail at compile time with error SQL16001N, which indicates that an XPath or XQuery expression does not have a context; that is, the path does not have a proper starting point. In Figure 9.36, INFO is not a valid context, because the XML column name is only recognized if coded as a variable that starts with a $ sign ($INFO).

SELECT info
FROM customer
WHERE XMLEXISTS('INFO/customerinfo[name="Matt Foreman"]');

SQL16001N An XQuery expression starting with token "INFO" cannot be processed because the focus component of the dynamic context has not been assigned. Error QName=err:XPDY0002. SQLSTATE=10501

Figure 9.36

Use $INFO instead of INFO to avoid this error

In Figure 9.37, the path in the return clause starts with /addr, but no context is provided to indicate from where this expression should navigate to the addr element. The correct coding in this query is $c/addr instead of /addr.

xquery
for $c in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return /addr[@country];

SQL16001N An XQuery expression starting with token "/" cannot be processed because the focus component of the dynamic context has not been assigned. Error QName=err:XPDY0002. SQLSTATE=10501

Figure 9.37

The path in the return clause should start with $c

9.5.2 SQL16002N

The error SQL16002N happens at compile time whenever the query parser encounters a keyword or symbol that is unexpected or not recognized. This can happen in many different cases. The query in Figure 9.38 fails because the uppercase keyword FOR is not valid. It has to be lowercase.

xquery
FOR $d IN db2-fn:xmlcolumn ("customer.info")/customerinfo
RETURN $d;

SQL16002N An XQuery expression has an unexpected token "d" following "FOR $". Expected tokens may include: "". Error QName=err:XPST0003. SQLSTATE=10505

Figure 9.38

The keywords for, in, and return must be lowercase

In Figure 9.39, the expression $INFO/customerinfo/ must not end with a slash (/). The slash starts another step in the XPath expression and must be followed by an element name, attribute name, wildcard (*), function name, and so on. Hence the empty string "" after the / is not expected.

SELECT T.*
FROM customer,
     XMLTABLE('$INFO/customerinfo/'
       COLUMNS
         name VARCHAR(20) PATH 'name',
         city VARCHAR(20) PATH 'addr/city' ) as T;

SQL16002N An XQuery expression has an unexpected token "" following "$INFO/customerinfo". Expected tokens may include: "".

Figure 9.39

To avoid this error remove the / after customerinfo

Furthermore, a slash cannot be followed by the square bracket that begins a predicate. Therefore the square bracket in Figure 9.40 causes error SQL16002N.

SELECT XMLQUERY('$INFO/customerinfo/name')
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo/[addr/city = "Aurora"]');

SQL16002N An XQuery expression has an unexpected token "[" following "tomerinfo/". Expected tokens may include: "".

Figure 9.40

A predicate must not be preceded by a slash (/)

9.5.3 SQL16003N

Error SQL16003N happens during query execution; that is, at runtime and not at compile time. It indicates that DB2 has encountered a value of a certain data type that is not valid in this situation. The query in Figure 9.41 fails because a sequence of multiple phone elements cannot be cast to a single SQL value. In this error message, the notation ( item(), item()+ ) is a regular expression that represents a sequence of one item followed by one or more items. In total that's two or more items, but only a single item is allowed here.

SELECT T.*
FROM customer,
     XMLTABLE('$INFO/customerinfo'
       COLUMNS
         custname VARCHAR(20) PATH 'name',
         phone    VARCHAR(15) PATH 'phone') AS T;

SQL16003N An expression of data type "( item(), item()+ )" cannot be used when the data type "VARCHAR_15" is expected in the context. Error QName=err:XPTY0004. SQLSTATE=10507

Figure 9.41

Cannot cast multiple phone numbers to a single VARCHAR value

Figure 9.42 shows a query that fails because it tries to compare a value of type xs:date with the value "2006-02-18Z" of type xs:string, which is not allowed.

xquery
for $i in db2-fn:xmlcolumn("PURCHASEORDER.PORDER")
where $i/PurchaseOrder/xs:date(@OrderDate) = "2006-02-18Z"
return $i;

SQL16003N An expression of data type "xs:string" cannot be used when the data type "xs:date" is expected in the context. Error QName=err:XPTY0004. SQLSTATE=10507

Figure 9.42

The string literal “2006-02-18Z” must be cast to xs:date

9.5.4 SQL16005N

The query in Figure 9.43 references a variable $c that has not been properly introduced. Normally, variables are introduced by assignment in a for or a let clause. Here, the for clause defines the variable $b, which should be used instead of $c in the return clause.

xquery
for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return $c/name;

SQL16005N An XQuery expression references an element name, attribute name, type name, function name, namespace prefix, or variable name "c" that is not defined within the static context. Error QName=err:XPST0008. SQLSTATE=10506

Figure 9.43

The variable $c has not been introduced

Figure 9.44 demonstrates a trickier case. The query tries to return a sequence of name and addr elements, but it lacks parentheses. The expression return ($b/name, $b/addr) is correct and avoids the error. The error message claims that the variable $b is not known. Clearly, $b has been deﬁned in the for clause, so the error is seemingly misleading or even wrong.

262

Chapter 9

Querying XML Data: Advanced Queries & Troubleshooting

xquery
for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return $b/name, $b/addr;

SQL16005N An XQuery expression references an element name, attribute name, type name, function name, namespace prefix, or variable name "b" that is not defined within the static context. Error QName=err:XPST0008. SQLSTATE=10506

Figure 9.44

Missing parentheses in the return clause

But the error message in Figure 9.44 is correct. The comma in the return clause is the XQuery comma operator, which constructs sequences. It has the lowest precedence of all operators. Hence, the XQuery expression in Figure 9.44 defines a sequence of two expressions, which are

• for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo return $b/name
• $b/addr

In the first expression, $b is properly introduced in the for clause. In the second expression, $b is not defined, which causes the error message. If you change the return clause to return ($b/name, $b/addr), the parentheses ensure that the comma operator only applies to $b/name and $b/addr, and both of these expressions refer to $b defined in the for clause. The use of the parentheses here is similar to parentheses in arithmetic, such as 3 * (2 + 3) to evaluate the + operator before the multiplication operator.
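For reference, the corrected form of the query from Figure 9.44, written out in full with the parenthesized return clause described above:

```sql
-- The parentheses keep both path expressions inside the scope of $b.
xquery
for $b in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return ($b/name, $b/addr);
```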

9.5.5 SQL16015N

When you construct elements with a direct element constructor and include a sequence of expressions that provide the child nodes, attributes (if any) must come before elements in this sequence.

xquery
for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo
return <info>{$i/name}{$i/@Cid}</info>;

SQL16015N An element constructor contains an attribute node named "Cid" that follows an XQuery node that is not an attribute node. QName=err:XQTY0024. SQLSTATE=10507

Figure 9.45

Within a constructed element, attributes must be ﬁrst


The error in Figure 9.45 is avoided if you construct the info element as return <info>{$i/@Cid}{$i/name}</info>;

or as return

>-+------------+-------------------------------------------------->
>--copy----$VariableName--:=--CopySourceExpression-+--------------->
>--modify--ModifyExpression---------------------------------------->
>--return--ReturnExpression----------------------------------------|

Figure 12.5

High-level syntax of the transform expression

Such XML modifications can be performed in an SQL UPDATE statement, in a query, or as part of an INSERT statement (Figure 12.6). If you modify a document in a query, the query reads the document from an XML column, changes it on the fly, and returns the modified document to the application. This leaves the original version of the document in the DB2 table unchanged. If you modify a document in an UPDATE statement, you make a permanent change to the data that is stored in DB2. Such an UPDATE is logged in the DB2 transaction log and is subject to all the transaction management concepts that also apply to relational updates, such as commit, rollback, and recovery, when applicable. Concurrency control (locking) and logging happen at the full document level. You can also modify a new document at insert time if you include an XQuery transform expression in an SQL INSERT statement.

Chapter 12

Updating and Transforming XML Documents

[Figure with three panels. Updating a returned document upon retrieval: Modify a document as part of a query. The original document in the database is not changed. Updating a stored document: Make a permanent change to a document in the database. This UPDATE is logged. Updating a new document upon insert: Modify a new document during INSERT. The modified document is inserted and logged.]

Figure 12.6

Three ways of modifying XML documents
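To make the first of the three cases more concrete, the following sketch applies a transform expression in a query instead of an UPDATE statement. It reuses the customer table and info column from this chapter. The choice of removing the phone elements from the returned copy is only an illustration.

```sql
-- Returns a modified copy of the document to the application.
-- The document stored in the customer table is not changed.
SELECT XMLQUERY('
         copy $c := $INFO
         modify do delete $c/customerinfo/phone
         return $c')
FROM customer
WHERE cid = 1003;
```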

The concepts of changing XML element or attribute values, inserting new elements, renaming elements, and so on are independent from whether you do this in an UPDATE statement, in a query, or in an INSERT statement. The following sections describe the capabilities of the XQuery transform expressions and their usage in SQL UPDATE statements. Sections 12.10 and 12.11 then show how the same document modiﬁcations can be performed in queries and INSERT statements.

12.3

UPDATING THE VALUE OF AN XML NODE IN A DOCUMENT

A simple and common kind of XML update is to change the value of a speciﬁc element or attribute node in an XML document.

12.3.1

Replacing an Element Value

As an example, assume you have to update the address of a customer to change the value of the street element to “43 WestCreek”. Figure 12.7 shows the original document on the left and the desired updated document on the right.

Original document: Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document: Jim Noodle 43 WestCreek Markham Ontario N9C 3T6 905-555-7258

Figure 12.7

Changing the value of an element


The UPDATE statement that performs the desired modification of the document is shown in Figure 12.8. It assumes that the document to be updated resides in the info column of the customer table in a row with the relational cid value 1002. The SET clause of the UPDATE statement assigns a new value to the XML column info. This new value is produced by the XMLQUERY function, which contains an XQuery transform expression. The copy clause refers to the original XML column value ($INFO), and assigns the original document to the variable $mycust. Subsequently, the modify clause manipulates this variable. The modify clause contains the update operation replace value of to replace the value of the element street with the new string literal “43 WestCreek”. Finally, the variable $mycust, which contains the modified document, is returned in the return clause of the transform expression.

UPDATE customer
SET info = XMLQUERY('
  transform
  copy $mycust := $INFO
  modify do replace value of $mycust/customerinfo/addr/street with "43 WestCreek"
  return $mycust ')
WHERE cid = 1002

Figure 12.8

Update statement to replace the value of an element

In Figure 12.8 and many other typical update cases, the right side of the copy clause is just the variable that refers to the original document, in this case $INFO. The right side of the copy clause could be a more complex expression, but it must always evaluate to a single node. It cannot be an empty sequence or a sequence of more than one item. This single node can have descendants, which means it can be (and often is) the root of a full XML document. In many update examples you will also see that the return clause simply returns the variable that holds the modiﬁed document. However, the return clause could contain a more complex expression, including element construction or a FLWOR expression. Updates with more complex expressions in the copy and the return clauses are discussed in section 12.10. Since the transform keyword is optional, it is omitted from here on.

12.3.2

Replacing an Attribute Value

Replacing an attribute value is just as easy as replacing an element value. The UPDATE statement in Figure 12.9 changes the Cid attribute to the new value 1099. The entire UPDATE statement is the same as in Figure 12.8 except that the path to the target node and the new value are different. The literal value 1099 could be in double quotes but does not have to be because it can be interpreted as a number.


UPDATE customer
SET info = XMLQUERY('
  copy $mycust := $INFO
  modify do replace value of $mycust/customerinfo/@Cid with 1099
  return $mycust ')
WHERE cid = 1002

Figure 12.9

Replacing the value of an attribute

12.3.3

Replacing a Value Using a Parameter Marker

Often you will want to prepare and compile an UPDATE statement only once, and then pass in a new value every time you execute it. This avoids recompiling the statement in the database server for each execution. The mechanism to use parameters is the same as for SQL/XML queries. The PASSING clause of the XMLQUERY function allows you to pass a SQL-style parameter marker (“?”) as a variable ($z) into the XQuery expression (Figure 12.10). Note that XQuery variables are case-sensitive. For example, $z and $Z are not the same. The statement in Figure 12.10 also uses a parameter marker in the WHERE clause to select the row to be updated.

UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify do replace value of $newinfo/customerinfo/phone with $z
  return $newinfo'
  PASSING CAST(? AS VARCHAR(15)) AS "z")
WHERE cid = ?

Figure 12.10

Updating XML values with parameter markers

You can run the UPDATE statement in Figure 12.10 from an application, such as a Java program. You would use JDBC statements to prepare and compile the statement, bind a value from an application variable to the parameter marker, and then execute the statement.

12.3.4

Replacing Multiple Values in a Document

You can update multiple values in the same document in a single UPDATE statement. Figure 12.11 illustrates that the modify clause allows for a comma-separated list of update operations. The entire list is enclosed in parentheses. This enables you to easily combine two or more update operations in a single statement.


UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify (do replace value of $newinfo/customerinfo/addr/street with "85 Leicester Rd",
          do replace value of $newinfo/customerinfo/addr/pcode-zip with "W7B 8X1")
  return $newinfo ')
WHERE cid = 1002

Original document: Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document: Jim Noodle 85 Leicester Rd Markham Ontario W7B 8X1 905-555-7258

Figure 12.11

Updating multiple values in a single UPDATE statement

If you want to update multiple values in a single UPDATE statement and use parameter markers for all values, the PASSING clause of the XMLQUERY function needs to contain a list of typed parameter markers together with the variable names that refer to them (see Figure 12.12).

UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify (do replace value of $newinfo/customerinfo/addr/street with $str,
          do replace value of $newinfo/customerinfo/addr/pcode-zip with $zip)
  return $newinfo'
  PASSING CAST(? AS VARCHAR(30)) AS "str",
          CAST(? AS VARCHAR(10)) AS "zip")
WHERE cid = 1002

Figure 12.12

Updating multiple values with parameter markers

12.3.5

Replacing an Existing Value with a Computed Value

The value that you use to update an existing element or attribute does not necessarily have to be a fixed value but can be computed based on the existing values in the document. For example, assume that the customer documents can contain an element numorders that tracks the total number of orders that a customer has placed. The UPDATE statement in Figure 12.13 increments the value of the element numorders by 1.

UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify do replace value of $newinfo/customerinfo/numorders
         with $newinfo/customerinfo/numorders + 1
  return $newinfo ')
WHERE cid = 1002

Original document: Jim Noodle 25 EastCreek Markham Ontario N9C 3T6, numorders 16

Updated document: Jim Noodle 25 EastCreek Markham Ontario N9C 3T6, numorders 17

Figure 12.13

Incrementing the numeric value of an element

Similarly, the UPDATE statement in Figure 12.14 modifies the value of the element street by appending an apartment number. It uses the XQuery function concat.

UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify do replace value of $newinfo/customerinfo/addr/street
         with concat($newinfo/customerinfo/addr/street, " Apt #4")
  return $newinfo ')
WHERE cid = 1002

Figure 12.14

Appending an apartment number to the street

If you write more elaborate updates, you might find it tedious to repeat a long path such as $newinfo/customerinfo/addr/street whenever you reference an existing node in the document. Figure 12.15 uses a let clause to assign this long path to the variable $s. Subsequently, the do replace value clause uses $s multiple times instead of repeating the long path. Note that the modify clause contains a FLWOR expression that only consists of the let and the return clause while the for, where, and order by clauses are omitted. Hence, the XQuery expression in Figure 12.15 contains two return clauses. The first one belongs to the let and its FLWOR expression, and the second one is the return of the transform expression.

UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify let $s := $newinfo/customerinfo/addr/street
         return do replace value of $s with concat($s, " Apt #4")
  return $newinfo ')
WHERE cid = 1002

Figure 12.15

Using let to assign a long path to a short variable

12.4

REPLACING XML NODES IN A DOCUMENT

Suppose a customer has moved to a different city and you need to update the address in the XML document that holds the customer’s information. You could write an UPDATE statement with replace value of expressions to individually change the values of all elements and attributes that make up the address of the customer (country, street, city, prov-state, and pcode-zip). However, such an update can be lengthy and tedious to write. It can be a lot easier to simply replace the existing addr element and all of its children with a new addr element. Such a replacement of a node is done with a replace expression. The replace expression works differently from the replace value of expression. The former replaces the whole node (the old node is deleted), whereas the latter replaces only the value of the target node. Figure 12.16 shows an UPDATE statement that replaces the existing addr element and all of its child nodes with a new addr fragment. The structure of the new XML fragment does not have to be identical to the original one. Indeed, the new address in Figure 12.16 contains the elements state and zipcode, which are different from the original address. Similarly, you could decide to replace the original addr element and all of its children with a single email element, if you wanted to. If you choose to validate updated documents with an XML Schema, the new structure of the document has to conform to the XML Schema.


UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify do replace $newinfo/customerinfo/addr
         with 555 Bailey Avenue San Jose California 95141
  return $newinfo ')
WHERE cid = 1002

Original document: Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document: Jim Noodle 555 Bailey Avenue San Jose California 95141 905-555-7258

Figure 12.16

Replacing an element node

Note that the new addr fragment in the modify clause of the UPDATE statement in Figure 12.16 is not enclosed in single quotes because it is not a string value. Instead, the new addr element and its children are constructed with direct element and attribute constructors (see section 8.4, Constructing XML Data). The XML value that provides the new address can also be computed with an expression. For example, Figure 12.17 uses an XPath expression to obtain the addr element from the customer whose Cid attribute has the value 1004. This address element replaces the address of customer 1002.

UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify do replace $newinfo/customerinfo/addr
         with db2-fn:xmlcolumn("CUSTOMER.INFO")/customerinfo[@Cid=1004]/addr
  return $newinfo ')
WHERE cid = 1002

Figure 12.17

Updating multiple values in a single UPDATE statement

12.5

DELETING XML NODES FROM A DOCUMENT

This section describes how to delete elements or attributes from a document. As an example, suppose that a phone number of a customer is invalid and you want to remove the entire phone element from the corresponding XML document. Figure 12.18 shows a first attempt at writing an appropriate UPDATE statement. It looks much like the previous UPDATE statements except that the updating expression is delete instead of replace value of. In the delete expression, simply specify the path to the elements or attributes that you want to remove from the document.

UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify do delete $newinfo/customerinfo/phone
  return $newinfo')
WHERE cid = 1003

Original document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Updated document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8

Figure 12.18

Deleting an element

The document that is being updated in Figure 12.18 contains multiple phone elements, and the delete expression removes all of them. If you don’t want to delete all occurrences of a repeating element, add a predicate to the target path to delete only selected occurrences. For example, the following delete expression removes a phone element only if its type attribute has the value home:

do delete $newinfo/customerinfo/phone[@type="home"]

This delete expression removes exactly one phone element from the original document in Figure 12.18, and leaves the other two phone elements untouched. In general, this expression can delete zero, one, or multiple phone elements from a document, depending on how many phone elements with type equal to home occur in a given document. Modifying repeating elements is further discussed in section 12.8.


NOTE: Predicates in the update expression only serve to select nodes within any given document. They do not help you to efficiently find the documents that should be updated. Predicates that select documents for update must be placed in the WHERE clause of the SQL UPDATE statement. They can include XMLEXISTS predicates.
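The division of labor described in this note can be sketched as follows. The XMLEXISTS predicate in the WHERE clause selects the rows whose documents contain a home phone, and the predicate inside the delete expression selects the nodes within each of those documents. The statement is an illustration built from the examples in this section, not one of the book's figures.

```sql
-- WHERE clause: selects the documents (rows) to update.
-- Predicate in the delete expression: selects the nodes within each document.
UPDATE customer
SET info = XMLQUERY('
      copy $newinfo := $INFO
      modify do delete $newinfo/customerinfo/phone[@type="home"]
      return $newinfo')
WHERE XMLEXISTS('$INFO/customerinfo/phone[@type="home"]');
```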

If you want to delete an attribute, such as country, simply use a delete expression with an XPath that points to the attribute:

do delete $newinfo/customerinfo/addr/@country

You can also remove an entire XML fragment from an XML document. For example, the statement in Figure 12.19 deletes the entire addr element including all the child elements and attributes it contains.

UPDATE customer
SET info = XMLQUERY('
  copy $newinfo := $INFO
  modify do delete $newinfo/customerinfo/addr
  return $newinfo')
WHERE cid = 1002

Original document: Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document: Jim Noodle 905-555-7258

Figure 12.19

Deleting an XML fragment

12.6

RENAMING ELEMENTS OR ATTRIBUTES IN A DOCUMENT

The rename expression enables you to change the name of an element or attribute. For example, the statement in Figure 12.20 renames the addr element to address. The new element name address is a string literal and must be enclosed in double quotes.


UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify do rename $new/customerinfo/addr as "address"
  return $new ')
WHERE cid = 1002

Original document: Jim Noodle <addr country="Canada"> 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document: Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Figure 12.20

Changing an element name

DB2 never allows you to update a document in a manner that violates the rules for well-formed XML documents. For example, in an element that has both an xid attribute and a yid attribute, you cannot rename the attribute xid to yid. This update operation is rejected because it would produce an element with two attributes that have the same name (yid), which is not permitted in any XML document.
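Renaming an attribute follows the same pattern as renaming an element: the target path simply ends in an attribute step. The sketch below is an illustration only, and the new attribute name "nation" is a hypothetical choice that does not appear in the book's examples.

```sql
-- Renames the country attribute of addr; "nation" is a hypothetical name.
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do rename $new/customerinfo/addr/@country as "nation"
      return $new')
WHERE cid = 1002;
```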

12.7

INSERTING XML NODES INTO A DOCUMENT

This section describes how to add element or attribute nodes to a document. When you insert a new element or attribute into a document, you must specify the target position of the new node in the document. We ﬁrst discuss the positioning of inserted elements, then the positioning of inserted attributes, and then look at several examples.

12.7.1

Deﬁning the Position of Inserted Elements

Suppose you want to insert the new element <email>[email protected]</email> into the XML document for customer Jim Noodle. You have to decide which existing element is going to be the parent for the new email element. For example, you might decide that email is going to be a child element of the root element customerinfo. This makes email a sibling of the elements name, addr, and phone. Then you can further choose the position of the email element among its siblings. For example, should email appear before or after the addr element? Alternatively, you could decide that email is going to be a child element of addr and therefore becomes a sibling of street, city, prov-state, and pcode-zip. The insert operation in the modify clause allows you to add new nodes to an XML document. It offers five ways to specify the position of the new node: into, as last into, as first into, after, and before. Examples of using these five options for a new element are listed in Table 12.1.

Table 12.1 Five Options for Inserting an Element into a Document

Insert Operation | Position of the Inserted Node
insert <email>[email protected]</email> into $new/customerinfo | email becomes a child element of customerinfo. The position of email among the existing children of customerinfo is nondeterministic.
insert <email>[email protected]</email> as last into $new/customerinfo | email becomes the last child element of customerinfo.
insert <email>[email protected]</email> as first into $new/customerinfo | email becomes the first child element of customerinfo.
insert <email>[email protected]</email> after $new/customerinfo/addr | email becomes a sibling of addr and therefore a child of customerinfo. email appears immediately after addr.
insert <email>[email protected]</email> before $new/customerinfo/addr | email becomes a sibling of addr and a child of customerinfo. email appears immediately before addr.

The path that deﬁnes the target location of the insert, such as $new/customerinfo or $new/customerinfo/addr, has to produce exactly one node. If the path does not exist in the document or if it exists more than once, the operation fails with error SQL16085N. If you look up the explanation for SQL16085N you ﬁnd that a common reason is described as “the target node of an insert expression is not a single element node or document node.” Beware that the words “not a single element node” do not necessarily imply that more than one target node was found. It’s equally possible that no target node was found. “Not a single element” means that either zero or more than one node was found, so you should check for both cases when you encounter error SQL16085N. For example, if you misspell a tag name in the target path, error SQL16085N is raised because no target node was found.
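When SQL16085N appears, a quick diagnostic is to count how many nodes the target path selects in the affected document. The query below is a sketch. The path /customerinfo/addr and the cid value 1003 are placeholders for whatever target path and row your failing statement uses.

```sql
-- target_count = 0 : the path matches nothing (for example, a misspelled tag name)
-- target_count > 1 : the path matches a repeating node
SELECT cid,
       XMLCAST(XMLQUERY('count($INFO/customerinfo/addr)') AS INTEGER) AS target_count
FROM customer
WHERE cid = 1003;
```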

12.7.2

Deﬁning the Position of Inserted Attributes

To insert a new attribute instead of an element, you have to use a computed attribute constructor. It consists of the keyword attribute followed by the attribute name and an expression or constant that provides the attribute value. The same five insert options are available as for elements and are shown in Table 12.2. The difference for attributes is that the operations into $new/customerinfo, as last into $new/customerinfo, and as first into $new/customerinfo all have the same effect. Their effect is that the new attribute becomes an attribute of the element customerinfo. Since the XML data model does not define a positional order among the attributes of an element, attributes are always unordered. Therefore the keywords last, first, before, and after do not affect the position of attributes. If you insert an attribute before or after $new/customerinfo/addr, the attribute becomes a sibling of addr and is therefore added to the parent of addr, which is customerinfo.

Table 12.2 Five Options for Inserting an Attribute into a Document

Insert Operation | Position of the Inserted Node
insert attribute email {"[email protected]"} into $new/customerinfo | In all three "into" cases, email becomes an attribute of customerinfo. The position of email among the existing attributes is undefined because attributes are not ordered.
insert attribute email {"[email protected]"} as last into $new/customerinfo | (same as above)
insert attribute email {"[email protected]"} as first into $new/customerinfo | (same as above)
insert attribute email {"[email protected]"} after $new/customerinfo/addr | In both the "after" and "before" cases, email becomes an attribute of the parent of addr, which is customerinfo.
insert attribute email {"[email protected]"} before $new/customerinfo/addr | (same as above)

12.7.3

Insert Examples

For the following examples, assume that an email element has to be inserted into the XML document for Robert Shoemaker. This document is identified by the relational cid value 1003. Figure 12.21 shows a first attempt at performing this update. The UPDATE statement fails with error message SQL20345N because the target path is specified as $new instead of $new/customerinfo. When the target path is $new, the email element is inserted as a sibling and not as a child of the customerinfo element. The result is a sequence of two elements (customerinfo, email), which is not a well-formed XML document. Since XML columns can only contain well-formed documents, the update fails. It fails for the same reason if you specify before $new/customerinfo or after $new/customerinfo as the target position.


UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify do insert <email>[email protected]</email> as last into $new
  return $new')
WHERE cid = 1003

SQL20345N The XML value is not a well-formed document with a single root element. SQLSTATE=2200L

Original document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Rejected XML value: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743 [email protected]

Figure 12.21

Cannot insert an element as a sibling of the root element

Figure 12.22 shows the corrected UPDATE statement and the correctly modified XML document. You could similarly insert the email element as first into $new/customerinfo.

UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify do insert <email>[email protected]</email> as last into $new/customerinfo
  return $new')
WHERE cid = 1003

Original document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Updated document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743 [email protected]

Figure 12.22

Inserting a new element as the last element


If you want the email element to appear in the document before the phone elements, you can explicitly request it to be inserted before the first occurrence of any existing phone elements using the positional predicate [1]. This is shown in Figure 12.23 where the positional predicate selects exactly one phone element as the target location. If you omit the positional predicate, the UPDATE statement fails with error SQL16085N. The statement in Figure 12.23 would also fail if the document contained no phone elements.

UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify do insert <email>[email protected]</email> before $new/customerinfo/phone[1]
  return $new')
WHERE cid = 1003

Original document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Updated document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 [email protected] 905-555-7258 416-555-2937 905-555-8743

Figure 12.23

Inserting a new element before an existing element

If you want to insert the email element after the last phone element but before any other elements that might appear at the end of the document, specify the insert position to be after $new/customerinfo/phone[last()]. As another example, Figure 12.24 shows an UPDATE statement that inserts the new email element as the first child of the addr element. Alternatively, the UPDATE statement in Figure 12.25 inserts the email address as an attribute of the addr element. In the updated document, the attribute email happens to appear before the attribute country. But this order is not relevant and not guaranteed because XML attributes have no defined order. If you change the target position of the inserted attribute to after $new/customerinfo/addr/city or before $new/customerinfo/addr/@country, the updated document is still the same as shown in Figure 12.25.
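The first case mentioned in this paragraph, inserting after the last phone element, has no figure of its own. A minimal sketch of such a statement follows. The address user@example.com is a placeholder rather than a value from the book, and the statement still fails with SQL16085N if the document contains no phone element at all.

```sql
-- phone[last()] selects exactly one phone element when at least one exists.
UPDATE customer
SET info = XMLQUERY('
      copy $new := $INFO
      modify do insert <email>user@example.com</email>
                after $new/customerinfo/phone[last()]
      return $new')
WHERE cid = 1003;
```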


UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify do insert <email>[email protected]</email> as first into $new/customerinfo/addr
  return $new')
WHERE cid = 1003

Original document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Updated document: Robert Shoemaker [email protected] 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Figure 12.24

Inserting a new element as the ﬁrst child element of a target node

UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify do insert attribute email {"[email protected]"} into $new/customerinfo/addr
  return $new')
WHERE cid = 1003

Original document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Updated document: Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Figure 12.25

Inserting an attribute

12.8

HANDLING REPEATING AND MISSING NODES

If a single XPath expression identiﬁes multiple nodes in a single document, they are called repeating nodes. In previous sections you saw that the XML document for Robert Shoemaker contains multiple phone elements. Hence, the element phone is a repeating element and the path /customerinfo/phone produces a sequence of more than one element node.


As defined by the XQuery Update standard, the delete expression is the only update operation that can directly process multiple occurrences of a node. It simply deletes all of them, as you saw in section 12.5. All other update expressions (replace, replace value of, rename, and insert) require special attention when dealing with repeating nodes. The same applies to missing nodes. If you try to delete an element or attribute that does not exist, the delete expression performs no action and returns successfully. However, all other update expressions fail when they try to modify an element or attribute that does not exist in the target document. The UPDATE statement in Figure 12.26 tries to change the value of a phone element but fails. At runtime, DB2 detects that there is more than one phone element in the target document and returns error SQL16085N. You can type “? SQL16085N” at the DB2 command prompt to find that the explanation for reason code XUTY0008 is that “the target node of a replace expression is not a single node”. This reason code indicates that the target path $new/customerinfo/phone has either produced multiple phone elements or none. However, it must produce exactly one node for the update to be successful. The error prevents you from updating multiple phone elements with the same number, which would not make sense. If no phone element exists, the error ensures that you are not led to believe that the new phone number was successfully written to the document.

UPDATE customer
SET info = XMLQUERY('
  copy $new := $INFO
  modify do replace value of $new/customerinfo/phone with "123-456-7890"
  return $new ')
WHERE cid = 1003

SQL16085N The target node of an XQuery "replace value of" expression is not valid. Error QName=err:XUTY0008. SQLSTATE=10703.

Figure 12.26

Trying to replace the value of a repeating element

If you know that there are multiple phone elements, a common way to avoid error SQL16085N is to add a predicate to the target path to select exactly one phone element for update. As an example, Figure 12.27 uses the predicate [@type="cell"] to only update the cell phone number.

UPDATE customer
SET info = XMLQUERY('
   copy $new := $INFO
   modify do replace value of $new/customerinfo/phone[@type="cell"]
             with "123-456-7890"
   return $new ')
WHERE cid = 1003

Figure 12.27

Replacing one of multiple occurrences of an element


Chapter 12

Updating and Transforming XML Documents

Using the predicate in Figure 12.27 works well if every possible target document contains exactly one phone element with a type attribute equal to cell. However, if a document does not contain a cell phone element, the UPDATE statement in Figure 12.27 still fails with error SQL16085N. In that case, another option is to use the XQuery if-then-else expression, as shown in Figure 12.28. If a cell phone element exists, its value is replaced with the new value; otherwise a new cell phone element with the new number is inserted. This implements an “upsert” operation.

UPDATE customer
SET info = XMLQUERY('
   copy $new := $INFO
   modify
     if ($new/customerinfo/phone[@type="cell"])
     then do replace value of $new/customerinfo/phone[@type="cell"]
             with "123-456-7890"
     else do insert <phone type="cell">123-456-7890</phone>
             as last into $new/customerinfo
   return $new ')
WHERE cid = 1001

Figure 12.28

Conditional update and insert of an element

The most resilient solution for handling both repeating and missing elements is a FLWOR expression in the modify clause (see Figure 12.29). The for clause iterates over the target elements one at a time, so that the replace value of expression in the return clause is always applied to exactly one element. If you remove the condition where $j/@type = "cell", all phone elements are updated with the same number "123-456-7890", regardless of their type. If a document does not contain a cell phone, or contains no phone elements at all, the return clause of the FLWOR expression is never invoked, so the replace value of expression never fails due to a missing node.

In summary, the FLWOR expression in the modify clause enables an UPDATE statement to

• Modify multiple or all occurrences of a repeating node (without warning)
• Add predicates to select which occurrences of a repeating node to modify
• Silently proceed and return successfully even if a target node is not found
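The variation described above, in which the where clause is dropped from the FLWOR expression of Figure 12.29, might look like the following sketch. It keeps the table, column, and cid value of the figure and is meant only as an illustration of the "modify all occurrences" case, not as a statement taken from the original text.

```sql
-- Sketch: update every phone element, regardless of its type attribute.
-- The for clause hands one phone element at a time to "replace value of",
-- so the target is always exactly one node. If the document has no phone
-- elements, the return clause is never invoked and nothing is changed.
UPDATE customer
SET info = XMLQUERY('
   copy $new := $INFO
   modify for $j in $new/customerinfo/phone
          return do replace value of $j with "123-456-7890"
   return $new')
WHERE cid = 1000
```

Because no warning is issued, it is worth checking beforehand that overwriting all occurrences with the same value is really what you intend.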


UPDATE customer
SET info = XMLQUERY('
   copy $new := $INFO
   modify for $j in $new/customerinfo/phone
          where $j/@type = "cell"
          return do replace value of $j with "123-456-7890"
   return $new')
WHERE cid = 1000

Original document

Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 905-555-8743

Updated document

Robert Shoemaker 845 Kean Street Aurora Ontario N8X 7F8 905-555-7258 416-555-2937 123-456-7890

Figure 12.29

Iterating over the occurrences of a repeating element

12.9

MODIFYING MULTIPLE XML NODES IN THE SAME DOCUMENT

You can have multiple update operations for the same document in the modify clause of a single UPDATE statement. However, you cannot rename, replace, or update the value of the same node more than once. In this section we discuss examples where multiple combined update operations are or are not in conﬂict with each other.

12.9.1

Snapshot Semantics and Conﬂict Situations

The XQuery Update standard defines that all update operations in the modify clause are applied independently of each other to the original document. They don’t see each other’s effects. This is called snapshot semantics, which means that each update operation is logically applied to a separate snapshot of the original document.

As an example, let’s look at the UPDATE statement in Figure 12.30, which contains two updating expressions in the modify clause, separated by a comma. The first expression inserts an additional phone element. The second expression deletes all phone elements. The obvious question is whether the newly inserted phone element is instantly removed by the delete expression, and whether that depends on the order in which the insert and the delete operations appear in the modify clause.

As it turns out, the new phone element is not affected by the delete expression, irrespective of the order in which the operations appear in the modify clause. Due to snapshot semantics, both the insert and the delete expressions in Figure 12.30 are independently applied to a snapshot of the original document. Therefore the delete expression does not see the newly inserted phone element and only removes the old phone elements that existed in the document prior to this update. Hence, there is no conflict between the insert and the delete expression in Figure 12.30.

UPDATE customer
SET info = XMLQUERY('
   copy $new := $INFO
   modify( do insert <phone>777-555-3333</phone>
              after $new/customerinfo/addr ,
           do delete $new/customerinfo/phone )
   return $new ')
WHERE cid = 1002

Original document Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 777-555-3333

Figure 12.30

Combining an insert and a delete operation
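To make the point about ordering concrete, the following sketch lists the same two operations as Figure 12.30 in reverse order. Under the snapshot semantics described above it would be expected to produce the same updated document. The exact markup of the inserted phone element is an assumption for illustration.

```sql
-- Sketch: the delete is listed first, the insert second.
-- Both expressions are evaluated against the original document, so the
-- delete removes only the phone elements that existed before the update
-- and the newly inserted phone element survives.
UPDATE customer
SET info = XMLQUERY('
   copy $new := $INFO
   modify( do delete $new/customerinfo/phone ,
           do insert <phone>777-555-3333</phone>
              after $new/customerinfo/addr )
   return $new ')
WHERE cid = 1002
```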

For comparison, let’s look at a different combination of an insert and a delete expression in Figure 12.31. One of the expressions deletes the addr element, and the other expression inserts a new POBox element into the addr element. Again, the order of the two operations in the modify clause is irrelevant. Nevertheless, the two operations conﬂict with each other because the delete expression removes the parent element (addr) of the newly inserted POBox element. For this case, the language standard deﬁnes that delete “wins” over insert and the updated document has no addr or POBox elements. Be aware of these effects when you code complex updates.


UPDATE customer
SET info = XMLQUERY('
   copy $new := $INFO
   modify( do delete $new/customerinfo/addr ,
           do insert <POBox>15</POBox> into $new/customerinfo/addr )
   return $new ')
WHERE cid = 1002

Original document Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

Updated document Jim Noodle 777-555-3333

Figure 12.31

A different combination of an insert and a delete operation

12.9.2

Converting Elements to Attributes and Vice Versa

The UPDATE statement in Figure 12.32 is another interesting example. It combines two insert expressions and two delete expressions in a single statement. The objective is to turn the existing Cid attribute into an element called customerid, and the existing element name into an attribute called custname. Four update operations are required to make this happen:

• Insert a customerid element and compute its value from the existing Cid attribute
• Insert a custname attribute and take its value from the existing name element
• Delete the existing Cid attribute
• Delete the existing name element

Again, the order of these four expressions in the modify clause does not matter. Snapshot semantics ensures that the four expressions are applied in isolation and produce the intended result. In particular, the insert expressions see their own logical snapshots of the original document, which enables them to read the Cid attribute and the name element even though these nodes are being deleted at the same time.


UPDATE customer
SET info = XMLQUERY('
   copy $new := $INFO
   modify( do insert <customerid>{$new/customerinfo/data(@Cid)}</customerid>
              as first into $new/customerinfo ,
           do insert attribute custname {$new/customerinfo/name}
              into $new/customerinfo ,
           do delete $new/customerinfo/@Cid ,
           do delete $new/customerinfo/name )
   return $new')
WHERE cid = 1002

Document before the update

Document after the update

Jim Noodle 25 EastCreek Markham Ontario N9C 3T6 905-555-7258

" 2001,"" 2002,""

Figure 17.23

Schema identiﬁers in the delimited format input ﬁle

The input file in Figure 17.23 tells DB2 to use the XML Schema CUSTXSD1 to validate the XML documents contained in files data2.xml and data4.xml, and the schema CUSTXSD2 to validate the XML document data3.xml. Additionally you need to include the XMLVALIDATE USING XDS clause in the IMPORT or LOAD command (see Figure 17.24). Otherwise the SCH attributes in the input are ignored and no validation is performed.

IMPORT FROM c:\xml\load_customer.txt OF DEL
   XML FROM c:\xml
   XMLVALIDATE USING XDS
   INSERT INTO customer

Figure 17.24

Performing XML Schema validation during IMPORT with multiple schemas


Chapter 17

Validating XML Documents against XML Schemas

What happens if the delimited format input ﬁle contains schema references (SCH attributes) but you use the XMLVALIDATE USING SCHEMA clause in the LOAD or IMPORT command? In this case the XML Schema speciﬁed in the XMLVALIDATE USING SCHEMA clause takes precedence, all documents are validated against that one schema, and the SCH attributes in the input ﬁle are ignored. For a large number of documents you normally don’t create the delimited format input ﬁle manually—you may have an application or script that creates it for you. Also, note that DB2’s EXPORT utility can export tables (or subsets of a table deﬁned by a query) to the ﬁle system. When you export XML data, the EXPORT utility automatically generates a delimited format ﬁle and optionally includes SCH attributes with schema identiﬁers for all documents that have been validated. Samples of the output produced by EXPORT are shown in Figure 17.23, Figure 17.25, and Figure 17.27.
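As a sketch of the case discussed at the start of this passage, the following IMPORT command names a single schema in the XMLVALIDATE USING SCHEMA clause. The file and path names are reused from the other figures in this section, and the schema name db2admin.custxsd1 is chosen only for illustration.

```sql
-- Sketch: validate every imported document against one schema.
-- Any SCH attributes in the delimited input file are ignored because the
-- schema named in the USING SCHEMA clause takes precedence.
IMPORT FROM c:\xml\load_customer.txt OF DEL
   XML FROM c:\xmldata
   XMLVALIDATE USING SCHEMA db2admin.custxsd1
   INSERT INTO customer
```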

17.7.3

Using a Default XML Schema

When schema references are included in the delimited format input file, it is possible that not every XDS has a SCH attribute (see Figure 17.25). In this case, the LOAD and IMPORT commands allow you to specify a default schema for those records that do not have a SCH attribute in the input file.

2000,"" 2001,"" 2002,""

Figure 17.25

Schema identiﬁers in the delimited format input ﬁle

The IMPORT command in Figure 17.26 contains the DEFAULT option in the XMLVALIDATE USING XDS clause to indicate that any input documents that don’t have a schema reference in the XDS must be validated against the schema custxsd1.

IMPORT FROM c:\xml\load_customer.txt OF DEL
   XML FROM c:\xmldata
   XMLVALIDATE USING XDS DEFAULT db2admin.custxsd1
   INSERT INTO customer

Figure 17.26

Specifying a default schema for validation
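Since the text states that both LOAD and IMPORT accept a default schema, a LOAD counterpart to Figure 17.26 could be sketched as follows. The file, path, and schema names are simply carried over from the IMPORT example, and any further LOAD options your environment requires are omitted here.

```sql
-- Sketch: LOAD with a default schema for records whose XDS has no SCH attribute.
-- Records that do carry a SCH attribute are validated against the schema it names.
LOAD FROM c:\xml\load_customer.txt OF DEL
   XML FROM c:\xmldata
   XMLVALIDATE USING XDS DEFAULT db2admin.custxsd1
   INSERT INTO customer
```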

Note that the DEFAULT clause takes precedence over the IGNORE and MAP clauses (discussed in the next sections).

17.7.4

Overriding XML Schema References

Assume you need to import XML data using the delimited format input file in Figure 17.27. This input file contains references to XML Schemas custxsd1, custxsd2, and custxsd3.

17.7

Validation during Load and Import Operations

2000,"" />" />"

Figure 17.27

Schema identifiers in the delimited format input file

Let’s say you only want to validate the documents that reference schema custxsd1, but not the documents that reference custxsd2 or custxsd3. One reason could be that you received the input data but you only have schema custxsd1 and not the other two. Another reason could be that the documents for schemas custxsd2 and custxsd3 are already known to be valid and you want to save the CPU cycles of validating them again. In such cases you can add the IGNORE keyword with a list of schema identifiers to the XMLVALIDATE USING XDS clause. An example is shown in Figure 17.28. It tells DB2 to perform validation based on the schemas specified in the SCH attributes, but not to validate any documents that reference any of the schemas listed in the IGNORE clause.

IMPORT FROM c:\xml\tab.txt OF DEL
   XML FROM c:\xmldata
   XMLVALIDATE USING XDS
      IGNORE (db2admin.custxsd2, db2admin.custxsd3)
   INSERT INTO customer

Figure 17.28

Disabling validation for selected XML Schemas

Instead of ignoring certain XML Schemas you can also override them with a different schema. The MAP clause allows you to specify alternate XML Schemas to use in place of those specified by the SCH attributes in the delimited format input file. The MAP clause specifies a list of one or more XML Schema pairs, where each pair represents a mapping from one XML Schema to another. The first XML Schema in the pair represents a schema that is referenced by an SCH attribute in an XDS. The second XML Schema in the pair represents the schema that should be used to perform validation. An example is shown in Figure 17.29, where the IMPORT command uses the schema custxsd1 whenever it sees schema custxsd2 or custxsd3 in an SCH attribute in the input file.

IMPORT FROM c:\xml\tab.txt OF DEL
   XML FROM c:\xmldata
   XMLVALIDATE USING XDS
      MAP ((db2admin.custxsd2, db2admin.custxsd1),
           (db2admin.custxsd3, db2admin.custxsd1))
   INSERT INTO customer

Figure 17.29

Import with validation against “mapped” XML Schemas


The following usage rules apply:

• If an XML Schema is present in the left side of a schema pair in the MAP clause, it cannot also be specified in the IGNORE clause.
• If an XML Schema is present in the right side of a schema pair in the MAP clause, it will not be subsequently ignored if listed in the IGNORE clause.
• An XML Schema cannot be mapped more than once. It cannot appear on the left side of more than one schema pair.
• Schema mappings in the MAP clause are non-transitive. For example, assume schema custxsd3 is mapped to schema custxsd2, and assume a second pair maps schema custxsd2 to schema custxsd1; then schema custxsd1 will not be used instead of schema custxsd3.
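The non-transitivity rule can be illustrated with a MAP clause that contains exactly the two pairs from the example above. This is only a sketch that reuses the file and schema names of Figure 17.29.

```sql
-- Sketch: two mappings that look like a chain but are applied independently.
--   Documents whose SCH attribute names custxsd3 are validated against custxsd2.
--   Documents whose SCH attribute names custxsd2 are validated against custxsd1.
-- custxsd1 is NOT used for documents that reference custxsd3.
IMPORT FROM c:\xml\tab.txt OF DEL
   XML FROM c:\xmldata
   XMLVALIDATE USING XDS
      MAP ((db2admin.custxsd3, db2admin.custxsd2),
           (db2admin.custxsd2, db2admin.custxsd1))
   INSERT INTO customer
```

If the intent is to validate both groups against custxsd1, each schema has to be mapped to custxsd1 directly, as in Figure 17.29.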

17.7.5

Validation Based on schemaLocation Attributes

The IMPORT command in Figure 17.30 contains the clause XMLVALIDATE USING SCHEMALOCATION HINTS. This clause indicates that each XML document in the input file is to be validated against the XML Schema that is referenced by the optional xsi:schemaLocation attribute within the document. An xsi:schemaLocation attribute, which is also called a schema location hint, contains a pair of target namespace and schema location. This pair can identify an XML Schema that you have previously registered in the XML Schema Repository. Earlier in this chapter, Figure 17.2 showed an XML document with an xsi:schemaLocation attribute.

IMPORT FROM c:\xml\load_customer.txt OF DEL
   XML FROM c:\xmldata
   XMLVALIDATE USING SCHEMALOCATION HINTS
   INSERT INTO customer

Figure 17.30

Validation with schema location hints

17.8

CHECKING WHETHER AN EXISTING DOCUMENT HAS BEEN VALIDATED

DB2 allows you to check whether an XML document that is stored in a table has previously been validated. This can be done in a couple of ways. In DB2 for Linux, UNIX, and Windows you can use the IS VALIDATED predicate, which works similarly to the IS NULL predicate that you might already be familiar with. The query in Figure 17.31 checks every XML document in the info column of the customer table and returns YES if the document has been validated, and NO otherwise.


SELECT id,
       CASE WHEN info IS VALIDATED THEN 'YES'
            ELSE 'NO'
       END AS isvalid
FROM customer

Figure 17.31

Checking which documents in a table have been validated

The query in Figure 17.32 is very similar but uses a WHERE clause with an XMLEXISTS predicate to check the validation status only of the document(s) where the customer name is Matt Foreman.

SELECT CASE WHEN info IS VALIDATED THEN 'YES'
            ELSE 'NO'
       END AS isvalid
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[name = "Matt Foreman"]')

Figure 17.32

Checking whether a speciﬁc document has been validated
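The same predicate can also be used in a WHERE clause to get a quick overview. The following sketch, which applies to DB2 for Linux, UNIX, and Windows only, counts the documents that still lack validation; the result column name is arbitrary.

```sql
-- Sketch: count the documents in the info column that have not been validated.
SELECT COUNT(*) AS not_validated
FROM customer
WHERE info IS NOT VALIDATED
```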

To perform similar checks in DB2 for z/OS you need to maintain an additional column in your user table. The column can contain 0 or 1 to indicate whether the document has been validated. Alternatively you can store the OBJECTID of the XML Schema in a BIGINT column. Then you can easily query this column to determine which schema a given XML document belongs to.

17.9

VALIDATING EXISTING DOCUMENTS IN A TABLE

You might encounter a situation where you already have XML documents stored in an XML column and want to validate them against an XML Schema. Maybe they were never validated and you want to validate them now. Or, maybe they had been validated when they were inserted, but now you want to validate them against a new schema. Either way, the validation of existing documents can be achieved with SELECT or UPDATE statements. Let’s look at the update process ﬁrst. Figure 17.33 shows an UPDATE statement that replaces a document with a validated copy of itself. The WHERE clause uses a relational predicate to identify a single row in the customer table. In this row, the XML document in the info column is replaced with the result of the XMLVALIDATE function. The XMLVALIDATE function itself also takes the info column as input. If the document is not valid against the speciﬁed XML Schema, the update fails. Otherwise the document is replaced with itself and the OBJECTID of the XML Schema gets attached to the document. This links the document to its schema. The function XMLXSROBJECTID can take the document or any part of it as input, and returns the OBJECTID of the schema that the document was validated against (see section 17.10).


UPDATE customer
SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
WHERE id = 1000

Figure 17.33

Validating an existing document

The UPDATE statement in Figure 17.34 is similar to that in Figure 17.33, but has a different predicate in the WHERE clause. It tries to validate all documents in the XML column that have not been validated before. This update works as expected if all those documents are valid against the specified XML Schema. However, the problem with this UPDATE statement is that it fails and rolls back as soon as the first invalid document is encountered. The reason for this behavior is that the SQL/XML standard requires the XMLVALIDATE function to raise an error if validation fails. You will see later how error handling in a stored procedure can circumvent this problem (see Figure 17.38).

UPDATE customer
SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
WHERE info IS NOT VALIDATED

Figure 17.34

Validating multiple existing documents

Beware that a bulk update with validation of a large number of documents can take a significant amount of time. All affected documents are rewritten in the table space and logged.

If you are only interested in a Yes/No answer whether certain documents are valid for a given schema, and if you don’t require the relationship between documents and schema to be permanently recorded in the database, then a SELECT statement can be used instead of an UPDATE statement. The query in Figure 17.35 reads XML documents from the info column for all customers whose city is Toronto. At the same time it uses the XMLVALIDATE function in the SELECT clause to validate the documents upon retrieval. The query fails at runtime as soon as one document is retrieved that is not valid for the specified schema.

SELECT XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
FROM customer
WHERE XMLEXISTS('$INFO/customerinfo[addr/city = "Toronto"]')

Figure 17.35

Retrieving and validating documents at the same time

If the validation is performed in a stored procedure, an exception handler can catch and handle the validation failure. Figure 17.36 shows a simple stored procedure that takes a single XML document as input and returns 1 if the document is valid and 0 if it is not valid. If the input document is not valid for the specified schema, the exit handler catches the error that is raised by XMLVALIDATE and sets the output parameter isvalid to 0.

CREATE PROCEDURE validate(IN doc XML, OUT isvalid INTEGER)
LANGUAGE SQL
BEGIN
   DECLARE INVALID_DOCUMENT CONDITION FOR '2200M';
   DECLARE EXIT HANDLER FOR INVALID_DOCUMENT SET isvalid = 0;
   IF (XMLVALIDATE(doc ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
       IS VALIDATED) THEN
      SET isvalid = 1;
   END IF;
END #

Figure 17.36

Stored procedure to validate an existing document

The stored procedure in Figure 17.36 can be called from an application or from other stored procedures that manipulate XML documents. You can also call it in the DB2 Command Line Processor, if the first parameter of the stored procedure call is a query that produces a single XML document. This is illustrated in Figure 17.37, where the XML document with id = 1003 from the customer table is passed to the stored procedure for validation. The output shows that the output parameter isvalid has the value 1, which means that the document is valid.

db2 => call validate((SELECT info FROM customer WHERE id = 1003),?)

  Value of output parameters
  --------------------------
  Parameter Name  : ISVALID
  Parameter Value : 1

  Return Status = 0

db2 =>

Figure 17.37

Testing the validation stored procedure in the CLP

The stored procedure in Figure 17.38 is designed to perform the same task as the UPDATE statement in Figure 17.34. That is, it validates all documents in the XML column that have not been validated before. The major difference is that this stored procedure does not fail and abort when the ﬁrst invalid document is encountered. Instead, it loops over the XML documents and uses a CONTINUE handler to count invalid documents instead of raising an error. Alternatively, you could change the CONTINUE handler to write the id values of the invalid documents to a separate table, or take any other appropriate action.
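The alternative mentioned above, recording the id values of invalid documents instead of only counting them, might be sketched as follows. The table invalid_docs, the procedure name bulkvalidate_log, the variable current_id, and the INTEGER type of the id column are all assumptions made for this illustration; they do not appear in the original example.

```sql
-- Hypothetical logging table (assumption, not part of the original example)
CREATE TABLE invalid_docs (id INTEGER) #

CREATE PROCEDURE bulkvalidate_log(OUT num_invalid_docs INTEGER)
LANGUAGE SQL
BEGIN
   DECLARE count INTEGER DEFAULT 0;
   DECLARE current_id INTEGER;   -- id of the row currently being validated
   DECLARE INVALID_DOCUMENT CONDITION FOR '2200M';
   -- On a validation failure, log the id and continue with the next row
   DECLARE CONTINUE HANDLER FOR INVALID_DOCUMENT
      BEGIN
         INSERT INTO invalid_docs(id) VALUES (current_id);
         SET count = count + 1;
      END;
   FOR doc AS cur1 CURSOR FOR
         SELECT id, info FROM customer
         WHERE info IS NOT VALIDATED
         FOR UPDATE OF info
   DO
      SET current_id = doc.id;   -- remember the id before attempting validation
      UPDATE customer
         SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
         WHERE CURRENT OF cur1;
   END FOR;
   SET num_invalid_docs = count;
END #
```

After the procedure has run, a query on invalid_docs would list the documents that need attention, while all valid documents have been validated and linked to their schema.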


CREATE PROCEDURE bulkvalidate(OUT num_invalid_docs INTEGER)
LANGUAGE SQL
BEGIN
   DECLARE count INTEGER DEFAULT 0;
   DECLARE INVALID_DOCUMENT CONDITION FOR '2200M';
   DECLARE CONTINUE HANDLER FOR INVALID_DOCUMENT SET count = count + 1;
   FOR doc AS cur1 CURSOR FOR
         SELECT id, info FROM customer
         WHERE info IS NOT VALIDATED
         FOR UPDATE OF INFO
   DO
      UPDATE customer
         SET info = XMLVALIDATE(info ACCORDING TO XMLSCHEMA ID db2admin.custxsd)
         WHERE CURRENT OF cur1;
   END FOR;
   SET num_invalid_docs = count;
END #

Figure 17.38

Stored procedure to validate multiple existing documents

17.10

FINDING THE XML SCHEMA FOR A VALIDATED DOCUMENT

DB2 for Linux, UNIX, and Windows also allows you to determine which XML Schema was used to validate a particular XML document. Every XML Schema that is registered in DB2 is assigned an internal identiﬁcation number of type BIGINT. You can see this number in the column OBJECTID of the catalog view SYSCAT.XSROBJECTS. Whenever an XML document is validated against an XML Schema, the unique identiﬁer (OBJECTID) is stored with the XML document. The scalar function XMLXSROBJECTID takes an XML document as input and returns the OBJECTID of the XML Schema that was used to validate the XML document. If the input document hasn’t been validated, the value 0 is returned. There are several interesting uses of the function XMLXSROBJECTID. One is to ﬁnd the XML Schema that was used to validate a speciﬁc document. Another is ﬁnding all documents that have been validated against a particular XML Schema. Figure 17.39 shows how to use the function XMLXSROBJECTID in the WHERE clause of an SQL statement to join with the OBJECTID column in the catalog view syscat.xsrobjects. Together with the predicate on the relational id column, this retrieves information about the schema that was used to validate the document with id 1003. Instead of the relational predicate you can certainly also use an XMLEXISTS predicate to qualify one or multiple XML documents based on the contents of the XML document itself.


SELECT c.id,
       SUBSTR(x.objectschema,1,10) AS xmlschema_schema,
       SUBSTR(x.objectname,1,10) AS xmlschema_name
FROM customer c, syscat.xsrobjects x
WHERE XMLXSROBJECTID(c.info) = x.OBJECTID
  AND c.id = 1003;

ID              XMLSCHEMA_SCHEMA XMLSCHEMA_NAME
--------------- ---------------- --------------
           1003 DB2ADMIN         CUSTXSD

Figure 17.39

Finding schema information for a given XML document
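As noted before Figure 17.39, the relational predicate can be exchanged for an XMLEXISTS predicate. A minimal sketch of that variation follows; the city predicate is borrowed from Figure 17.35, and the variable name i in the PASSING clause is arbitrary.

```sql
-- Sketch: schema information for all documents of customers in Toronto,
-- qualified by document content rather than by the relational id column.
SELECT c.id,
       SUBSTR(x.objectschema,1,10) AS xmlschema_schema,
       SUBSTR(x.objectname,1,10) AS xmlschema_name
FROM customer c, syscat.xsrobjects x
WHERE XMLXSROBJECTID(c.info) = x.OBJECTID
  AND XMLEXISTS('$i/customerinfo[addr/city = "Toronto"]'
                PASSING c.info AS "i");
```

Documents that have never been validated return 0 from XMLXSROBJECTID and would be expected to drop out of the join, so this query lists validated documents only.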

NOTE

There is no hard dependency between a document and the XML Schema it was validated against. This means that an XML Schema can be dropped from the XML Schema Repository even if the database contains documents that were validated against this schema. Those documents continue to carry the OBJECTID of the XML Schema even after the schema is dropped. The OBJECTID now points to a non-existing XML Schema, which has no impact other than the obvious; that is, you won’t find the schema that belongs to these documents.

While the query in Figure 17.39 finds the XML Schema for a given document, the query in Figure 17.40 finds the documents that were validated with a given XML Schema. Again, the function XMLXSROBJECTID facilitates the join between the customer table and the XML Schema Repository. The second and the third predicates select the particular XML Schema db2admin.custxsd for which the query finds all corresponding XML documents.

SELECT c.id
FROM customer c, syscat.xsrobjects x
WHERE XMLXSROBJECTID(c.info) = x.OBJECTID
  AND x.objectschema = 'DB2ADMIN'
  AND x.objectname = 'CUSTXSD'

Figure 17.40

Finding documents for given XML Schema, using XMLXSROBJECTID

Since DB2 9.5 for Linux, UNIX, and Windows you can also use the IS VALIDATED predicate with the ACCORDING TO clause, as shown in Figure 17.41.

SELECT c.id
FROM customer c
WHERE c.info IS VALIDATED ACCORDING TO XMLSCHEMA ID db2admin.custxsd

Figure 17.41

Finding documents for given XML Schema, using IS VALIDATED


If you use multiple XML Schemas to validate documents within a single XML column, and if you frequently need to run queries that relate documents to schemas, consider storing the OBJECTID in an additional column of your table with an index on it. This additional column can greatly improve the performance of ﬁnding schemas and documents that relate to each other. In DB2 for z/OS, such an extra column is the only way to correlate documents to schemas.
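For DB2 for Linux, UNIX, and Windows, the extra column suggested above could be set up roughly as follows. The column name schema_objectid and the index name idx_cust_schema are hypothetical, and the sketch assumes that your application or a trigger keeps the column current when documents are inserted, updated, or revalidated.

```sql
-- Sketch: keep the schema OBJECTID in a relational column with an index.
ALTER TABLE customer ADD COLUMN schema_objectid BIGINT;

-- Populate the column for existing rows (0 means "not validated")
UPDATE customer SET schema_objectid = XMLXSROBJECTID(info);

CREATE INDEX idx_cust_schema ON customer(schema_objectid);

-- Find all documents validated against db2admin.custxsd via the indexed column
SELECT c.id
FROM customer c
WHERE c.schema_objectid =
      (SELECT x.objectid
       FROM syscat.xsrobjects x
       WHERE x.objectschema = 'DB2ADMIN'
         AND x.objectname = 'CUSTXSD');
```

The design trade-off is extra maintenance on every write in exchange for an indexable predicate, since the lookup no longer has to evaluate XMLXSROBJECTID for each document.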

17.11

HOW TO UNDO DOCUMENT VALIDATION

It is possible to make a validated XML document look and behave as if it had never been validated. When you “undo” the validation, the linkage between the document and any XML Schema is removed, because the OBJECTID of an XML Schema is no longer associated with the document. All it takes is to update the validated document with itself and reparse it without validation. You will probably rarely have to do this, but we want to show that it is possible if needed. It only applies to DB2 for Linux, UNIX, and Windows.

You “remove validation” from a document with an UPDATE statement and the XMLSERIALIZE and XMLPARSE functions as shown in Figure 17.42. This statement serializes the stored document tree back to text format and then parses it again to produce DB2’s internal tree format, but without validation (assuming you don’t have triggers that enforce validation). The document now looks like it has never been validated.

UPDATE customer
SET info = XMLPARSE(DOCUMENT XMLSERIALIZE(info AS CLOB(5000)))
WHERE id = 1000

Figure 17.42

Undoing validation disassociates a document from its schema

Note that the XMLSERIALIZE function requires you to use a character type, such as VARCHAR or CLOB, that is large enough to temporarily hold the serialized document.
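The same technique can be applied to several documents at once by changing the WHERE clause. The following sketch removes validation from every validated document in the table; the CLOB(1M) size is an assumption and has to be large enough for the largest serialized document.

```sql
-- Sketch: undo validation for all documents that are currently validated.
-- Each affected document is serialized and reparsed, so this rewrites the
-- rows and can take a while on large tables.
UPDATE customer
SET info = XMLPARSE(DOCUMENT XMLSERIALIZE(info AS CLOB(1M)))
WHERE info IS VALIDATED
```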

17.12

CONSIDERATIONS FOR VALIDATION IN DB2 FOR Z/OS

Throughout this chapter you have seen many ways in which the function XMLVALIDATE can be used in DB2 for Linux, UNIX, and Windows to validate XML documents against an XML Schema. The equivalent function in DB2 9 for z/OS is called SYSFUN.DSN_XMLVALIDATE. The main difference between the two is that DSN_XMLVALIDATE must be an argument to the XMLPARSE function. The other difference is that DSN_XMLVALIDATE does not use an ACCORDING TO XMLSCHEMA clause to identify an XML Schema, but a regular parameter instead. The following sections provide examples.

17.12.1

Document Validation Upon Insert

The DSN_XMLVALIDATE function can take either two or three input parameters. The first parameter is the XML document that you want to validate. It must be of type CLOB or BLOB with a maximum size of 250MB, or of type VARCHAR with a maximum size of 32KB.

If you are using DSN_XMLVALIDATE with two parameters, then the second parameter has to be the SQL identifier of the XML Schema that you want to use for validation. This parameter cannot be NULL. Figure 17.43 shows two INSERT statements that use DSN_XMLVALIDATE with two parameters. The first statement provides the XML document as a parameter marker, and the second uses a host variable. Both specify that the document is to be validated against the XML Schema SYSXSR.CUSTXSD. An error is returned if an XML Schema with this identifier is not found in DB2’s XML Schema Repository (XSR).

INSERT INTO customer(id, info)
VALUES (?, XMLPARSE( DOCUMENT
           SYSFUN.DSN_XMLVALIDATE( CAST(? AS CLOB), 'SYSXSR.CUSTXSD') ) );

INSERT INTO customer(id, info)
VALUES (:id, XMLPARSE( DOCUMENT
           SYSFUN.DSN_XMLVALIDATE( :document_hv, 'SYSXSR.CUSTXSD') ) );

Figure 17.43

Referencing the XML Schema by its SQL identiﬁer

If you are using DSN_XMLVALIDATE with three parameters, then the second and third parameters must be the target namespace and the schema location of the XML Schema that you want to use for validation (see Figure 17.44). This combination of target namespace and schema location must uniquely identify an XML Schema that is registered in the XSR, otherwise an error is raised.

If you use DSN_XMLVALIDATE with three parameters, the second and/or the third parameter can be NULL. In this case DB2 still looks for a corresponding XML Schema in its XML Schema Repository. If both parameters are NULL, DB2 expects to find exactly one schema in the XSR whose target namespace and schema location are NULL. DB2 for z/OS does not infer the schema from a schema location attribute inside the XML document that you want to validate.

INSERT INTO customer(id, info)
VALUES (?, XMLPARSE( DOCUMENT
           SYSFUN.DSN_XMLVALIDATE( CAST(? AS CLOB),
              'http://pureXMLcookbook.org', NULL ) ) );

INSERT INTO customer(id, info)
VALUES (?, XMLPARSE( DOCUMENT
           SYSFUN.DSN_XMLVALIDATE( :document_hv,
              'http://pureXMLcookbook.org', 'customer.xsd' ) ) );

INSERT INTO customer(id, info)
VALUES (:id, XMLPARSE( DOCUMENT
           SYSFUN.DSN_XMLVALIDATE( :document_hv, NULL, 'customer.xsd' ) ) );

INSERT INTO customer(id, info)
VALUES (:id, XMLPARSE( DOCUMENT
           SYSFUN.DSN_XMLVALIDATE( :document_hv, NULL, NULL ) ) );

Figure 17.44

Referencing the XML Schema by target namespace and schema location

The previous examples provided either the SQL identifier of the XML Schema, or the target namespace and schema location as string literals. Alternatively you can provide them through parameter markers or host variables. The first INSERT statement in Figure 17.45 uses the DSN_XMLVALIDATE function with two parameter markers. The first provides the document to validate and the second provides the SQL identifier of the XML Schema. The second parameter cannot provide an actual XML Schema document for validation, because DB2 only validates against schemas that were previously registered in the XSR. The second INSERT statement in Figure 17.45 uses DSN_XMLVALIDATE with three host variables, which means that the schema is being identified by target namespace and schema location.

INSERT INTO customer(id, info)
VALUES (?, XMLPARSE( DOCUMENT
           SYSFUN.DSN_XMLVALIDATE( CAST(? AS CLOB), ?) ) );

INSERT INTO customer(id, info)
VALUES (:id, XMLPARSE( DOCUMENT
           SYSFUN.DSN_XMLVALIDATE( :document_hv,
              :tgtnamespace_hv, :schemalocation_hv) ) );

Figure 17.45

Providing schema identiﬁcation via parameter markers or host variables

The DSN_XMLVALIDATE function can only be used as a parameter to the XMLPARSE function, and in that case the XMLPARSE function cannot use the PRESERVE WHITESPACE clause. Validation always implies that boundary whitespace is stripped, not preserved, in both DB2 for z/OS and DB2 for Linux, UNIX, and Windows.

17.12.2

Document Validation Upon Update

If you use SQL UPDATE statements in DB2 for z/OS to replace existing documents, the DSN_XMLVALIDATE function allows you to validate the new document as part of the update process. In the previous sections you have seen various ways in which you can provide input to the DSN_XMLVALIDATE function. All of them work in UPDATE statements as well, as in Figure 17.46.

UPDATE customer
SET info = XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
             :document_hv, 'SYSXSR.CUSTXSD'))
WHERE id = 1003

Figure 17.46

DSN_XMLVALIDATE in an UPDATE statement

17.12.3

Validating Existing Documents in a Table

There may be situations where you already have XML documents stored in an XML column and want to validate them against an XML Schema. For example, the query in Figure 17.47 selects all documents for customers in Toronto and validates them upon retrieval. Remember that the DSN_XMLVALIDATE function requires the input document to be of type CLOB or BLOB. However, the column info in our customer table is of type XML. Therefore, at the time of writing, the function XMLSERIALIZE is required to convert the XML documents to type CLOB or BLOB.

SELECT XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
         XMLSERIALIZE(info AS CLOB), 'SYSXSR.CUSTXSD'))
FROM customer
WHERE XMLEXISTS('$i/customerinfo/addr[city = "Toronto"]'
                PASSING info AS "i");

Figure 17.47

Validating existing documents in a table

The query in Figure 17.47 parses and validates all matching documents, which requires more CPU cycles than simply retrieving the documents without reparsing them. The query raises an error as soon as one document is encountered that is not valid against the schema SYSXSR.CUSTXSD. You can capture and handle this error in a stored procedure, similar to how it is discussed in section 17.9.

17.12.4

Summary of Platform Similarities and Differences

Table 17.2 provides a summary of the differences in validation functionality between DB2 for z/OS and DB2 for Linux, UNIX, and Windows. This comparison is a point-in-time snapshot and subject to change. Over time, the features supported in DB2 for z/OS and DB2 for Linux, UNIX, and Windows continue to converge.

Table 17.2

Summary of Platform Similarities and Differences

| Feature | DB2 for Linux, UNIX, and Windows | DB2 for z/OS |
|---|---|---|
| Document validation for INSERT and UPDATE operations | Yes | Yes |
| Validation function | XMLVALIDATE | DSN_XMLVALIDATE; always has to be an argument of the XMLPARSE function. |
| Can reference XML Schema by its SQL identifier | Yes | Yes |
| Can reference XML Schema by target namespace and schema location | Yes | Yes |
| Can validate existing documents in a table | Yes | Yes |
| Can perform validation in stored procedures | Yes | Yes |
| Validation support in the LOAD utility | Yes | You can validate documents after LOAD. |
| Link between documents and schemas is stored with each validated document | Yes* | You can maintain this information in a separate column of the user table. |
| IS VALIDATED predicate to check whether a document has been validated | Yes* | You can get this information from a separate column in the user table where you record the schema ID for each document. |
| Function XMLXSROBJECTID to find documents for a given schema, or vice versa | Yes* | |

*If you query the relationship between documents and schemas often, you might want to maintain this information (the schema ID for any given document) in a separate column that is indexed to ensure good performance.
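The separate-column approach mentioned in the table and its footnote can be sketched as follows for DB2 for z/OS. This is only an illustration under stated assumptions: the table customer_tracked, the column schema_id, and the index name are hypothetical and do not appear elsewhere in this book, and the application is assumed to be responsible for keeping schema_id in step with the schema that was actually used for validation.

```sql
-- Hypothetical table: schema_id is an ordinary column maintained by the application
CREATE TABLE customer_tracked (
  id        INTEGER NOT NULL,
  info      XML,
  schema_id VARCHAR(128)     -- for example 'SYSXSR.CUSTXSD'; NULL if not validated
);

-- Index the column so that lookups by schema remain efficient
CREATE INDEX customer_schema_idx ON customer_tracked(schema_id);

-- Validate on insert and record which schema was used
INSERT INTO customer_tracked(id, info, schema_id)
VALUES (:id,
        XMLPARSE(DOCUMENT SYSFUN.DSN_XMLVALIDATE(
                   :document_hv, 'SYSXSR.CUSTXSD')),
        'SYSXSR.CUSTXSD');

-- Find all documents that were validated against a given schema
SELECT id FROM customer_tracked WHERE schema_id = 'SYSXSR.CUSTXSD';

-- Find all documents that have not been validated
SELECT id FROM customer_tracked WHERE schema_id IS NULL;
```

Because the database does not enforce the link between info and schema_id in this sketch, every INSERT and UPDATE path has to set both columns together.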

17.13

SUMMARY

Validating XML documents against XML Schemas is the best way to enforce XML data quality in the database. However, document validation is optional in DB2 and there is no performance or functional penalty if you don't use an XML Schema. If you choose to validate documents, you typically do so when you insert, update, or load them. Existing documents in the database can also be validated in queries. An XML column can contain a mix of validated and non-validated documents, and different documents in a column can be validated with different schemas. In DB2 you are not forced to assign a single XML Schema to an entire XML column.

There are two general approaches for document validation in DB2:

• Application-centric: Applications use the XMLVALIDATE (or DSN_XMLVALIDATE) function in their INSERT and UPDATE statements. This makes validation a distributed responsibility and provides maximum flexibility.

• Database-centric: The database uses triggers and check constraints to enforce validation on a per-XML-column basis.

These application- and database-centric techniques can also be combined to implement a custom validation strategy that meets specific requirements.
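As a reminder of what the database-centric approach can look like in DB2 for Linux, UNIX, and Windows, here is a minimal sketch. The constraint name, the trigger name, and the schema identifier db2admin.custxsd are assumptions chosen for illustration; substitute the SQL identifier under which your schema is registered in the XSR. The # character is used as the statement terminator.

```sql
-- Check constraint: reject any document that has not been validated
ALTER TABLE customer
  ADD CONSTRAINT info_validated CHECK (info IS VALIDATED)#

-- BEFORE trigger: validate incoming documents that are not yet validated,
-- so that applications do not have to call XMLVALIDATE themselves
CREATE TRIGGER validate_on_insert
  NO CASCADE BEFORE INSERT ON customer
  REFERENCING NEW AS n
  FOR EACH ROW MODE DB2SQL
  WHEN (n.info IS NOT VALIDATED)
  BEGIN ATOMIC
    SET n.info = XMLVALIDATE(n.info
                   ACCORDING TO XMLSCHEMA ID db2admin.custxsd);
  END #
```

With both objects in place, the trigger validates documents that arrive unvalidated, and the check constraint guards the column against documents that reach it without validation.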

CHAPTER 18

Using XML in Stored Procedures, UDFs, and Triggers

Stored procedures, user-defined functions (UDFs), and triggers are database objects that encapsulate processing steps to retrieve or manipulate data in the database. They can contain multiple statements that are invoked and executed as a single unit. They are typically used to implement application-specific logic. Stored procedures and UDFs can be implemented in the SQL Procedure Language (SQL PL) or in external languages such as Java, C, or COBOL. The benefits of stored procedures and UDFs include:

• Reduced coding labor due to the creation of reusable processing modules
• Richer processing capabilities in the database by defining custom logic and functions
• Improved performance and reduced network traffic because stored procedures and UDFs are executed close to the data; that is, in the database engine

Stored procedures are executed with CALL statements, which can be issued from an application program, from another stored procedure, from a UDF, or from a trigger. UDFs are used in SQL statements just like predefined SQL functions. Triggers are executed automatically when an insert, delete, or update operation happens on a specified table. Triggers are used to implement automated reactions to data modifications and to enforce data integrity rules within the database. The benefits of stored procedures, UDFs, and triggers apply equally to the processing of XML data and relational data.

In this chapter we discuss the following topics:

• Manipulating XML data in stored procedures (section 18.1)
• Manipulating XML data in user-defined functions (section 18.2)
• Manipulating XML data in triggers (section 18.3)

For general background on stored procedures, UDFs, triggers, and the SQL Procedure Language, please consult the resources listed in Appendix C, Further Reading.

18.1

MANIPULATING XML IN SQL STORED PROCEDURES

Stored procedures are a powerful tool for application development. They allow you to define simple or complex multi-statement operations and processing logic that can be invoked with a single call from the application. Stored procedures can encapsulate and hide complex data manipulation from the client application. Since stored procedures are executed in the database server, they can process data without moving it to the client, which is often beneficial for performance. In previous chapters you have already seen several examples where stored procedures implement specific tasks:

• Section 7.7, Figure 7.41: Stored procedure to execute XPath dynamically
• Section 17.3, Figure 17.7: Stored procedure to handle and record validation errors
• Section 17.9, Figure 17.36: Stored procedure to validate an existing document
• Section 17.9, Figure 17.38: Stored procedure to validate multiple existing documents

DB2 for Linux, UNIX, and Windows allows you to use the XML data type not just to define columns in a table, but also to declare input and output parameters as well as variables in stored procedures and user-defined functions. Stored procedures can therefore manipulate XML documents in their parsed format without incurring additional XML parsing, which is a major performance benefit.

Variables of data type XML can be manipulated in stored procedures much like variables of other types. For example, XML variables can receive their value through statements such as a SET statement or a SELECT INTO statement. The only restriction is that XML variables and XML input parameters lose their value upon a COMMIT or ROLLBACK operation. If you want to use an XML variable or parameter after a ROLLBACK or COMMIT statement, you need to assign a new value to it first. Otherwise error SQL1354N is raised.

The best way to use XPath or XQuery expressions in stored procedures is to embed them in the SQL/XML functions XMLQUERY, XMLTABLE, or XMLEXISTS. These can be used in stored procedure statements and accept variables of type XML in their PASSING clause. You can also use XQuery without SQL in stored procedures, but only with dynamic cursors. Static XQuery is not allowed.
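The COMMIT restriction on XML variables can be illustrated with a short sketch. The procedure copyAddress and its parameters are hypothetical, and the customer table is assumed to have the cid and info columns of the DB2 sample database as well as the addrtable table from Figure 18.1. Note that the procedure body is not declared ATOMIC, so that a COMMIT is permitted inside it.

```sql
CREATE PROCEDURE copyAddress(IN custId INTEGER, OUT phoneCount INTEGER)
BEGIN   -- compound statement without ATOMIC, so that COMMIT is permitted
  DECLARE doc XML;

  -- first assignment of the XML variable
  SELECT info INTO doc FROM customer WHERE cid = custId;

  INSERT INTO addrtable(id, addr)
    VALUES (custId,
            XMLDOCUMENT(XMLQUERY('$d/customerinfo/addr'
                                 PASSING doc AS "d")));
  COMMIT;

  -- doc no longer holds a value at this point;
  -- assign it again before the next use to avoid SQL1354N
  SELECT info INTO doc FROM customer WHERE cid = custId;

  SET phoneCount = XMLCAST(XMLQUERY('count($d/customerinfo/phone)'
                                    PASSING doc AS "d") AS INTEGER);
END #
```

If the second SELECT INTO were omitted, the final SET statement would reference an XML variable that lost its value at the COMMIT.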

18.1.1

Basic XML Manipulation in Stored Procedures

Let's look at Figure 18.1 to become familiar with the basic capabilities of handling XML data in stored procedures. The table addrtable is defined in addition to the customer table that we have been using. The stored procedure has one input parameter and one output parameter, both of type XML. Additionally, the procedure declares the variables id and address of type INTEGER and XML, respectively.

The first SET statement extracts the Cid attribute from the input document, converts it to INTEGER, and assigns it to the variable id. Note that the input parameter custDoc is passed into the XMLQUERY function. Next is the SELECT INTO statement, which demonstrates two important capabilities. First, the INTO clause is used to assign an XML value to the XML output parameter olddoc. Second, the variable id is passed into the XMLEXISTS predicate so that only the matching document is retrieved from the customer table.

The last part of the stored procedure shows that you can use the XMLEXISTS predicate directly in an IF statement. It checks whether the address in the input document is in Canada. If this is true, the SET statement extracts the addr element of the document and assigns it to the XML variable address. Subsequently the address and the id variables are inserted into the table addrtable.

CREATE TABLE addrtable(id INTEGER, addr XML)#

CREATE PROCEDURE processDoc(IN custDoc XML, OUT oldDoc XML)
BEGIN ATOMIC
  DECLARE id INTEGER;
  DECLARE address XML;

  SET id = XMLCAST(XMLQUERY('$d/customerinfo/@Cid'
                            PASSING custDoc AS "d") AS INTEGER);

  SELECT info INTO olddoc
  FROM customer
  WHERE XMLEXISTS('$INFO/customerinfo[@Cid = $x]' PASSING id AS "x");

  IF XMLEXISTS('$d/customerinfo/addr[@country = "Canada"]'
               PASSING custDoc AS "d")
  THEN
    SET address = XMLQUERY('$d/customerinfo/addr' PASSING custDoc AS "d");
    INSERT INTO addrtable(id, addr) VALUES(id, XMLDOCUMENT(address));
  END IF;
END #

Figure 18.1

Stored procedure with basic XML manipulation

Since the body of a stored procedure can contain multiple statements, these statements have to be separated by the semicolon character. This use of the semicolon conflicts with the fact that the semicolon is also the default terminating character for statements in the DB2 Command Line Processor (CLP). The same applies to user-defined functions and triggers. To avoid problems you need to use a different terminating character in the CLP. For example, in Figure 18.1 the # is used as the terminating character for the CREATE PROCEDURE statement. You must invoke the CLP with the -td# option to set the #, or any other character of your choosing, as the statement terminator. If the CREATE PROCEDURE statement in Figure 18.1 is in a file create_proc.sql, then the following command issued at the OS prompt creates the procedure:

db2 -td# -f create_proc.sql
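If a script mixes CREATE PROCEDURE statements with ordinary statements, an alternative that you may find convenient is the CLP comment directive --#SET TERMINATOR, which changes the terminating character within the script itself. The following sketch assumes that your CLP version supports this directive; the procedure countPhones is a hypothetical example and not part of the sample database.

```sql
--#SET TERMINATOR #
CREATE PROCEDURE countPhones(IN custDoc XML, OUT n INTEGER)
BEGIN ATOMIC
  -- semicolons inside the body no longer end the CREATE PROCEDURE statement
  SET n = XMLCAST(XMLQUERY('count($d/customerinfo/phone)'
                           PASSING custDoc AS "d") AS INTEGER);
END #

--#SET TERMINATOR ;
-- ordinary statements can use the semicolon again
SELECT COUNT(*) FROM customer;
```

A script written this way would be run with db2 -tf followed by the file name, because the terminator is set inside the file rather than on the command line.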

18.1.2

A Stored Procedure to Store XML in a Hybrid Manner

Let's look at a common use case for a stored procedure. Assume you want to store the customer sample documents in a hybrid fashion. You might decide to keep the address information as XML, because you expect its format to vary over time, but you want to store customer name and phone information in relational columns. Since each customer can have multiple phone numbers (one-to-many relationship), the phone numbers have to be stored in a separate table with a proper join key. That join key can be a number generated by a sequence for each new XML document that comes in. A sequence is a database object that produces a stream of unique values. Figure 18.2 shows the definition of the target tables and the sequence.

CREATE TABLE cust (id INTEGER, name VARCHAR(20), addr XML);
CREATE TABLE phone(id INTEGER, type VARCHAR(20), number VARCHAR(20));
CREATE SEQUENCE id_seq START WITH 1 INCREMENT BY 1 CACHE 100;

Figure 18.2

Table and sequence deﬁnition for hybrid storage

The stored procedure in Figure 18.3 takes a customer XML document as an input parameter. Note that this parameter is of type XML. Each time the procedure is called, it uses the NEXTVAL expression to pull a new id value from the sequence. Then it uses two INSERT statements with XMLTABLE functions to extract the required values for insert into the target tables cust and phone. The first insert produces one row per customer, the second produces one row per phone element. The same id value is used for inserts into both tables to ensure referential integrity. Instead of using the sequence, the id could also be passed as a parameter from the calling application, or extracted from the document.

CREATE PROCEDURE insertCustomer(IN custDoc XML, OUT id INTEGER)
BEGIN ATOMIC
  SET id = NEXTVAL FOR id_seq;

  INSERT INTO cust(id, name, addr)
    SELECT id, T.name, T.address
    FROM XMLTABLE('$d/customerinfo' PASSING custDoc AS "d"
           COLUMNS
             name    VARCHAR(20) PATH 'name',
             address XML         PATH 'document{addr}') AS T;

  INSERT INTO phone(id, type, number)
    SELECT id, T.type, T.num
    FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
           COLUMNS
             type VARCHAR(20) PATH '@type',
             num  VARCHAR(20) PATH '.') AS T;
END #

Figure 18.3

Stored procedure for hybrid XML inserts
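When inserts go through a procedure like insertCustomer, deletes usually need a counterpart that removes the rows from both tables in one unit of work. The following is a minimal sketch of such a companion procedure; the name deleteCustomer and its parameter are hypothetical, and it assumes the cust and phone tables of Figure 18.2.

```sql
CREATE PROCEDURE deleteCustomer(IN custId INTEGER)
BEGIN ATOMIC
  -- remove the dependent phone rows first, then the customer row;
  -- ATOMIC ensures that either both deletes take effect or neither does
  DELETE FROM phone WHERE id = custId;
  DELETE FROM cust  WHERE id = custId;
END #
```

An application would invoke it with CALL deleteCustomer(?) and bind the id that insertCustomer returned earlier.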

With the stored procedure in Figure 18.3 in place, an application should use the stored procedure call CALL insertCustomer(?, ?) to insert new customer documents and never use direct INSERT statements. If all inserts are performed through this stored procedure, the relational and XML data in the tables are always consistent. You can have similar stored procedures for update and delete operations. The stored procedures can also contain additional business logic or data manipulation.

A challenging situation occurs when the stored procedure in Figure 18.3 fails with the following error message, where <value> is a data value in the input document that cannot be cast to the data type VARCHAR(20):

SQL16061N  The value "<value>" cannot be constructed as, or cast (using an implicit or explicit cast) to the data type "VARCHAR_20". Error QName=err:FORG0001. SQLSTATE=10608.

Note that the XMLTABLE functions in the stored procedure cast the customer name, phone type, and phone number to VARCHAR(20). However, the error message does not specify which one of them caused the problem. In this simple example, a quick look at the <value> might reveal which XML element or attribute caused the error. However, in more complex cases it is often difficult to identify which element or attribute is responsible for the error. The solution is to add code to the stored procedure to catch the SQL error, obtain the offending <value>, look for it in the input document, and return the name of the XML element or attribute that caused the problem. This logic is coded in Figure 18.4.

The INSERT statements in the procedure in Figure 18.4 are the same as in Figure 18.3. The difference in Figure 18.4 is the error handling. The procedure declares SQLSTATE 10608 as a condition, and an exit handler to take appropriate action when this condition occurs. This action is enclosed in a separate BEGIN-END block and only executed when the declared error happens. The exit handler obtains the error information and uses the SUBSTR function to extract the offending <value> and data type from it. Then it uses the XQuery expression $d//(*,@*)[data(.) = $v]/local-name() to obtain the name of the element or attribute that contains the offending value. In this expression, $d represents the XML document and $v the value to look for.

The first part of the expression, $d//(*,@*), iterates over all elements and attributes in the document. For each of those, the predicate [data(.) = $v] checks whether the value of the element or attribute matches the <value> from the error message. If the predicate is true, then the last step of the expression, /local-name(), obtains the name of the element or attribute. The whole expression is an argument of the function string-join, which produces a comma-separated list in case more than one node with the matching value is found in the document.

CREATE PROCEDURE insertCustomer(IN custDoc XML, OUT id INTEGER,
                                OUT MESSAGE_TEXT VARCHAR(300))
BEGIN ATOMIC
  DECLARE vErrMsg VARCHAR(300);
  DECLARE vValue VARCHAR(100);
  DECLARE vNode VARCHAR(100);
  DECLARE vType VARCHAR(100);
  DECLARE vTokenString VARCHAR(100);
  DECLARE XMLTABLE_CAST_FAILURE CONDITION FOR SQLSTATE '10608';

  DECLARE EXIT HANDLER FOR XMLTABLE_CAST_FAILURE
  BEGIN
    -- retrieve error message and token string
    GET DIAGNOSTICS EXCEPTION 1 vTokenString = DB2_TOKEN_STRING,
                                vErrMsg = MESSAGE_TEXT;
    SET vValue = SUBSTR(vErrMsg, 23, POSSTR(vErrMsg, '" ')-23);
    SET vType = SUBSTR(vTokenString, LENGTH(vValue)+2);
    -- find xml nodes whose values match the error token
    SET vNode = XMLCAST(XMLQUERY('
          string-join($d//(*,@*)[data(.) = $v]/local-name(),",")'
          PASSING custDoc AS "d", vValue AS "v") AS VARCHAR(100));
    -- create message text
    SET MESSAGE_TEXT = 'Failed to cast the value "' || vValue ||
        '", at element or attribute "' || vNode ||
        '", to type "' || vType || '".';
  END ;

  SET id = NEXTVAL FOR id_seq;

  INSERT INTO cust(id, name, addr)
    SELECT id, T.name, T.address
    FROM XMLTABLE('$d/customerinfo' PASSING custDoc AS "d"
           COLUMNS
             name    VARCHAR(20) PATH 'name',
             address XML         PATH 'document{addr}') AS T;

  INSERT INTO phone(id, type, number)
    SELECT id, T.type, T.num
    FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
           COLUMNS
             type VARCHAR(20) PATH '@type',
             num  VARCHAR(20) PATH '.') AS T;

  SET MESSAGE_TEXT = 'Insert successful.';
END #

Figure 18.4

Stored procedure for hybrid XML inserts with error handling

18.1.3

Loops and Cursors

The example in Figure 18.5 shows that you can easily loop over the elements and attributes from one or multiple XML documents. The stored procedure takes an XML document as input and uses a SELECT statement with an XMLTABLE function to produce one row for each phone element. The FOR statement is used to iterate over these rows. When a FOR statement is executed, a cursor is implicitly declared such that each iteration of the FOR loop fetches the next row from the result set until there are no rows left. For each row, the statements in the DO clause of the FOR statement are executed. An IF-THEN-ELSE statement inserts the phone information into the table cellphones if the phone type is cell, and into the table landlines otherwise. To keep stored procedures simple, we recommend the use of FOR statements instead of explicit cursor declarations whenever possible.

CREATE TABLE cellphones(id INTEGER, number VARCHAR(20))#
CREATE TABLE landlines(id INTEGER, number VARCHAR(20))#

CREATE PROCEDURE processPhones(IN custDoc XML)
BEGIN ATOMIC
  FOR phone AS
    SELECT T.id, T.type, T.num
    FROM XMLTABLE('$d/customerinfo/phone' PASSING custDoc AS "d"
           COLUMNS
             id   INTEGER     PATH '../@Cid',
             type VARCHAR(5)  PATH '@type',
             num  VARCHAR(20) PATH '.') AS T
  DO
    IF phone.type = 'cell' THEN
      INSERT INTO cellphones(id, number) VALUES(phone.id, phone.num);
    ELSE
      INSERT INTO landlines(id, number) VALUES(phone.id, phone.num);
    END IF;
  END FOR;
END #

Figure 18.5

FOR loop over repeating XML elements
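For comparison, when the per-row logic is as simple as routing rows to one of two tables, the same result can often be obtained without a loop. The following sketch is an alternative formulation, not the approach of Figure 18.5: it moves the phone-type test into the XPath expression of two INSERT statements. The procedure name processPhonesSetBased is hypothetical; the tables are those of Figure 18.5.

```sql
CREATE PROCEDURE processPhonesSetBased(IN custDoc XML)
BEGIN ATOMIC
  -- cell phones: the XPath predicate selects only phone elements of type "cell"
  INSERT INTO cellphones(id, number)
    SELECT T.id, T.num
    FROM XMLTABLE('$d/customerinfo/phone[@type = "cell"]'
                  PASSING custDoc AS "d"
           COLUMNS
             id  INTEGER     PATH '../@Cid',
             num VARCHAR(20) PATH '.') AS T;

  -- all other phones, including phone elements without a type attribute
  INSERT INTO landlines(id, number)
    SELECT T.id, T.num
    FROM XMLTABLE('$d/customerinfo/phone[not(@type = "cell")]'
                  PASSING custDoc AS "d"
           COLUMNS
             id  INTEGER     PATH '../@Cid',
             num VARCHAR(20) PATH '.') AS T;
END #
```

The FOR loop remains the better fit when each row needs procedural logic that cannot be expressed as a predicate, such as calling another procedure per phone element.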

You can also use XQuery without SQL in stored procedures, but not in a FOR statement or in any static manner. You have to construct the XQuery dynamically as a string and prepare and open it as a dynamic cursor. In Figure 18.6 an XQuery string is assigned to the variable xqr. Note that the query string includes the value of the input parameter city. The query is then prepared and opened as a CURSOR WITH RETURN TO CALLER. With this cursor definition, the result sequence of the XQuery becomes the result set of the stored procedure. The procedure does not fetch from or close the cursor, which allows the calling application to iterate over the result of the query. Alternatively you could decide to have a WHILE loop with a FETCH statement in the stored procedure itself to process the result set.

CREATE PROCEDURE cityphones(IN city VARCHAR(20))
BEGIN ATOMIC
  DECLARE xqr VARCHAR(2048);
  DECLARE c1 CURSOR WITH RETURN TO CALLER FOR stmt;

  SET xqr = 'xquery for $i in db2-fn:xmlcolumn("CUSTOMER.INFO")
             where $i/customerinfo/addr[city="'|| city ||'"]
             return $i/customerinfo/phone';

  PREPARE stmt FROM xqr;
  OPEN c1;
END #

Figure 18.6

Dynamic cursor for an XQuery

18.1.4

A Stored Procedure to Update a Selected XML Element or Attribute

The stored procedure in Figure 18.7 changes the value of a selected XML node in a document. The input parameters to the procedure are an XML document, the path to the node that is to be updated, and the new value of the node. The parameter for the XML document is declared as INOUT, so that the updated document is returned. The procedure constructs an XQuery Update expression in an XMLQUERY function. The input parameter xpath provides the target path for the replace clause. Additionally, the document and the new value are passed as parameters into the XQuery Update expression. The statement OPEN c1 USING mydoc, value binds the procedure parameters mydoc and value to the parameter markers in the XMLQUERY function.

CREATE PROCEDURE updateXPath (INOUT mydoc XML,
                              IN xpath VARCHAR(1024),
                              IN value VARCHAR(128))
BEGIN ATOMIC
  DECLARE sql VARCHAR(2048);
  DECLARE c1 CURSOR FOR stmt;

  SET sql = 'VALUES XMLQUERY(''
               copy $new := $original
               modify do replace value of $new' || xpath ||'
                         with $value
               return $new ''
             PASSING XMLCAST(? AS XML) AS "original",
                     CAST(? AS VARCHAR(1024)) AS "value") ';

  PREPARE stmt FROM sql;
  OPEN c1 USING mydoc, value;
  FETCH c1 INTO mydoc;
  CLOSE c1;
END #

Figure 18.7

Stored procedure to update a selected XML element or attribute

18.1.5

Three Tips for Testing Stored Procedures

The following three tips are not as widely known as they should be, but they are extremely useful.

Tip 1: How to Test Stored Procedures in the CLP

It is often very useful to test stored procedures in the CLP without having to write application code that calls the procedure and passes an XML document as input. You can simply import your test documents into a DB2 table, such as testdocs, and use an SQL fullselect as the input parameter in the stored procedure call in the CLP. Make sure that the fullselect produces exactly one row with one column of type XML, as shown in Figure 18.8. The second parameter is a question mark as a placeholder for the output parameter oldDoc.

CREATE TABLE testdocs(id INTEGER NOT NULL PRIMARY KEY, doc XML);

IMPORT FROM testdata.del OF DEL INSERT INTO testdocs;

CALL processDoc( (SELECT doc FROM testdocs WHERE id = 3), ? );

Figure 18.8

Testing a stored procedure

Tip 2: How to Get the Execution Plan of a Stored Procedure

If a stored procedure does not perform well, it can be useful to examine the execution plans of queries or other statements in the stored procedure. One approach is to copy individual statements from the stored procedure and to explain them separately. However, a statement can have a different execution plan when it is compiled in the context of a stored procedure than when it is compiled by itself. In DB2 for Linux, UNIX, and Windows you can use the following approach to explain the statements within a stored procedure.

1. Establish a connection to the database.

2. Create explain tables if they do not already exist (see section 14.1.1, The Explain Tables in DB2 for Linux, UNIX, and Windows).

3. Issue the following command at the OS prompt to enable the capturing of execution plans when stored procedures are created in the current session:

db2 "CALL SYSPROC.SET_ROUTINE_OPTS('EXPLAIN ALL')"

4. If a CREATE PROCEDURE statement is the only statement in a file called create_proc.sql, and if the statement is terminated with the # character, create the procedure with the following command at the OS prompt:

db2 -td# -f create_proc.sql

5. Use the db2exfmt utility to write the execution plan to a file such as myprocplan.txt:

db2exfmt -d <database_name> -1 -o myprocplan.txt

The output file will contain separate explain information for each statement in the stored procedure. If you want to check whether the capturing of explain information for stored procedures is enabled, use the following SELECT statement:

SELECT GET_ROUTINE_OPTS() FROM sysibm.sysdummy1

To revert to not explaining stored procedures, use this command:

db2 "CALL SYSPROC.SET_ROUTINE_OPTS('EXPLAIN NO')"
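The SQL portion of this procedure can also be collected in a single script that is run in one CLP session. The sketch below assumes that your DB2 version provides the SYSPROC.SYSINSTALLOBJECTS procedure for creating the explain tables; if it does not, create them as described in section 14.1.1 instead. The two NULL arguments are meant to request the default table space and schema.

```sql
-- Step 2: create the explain tables if they do not exist yet
-- (assumes SYSPROC.SYSINSTALLOBJECTS is available in your DB2 version)
CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN', 'C',
       CAST(NULL AS VARCHAR(128)), CAST(NULL AS VARCHAR(128)));

-- Step 3: capture execution plans for procedures created in this session
CALL SYSPROC.SET_ROUTINE_OPTS('EXPLAIN ALL');

-- Optional check: display the routine options currently in effect
SELECT GET_ROUTINE_OPTS() FROM sysibm.sysdummy1;

-- ... create the stored procedure here (step 4) ...

-- Afterwards: stop capturing plans for newly created procedures
CALL SYSPROC.SET_ROUTINE_OPTS('EXPLAIN NO');
```

The db2exfmt call in step 5 is an operating system command and therefore stays outside of such a script.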

Tip 3: How to Profile a Stored Procedure

IBM Data Studio Developer contains a very useful stored procedure profiler that can provide information about the runtime performance of a procedure. For each statement in the stored procedure, the profile reveals the number of executions, the elapsed time, CPU time, and other optional metrics such as the number of rows read or written, or the number of logical and physical page reads. This information is extremely helpful to understand the behavior of a complex stored procedure and to discover which parts of a procedure are particularly expensive to run.

If you have a Data Development Project in Data Studio and a stored procedure in the Stored Procedures folder of the Data Project Explorer, right-click on the procedure name and choose Run Profiling. The same context menu also has a command to invoke the stored procedure debugger, which is another helpful tool for the development of stored procedures in DB2 for Linux, UNIX, and Windows, and DB2 for z/OS.

18.2

MANIPULATING XML IN USER-DEFINED FUNCTIONS

DB2 9.7 for Linux, UNIX, and Windows allows you to use the XML data type in user-deﬁned functions (UDFs). UDFs can have XML type parameters and variables and can contain SQL/XML statements that manipulate XML data. Most of these capabilities are similar to the XML support in stored procedures. An important difference between UDFs and stored procedures is that UDFs can be used in SQL statements while stored procedures can only be invoked with a CALL statement. In this section we discuss several examples of UDFs that manipulate XML data.

18.2.1

A UDF to Extract an Element or Attribute Value

The function getname in Figure 18.9 takes an XML document as input and returns a value of type VARCHAR(25). The body of the function consists of a single RETURN statement. It contains the functions XMLCAST and XMLQUERY to extract the name element and convert it to VARCHAR(25). The PASSING clause of the XMLQUERY function passes the function's input parameter doc into the XPath expression. Below the function you see an SQL statement that invokes the function in its SELECT clause. The use of the UDF allows an application to retrieve customer names without having to code the actual XPath expression and SQL/XML functions.

CREATE FUNCTION getname(doc XML)
RETURNS VARCHAR(25)
LANGUAGE SQL CONTAINS SQL
NO EXTERNAL ACTION
DETERMINISTIC
BEGIN ATOMIC
  RETURN XMLCAST(XMLQUERY('$d/customerinfo/name'
                          PASSING doc AS "d") AS VARCHAR(25));
END #

SELECT getname(info) AS name
FROM customer
WHERE cid = 1002 #

NAME
-------------------------
Jim Noodle

  1 record(s) selected.

Figure 18.9

Scalar UDF to extract an element value

Such a scalar UDF also enables you to create a table with a generated column whose value is automatically computed based on the XML documents in an XML column:

CREATE TABLE custinfo(info XML,
                      name VARCHAR(25) GENERATED ALWAYS AS (getname(info)));

The function in Figure 18.9 is a scalar function, which means it returns a single value. If you want to use a similar function to extract a repeating element then a table function instead of a scalar function can be more appropriate. This is shown next.

18.2.2

A UDF to Extract the Values of a Repeating Element

Figure 18.10 demonstrates a function that extracts the phone elements from a given document. Since a customer document can have multiple phone elements, the return type of the UDF is a table. This UDF is therefore a table function. The structure of the returned table is defined in the second line of the CREATE FUNCTION statement. The body of the function contains a RETURN statement that includes an SQL/XML query that produces the rows and columns of the result table.

Below the function you see an SQL query that uses the UDF. Since this UDF is a table function, it is used in a table expression in the FROM clause of the SELECT statement. The result set of the query includes two columns from the UDF plus the cid column from the customer table.

CREATE FUNCTION getphone(doc XML)
RETURNS TABLE(type VARCHAR(10), number VARCHAR(20))
BEGIN ATOMIC
  RETURN
    SELECT type, number
    FROM XMLTABLE('$d/customerinfo/phone' PASSING doc AS "d"
           COLUMNS
             type   VARCHAR(10) PATH '@type',
             number VARCHAR(20) PATH '.') ;
END #

SELECT cid, p.type, p.number
FROM customer, TABLE(getphone(info)) p
WHERE cid = 1004#

CID         TYPE       NUMBER
----------- ---------- --------------------
       1004 work       905-555-4789
       1004 home       416-555-3376

  2 record(s) selected.

Figure 18.10

Table UDF to extract repeating element values

You can certainly use multiple UDFs in a single query, as illustrated by the query in Figure 18.11.

SELECT getname(info) AS name, p.type, p.number
FROM customer, TABLE(getphone(info)) p
WHERE cid IN (1004, 1005)

NAME                      TYPE       NUMBER
------------------------- ---------- --------------------
Matt Foreman              work       905-555-4789
Matt Foreman              home       416-555-3376
Larry Menard              work       905-555-9146
Larry Menard              home       416-555-6121

  4 record(s) selected.

Figure 18.11

Using a scalar UDF and a table UDF in a query

18.2.3

A UDF to Shred XML Data to a Relational Table

A table function can also help you shred XML data into a relational table. Suppose you want to populate the following target table:

CREATE TABLE address(cid INTEGER, name VARCHAR(30),
                     street VARCHAR(40), city VARCHAR(30))

To shred XML documents into this table, you can create a table function that takes an XML document as input and returns a set of rows with columns that match the target table. Figure 18.12 defines such a function.

CREATE FUNCTION extractcols(doc XML)
RETURNS TABLE(cid INT, name VARCHAR(30),
              street VARCHAR(40), city VARCHAR(30))
BEGIN ATOMIC
  RETURN
    SELECT x.custid, x.custname, x.str, x.city
    FROM XMLTABLE('$d/customerinfo' PASSING doc AS "d"
           COLUMNS
             custid   INTEGER     PATH '@Cid',
             custname VARCHAR(30) PATH 'name',
             str      VARCHAR(40) PATH 'addr/street',
             city     VARCHAR(30) PATH 'addr/city' ) AS x ;
END #

Figure 18.12

Table function to extract several elements and attributes
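To make the mapping from document to row more tangible, the following is a rough Python analogue of what extractcols returns for a single document. It is only an illustration outside DB2, not how DB2 evaluates XMLTABLE. The paths (@Cid, name, addr/street, addr/city) come from Figure 18.12; the sample document, its street and city values, and the helper name extract_cols are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Element and attribute names follow the
# paths in Figure 18.12; the street and city values are invented.
SAMPLE = """
<customerinfo Cid="1004">
  <name>Matt Foreman</name>
  <addr>
    <street>1596 Baseline</street>
    <city>Toronto</city>
  </addr>
</customerinfo>
"""

def extract_cols(doc_text):
    """Return a list with one (cid, name, street, city) tuple per document."""
    root = ET.fromstring(doc_text)
    if root.tag != "customerinfo":
        return []  # the row-generating path matches nothing: no rows
    cid = root.get("Cid")
    return [(
        int(cid) if cid is not None else None,
        root.findtext("name"),         # None if the element is missing
        root.findtext("addr/street"),
        root.findtext("addr/city"),
    )]

print(extract_cols(SAMPLE))
```

Each tuple corresponds to one row that the INSERT ... SELECT statement would write into the address table.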

You can then include this table function in an INSERT-INTO-SELECT-FROM statement. The first INSERT statement in Figure 18.13 reads XML documents from the XML column info of the customer table and shreds them into the address table. The function extractcols takes the XML column info as input and produces relational rows for insert into the target table. The second INSERT statement in Figure 18.13 shreds an XML document that is provided by an application through the parameter marker in the FROM clause.

INSERT INTO address(cid, name, street, city)
  SELECT e.cid, e.name, e.street, e.city
  FROM customer c, TABLE(extractcols(c.info)) e
  WHERE c.cid < 1050;

INSERT INTO address(cid, name, street, city)
  SELECT e.cid, e.name, e.street, e.city
  FROM TABLE(extractcols(cast(? as XML))) e ;

Figure 18.13 Using a table function to shred XML documents

18.2.4 A UDF to Modify an XML Document

Chapter 12, Updating and Transforming XML Documents, describes XQuery Update expressions that allow you to change the value of an element or attribute, or to insert, rename, or delete elements and attributes in a document. It can be convenient to encapsulate such update expressions in a user-defined function, which then serves as a much simpler update interface for database applications.


Using the customer documents in the sample database as an example, suppose you want to simplify the task of updating a selected phone element in a document. You could code the UDF in Figure 18.14, which has the following input parameters:

• doc: the XML document that is to be updated
• phonetype: a string such as “cell” or “work” to indicate which phone is to be updated
• number: the new telephone number

The function returns the input document where the phone element with the matching type attribute has been given the new value.

CREATE FUNCTION updatephone(doc XML, phonetype VARCHAR(8), number VARCHAR(12))
RETURNS XML
BEGIN ATOMIC
  RETURN XMLQUERY('
    copy $new := $p1
    modify do replace value of $new/customerinfo/phone[@type=$p2] with $p3
    return $new'
    PASSING doc AS "p1", phonetype as "p2", number as "p3");
END #

Figure 18.14 Scalar UDF to modify an XML document

If an application wants to change the work phone number of customer 1002 to the new value 408-463-4963, it can simply issue the UPDATE statement in Figure 18.15 and does not need to be concerned with the details of the underlying XQuery Update expression.

UPDATE customer
SET info = updatephone(info, 'work', '408-463-4963')
WHERE cid = 1002

Figure 18.15 UPDATE statement with a scalar UDF
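As a rough analogue outside DB2, the copy-modify-return pattern of updatephone can be sketched in Python. This is an assumption-laden illustration, not the XQuery Update implementation: the helper name update_phone and the sample document are hypothetical, and the error raised here merely mirrors the requirement that the target path of "replace value of" select exactly one node.

```python
import xml.etree.ElementTree as ET

def update_phone(doc_text, phonetype, number):
    """Return a modified copy of the document; the input text is unchanged."""
    root = ET.fromstring(doc_text)  # parsing gives us a fresh copy to modify
    targets = [p for p in root.findall("phone") if p.get("type") == phonetype]
    if len(targets) != 1:
        # Mirrors the exactly-one-node requirement of "replace value of".
        raise ValueError(
            f"expected exactly one {phonetype!r} phone, found {len(targets)}"
        )
    targets[0].text = number
    return ET.tostring(root, encoding="unicode")

# Hypothetical document with invented phone numbers.
doc = (
    '<customerinfo Cid="1002">'
    '<phone type="work">905-555-0000</phone>'
    '<phone type="home">416-555-0000</phone>'
    '</customerinfo>'
)
print(update_phone(doc, "work", "408-463-4963"))
```

Calling update_phone(doc, "cell", ...) raises an error in this sketch because the sample document has no phone of that type.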

Remember that the update expression “replace value of” fails if the target path ($new/customerinfo/phone[@type=$p2]) does not produce exactly one node. In other words, the invocation of the UDF in Figure 18.15 leads to an error if the document for customer 1002 does not contain a phone element whose type attribute has the value work. Therefore you might want to perform an “upsert” operation (update or insert). An “upsert” operation updates the phone element if it exists and inserts a new phone element otherwise. This logic is coded in the UDF in Figure 18.16 with an XQuery if-then-else expression. The else branch constructs a new phone element with a type attribute, and the variables $p2 and $p3 provide the values for this attribute and element, respectively. Within such attribute and element constructors the variables $p2 and $p3 have to be in curly brackets.

CREATE FUNCTION upsert_phone(doc XML, phonetype VARCHAR(8), number VARCHAR(12))
RETURNS XML
BEGIN ATOMIC
  RETURN XMLQUERY('
    copy $new := $p1
    modify
      if ($new/customerinfo/phone[@type = $p2])
      then do replace value of $new/customerinfo/phone[@type = $p2] with $p3
      else do insert <phone type="{$p2}">{$p3}</phone>
              as last into $new/customerinfo
    return $new'
    PASSING doc AS "p1", phonetype as "p2", number as "p3");
END #

Figure 18.16 Scalar UDF to update or insert an XML element (“upsert”)

18.3 MANIPULATING XML DATA WITH TRIGGERS

A trigger defines a set of operations that are performed in response to an INSERT, UPDATE, or DELETE statement on a specified table. For example, a trigger can perform updates to other tables, automatically generate or change values for inserted or updated rows, or invoke functions and stored procedures.

When an INSERT, UPDATE, or DELETE statement activates a trigger, the operations that are executed by the trigger can reference the column values of the rows that are being inserted, updated, or deleted. So-called transition variables allow you to reference the new column values provided in INSERT and UPDATE statements, or the old values that are removed by DELETE or UPDATE statements.

You can define triggers on tables with XML columns, and you can also define UPDATE triggers on individual XML columns in a table. In both DB2 for z/OS and DB2 for Linux, UNIX, and Windows, transition variables in triggers do not allow you to access the old or new value of an XML column. However, the transition variables do allow you to reference the old or new values of non-XML columns in the same row, such as primary key values. Therefore, triggers can still be used for effective XML manipulation, as you will see in the examples in this section.

DB2 for Linux, UNIX, and Windows has one exception: the new value of an XML column can be referenced as a transition variable in the XMLVALIDATE function to trigger the validation of a document that is being inserted or updated. Such a validation trigger was shown in section 17.5, Automatic Validation with Triggers.

18.3.1 Insert Triggers on Tables with XML Columns

Let’s look at an example in which triggers maintain the hybrid storage of incoming XML data. Suppose you receive XML documents such as the customer documents in the sample database. For reasons explained in section 2.4, Using a Hybrid XML/Relational Approach, you might decide to store the full document in a column of type XML and to extract a few selected element values into relational columns. For example, you might want to use relational columns to store the customer name and city as well as the type and number of the customer phones. Figure 18.17 defines the appropriate target tables. Since a customer document can contain multiple phone elements, the phone information is stored in a separate table together with a join key.

CREATE TABLE cust(cust_id INTEGER NOT NULL PRIMARY KEY
                          GENERATED ALWAYS AS IDENTITY,
                  name    VARCHAR(30),
                  city    VARCHAR(25),
                  info    XML )#

CREATE TABLE phones(cust_id INTEGER NOT NULL,
                    type    VARCHAR (5),
                    number  VARCHAR (15) )#

Figure 18.17 Tables for hybrid XML storage

Next you can define a trigger that automatically populates the relational columns in both tables whenever an XML document is inserted into the info column with an INSERT statement, such as the following:

INSERT INTO cust(info) VALUES(?)

An appropriate insert trigger is shown in Figure 18.18. The trigger is fired after a new row is inserted into the cust table but before the INSERT statement commits. The transition variable newrow can be used to reference the column values of the newly inserted row, except for the XML column. For example, newrow.cust_id identifies the generated primary key value of the inserted row. This primary key value allows subselects in the trigger to identify the newly inserted row in the table and to extract the desired element values from the new XML document in that row. Since the XML document cannot be accessed through the transition variable, the trigger accesses the document directly in the table based on the primary key that it finds in the transition variable.

The body of the trigger contains an UPDATE statement and an INSERT statement. The UPDATE statement populates the columns name and city in the newly inserted row. The INSERT statement adds rows to the phones table, one row for each phone element in the new document. These rows include the primary key cust_id of the cust table so that the relationship between phones and customers is properly maintained.


CREATE TRIGGER cust_insert
AFTER INSERT ON cust
REFERENCING NEW AS newrow
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  UPDATE cust
  SET (name, city) =
      (SELECT X.name, X.city
       FROM cust, XMLTABLE('$INFO/customerinfo'
                    COLUMNS
                      name VARCHAR(30) PATH 'name',
                      city VARCHAR(20) PATH 'addr/city') AS X
       WHERE cust.cust_id = newrow.cust_id )
  WHERE cust.cust_id = newrow.cust_id;

  INSERT INTO phones(cust_id, type, number)
    SELECT cust.cust_id, P.type, P.number
    FROM cust, XMLTABLE('$INFO/customerinfo/phone'
                 COLUMNS
                   type   VARCHAR(5)  PATH '@type',
                   number VARCHAR(15) PATH '.') AS P
    WHERE cust.cust_id = newrow.cust_id;
END#

Figure 18.18 Insert trigger

18.3.2 Delete Triggers on Tables with XML Columns

Let’s continue with the preceding example. In addition to the insert trigger you also need a delete trigger that removes the correct rows from the phones table whenever rows are deleted from the cust table. Figure 18.19 shows such a delete trigger. The transition variable oldrow provides access to the cust_id values of the rows deleted in the cust table. These values allow the trigger to delete the corresponding rows in the phones table that have the same cust_id value.

CREATE TRIGGER delete_cust
AFTER DELETE ON cust
REFERENCING OLD AS oldrow
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  DELETE FROM phones
  WHERE phones.cust_id = oldrow.cust_id;
END#

Figure 18.19 Delete trigger

18.3.3 Update Triggers on XML Columns

To complete our example, let’s examine the update trigger in Figure 18.20. It maintains the relational columns in the cust and phones tables whenever the info column in the cust table is updated. Note that an update of a customer document might have changed, added, or removed one or multiple phone elements. Thus, the only way to reliably update the phones table is to issue a DELETE followed by an INSERT statement. The UPDATE, DELETE, and INSERT statements in this trigger are the same as in the previous triggers.

CREATE TRIGGER update_cust
AFTER UPDATE OF info ON cust
REFERENCING NEW AS newrow
FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
  UPDATE cust
  SET (name, city) =
      (SELECT X.name, X.city
       FROM cust, XMLTABLE('$INFO/customerinfo'
                    COLUMNS
                      name VARCHAR(30) PATH 'name',
                      city VARCHAR(20) PATH 'addr/city') AS X
       WHERE cust.cust_id = newrow.cust_id )
  WHERE cust.cust_id = newrow.cust_id;

  DELETE FROM phones
  WHERE phones.cust_id = newrow.cust_id;

  INSERT INTO phones(cust_id, type, number)
    SELECT cust.cust_id, P.type, P.number
    FROM cust, XMLTABLE('$INFO/customerinfo/phone'
                 COLUMNS
                   type   VARCHAR(5)  PATH '@type',
                   number VARCHAR(15) PATH '.') AS P
    WHERE cust.cust_id = newrow.cust_id;
END#

Figure 18.20 Update trigger

18.4 SUMMARY

Stored procedures, user-defined functions (UDFs), and triggers are very powerful tools to customize or automate data processing steps for your specific application. DB2 for Linux, UNIX, and Windows allows you to create stored procedures and UDFs with input parameters, output parameters, and variables of type XML. Such procedures and functions can contain XQuery and SQL/XML statements to query and manipulate XML data.

The benefit of using the XML data type for parameters and variables is that DB2 keeps the XML data internally in the pureXML parsed tree format. This format enables stored procedures and UDFs to process XML much more efficiently than a textual XML representation in VARCHAR or CLOB parameters would allow. For example, a UDF can read and manipulate data from an XML column without XML parsing because the data stays in DB2’s internal XML storage format. If an application passes an XML document to a stored procedure via an XML type parameter, the document is parsed only once upon entry into the procedure. Any subsequent processing steps within the procedure do not require XML parsing. Hence, the XML data type support in stored procedures and UDFs is a significant performance benefit for any custom XML processing logic that you implement.

You can also define triggers on tables with XML columns to implement automated actions that are executed when XML documents are inserted, deleted, or updated. In a trigger, transition variables give you access to the relational values of the affected rows, but not to the old or new value of an affected XML column. In the body of a trigger you can use the relational primary key values of the affected rows to find and access the corresponding XML documents in the table and perform any required operation on them.

Stored procedures have proven very useful for encapsulating and hiding XML processing from application programs. This reduces application complexity and improves end-to-end performance because SQL/XML statements in DB2 procedures can perform many XML processing tasks more efficiently and with less code than application programs.


CHAPTER 19

Performing Full-Text Search

XML applications and data can often be classified in one of two ways: predominantly data-centric or predominantly document- or content-centric. For example, the processing of orders, sales, or trades is typically data-centric while the management of contracts, emails, or news articles is document-centric. Content-centric XML documents often contain significant amounts of free-flow text, including full sentences and paragraphs. Such full text is rare in data-centric XML, which tends to contain atomic data values such as names, dates, prices, quantities, or addresses. Therefore, full-text search is more commonly required for querying content-centric XML than data-centric XML documents.

There are also applications that exhibit characteristics of both data- and document-oriented XML processing. In fact, it is a particular strength of XML to serve as a single format for any combination of data and content. For example, plain text comments can be part of an order, or a description can be part of a product detail record. Wherever individual data items consist of more than one word, and whenever you need to search for substring matches, full-text search can be the right solution.

The following topics are discussed in this chapter:

• Overview of full-text search capabilities in DB2 (section 19.1)
• Sample table and documents used in this chapter (section 19.2)
• The DB2 Net Search Extender (sections 19.3 through 19.5)
• DB2 Text Search (section 19.6)
• Summary of text search administration commands (section 19.7)
• Comments on full-text search in DB2 for z/OS (section 19.8)


19.1 OVERVIEW OF TEXT SEARCH IN DB2

DB2 offers two technologies to perform full-text search. Both of them handle plain text, HTML and XML data, as well as document formats such as PDF and Microsoft Word.

• The DB2 Net Search Extender (NSE) has been providing powerful text search capabilities since DB2 8 for Linux, UNIX, and Windows. The Net Search Extender is XML aware and fully functional with the new XML column type in DB2 9 and higher. The DB2 Net Search Extender continues to provide reliable and mature text search in DB2 with proven scalability and performance.
• DB2 Text Search is new text search functionality that is based on the technology in the open source project Lucene. The same technology is also used in IBM OmniFind Text Search Server for DB2 z/OS (see section 19.8). DB2 Text Search first became available in DB2 9.5 for Linux, UNIX, and Windows, Fixpack 1. Its features and performance continue to be improved in subsequent releases. DB2 Text Search in DB2 9.5 is just the beginning of integrating OmniFind text search capabilities into DB2 on all platforms.

In a given DB2 database you can use either the DB2 Net Search Extender or DB2 Text Search, not both. The DB2 Net Search Extender and DB2 Text Search can coexist in the same database instance, but only one of them can be enabled for a given database. You will find that many DB2 Text Search features and most of its administration commands are identical or similar to those of the DB2 Net Search Extender.

The DB2 Net Search Extender and DB2 Text Search have several design principles in common:

• A table in which one or multiple columns are indexed for text search must have a primary key. The primary key values of the table are used in the text index to correlate text search results from the text index back to the rows in the table. Consequently, the finest granularity of text search results is a row (a document).
• When a text index is created, triggers and a staging table (also known as a log table) are also automatically created in DB2. Any insert, update, or delete on the indexed table fires a trigger that in turn writes corresponding information about the data changes into the staging table. The content of this staging table is read to update the text index, and is subsequently deleted.
• Text indexes are maintained asynchronously; that is, not in the context of the original insert, update, or delete statements. Updates of the text index are either explicitly invoked with an UPDATE INDEX command, or they happen regularly on a predefined schedule.

Table 19.1 summarizes the most important commonalities and differences between the DB2 Net Search Extender and DB2 Text Search as of DB2 Version 9.5 Fixpack 1.

Table 19.1 Comparing the DB2 Net Search Extender and DB2 Text Search

Feature                                                                      DB2 Net Search Extender  DB2 Text Search
---------------------------------------------------------------------------  -----------------------  -----------------------
Separate Text Search Install                                                 Yes                      No, part of DB2 install
DPF Support                                                                  Yes (on AIX)             No
Command line interface                                                       Yes                      Yes
Administration also through the DB2 Control Center                           Yes                      No
Administration also through stored procedures                                No                       Yes
DB2 Backup includes text index                                               No                       No
Asynchronous index updates                                                   Yes                      Yes
Synchronous index updates                                                    No                       No
Index updates: manual or scheduled                                           Both                     Both
Document models, to index only a subsection (part) of each XML document      Yes                      No
Multiple text indexes per column                                             Yes                      No
Indexes on views and nicknames                                               Yes                      No
Stop words (avoid indexing irrelevant words, such as "a", "or", and "the")   Yes, optional            No
SQL function: contains                                                       Yes                      Yes
XQuery function: db2-fn:xmlcolumn-contains                                   No                       Yes
Support for XML namespaces                                                   Limited                  No
Can limit the result set size                                                Yes                      Yes
Boolean search (and, or, and not operators for text predicates)              Yes (and: &, or: |)      Yes (and: &&, or: ||)
Wildcards in search predicates                                               Yes                      Yes
Search with escape characters                                                Yes                      Yes
Stemming (reduces search word to its base form)                              Yes, optional            Yes, implicitly
Synonym search (Thesaurus)                                                   Yes                      Yes
Weighted search                                                              Yes                      Yes
Fuzzy search                                                                 Yes                      No
Proximity search                                                             Yes                      No
Ranking/scoring of result set items                                          Yes                      Yes
Case-sensitive search                                                        Yes                      No
Linguistic processing (search for linguistic variations of the search term)  English only             All supported languages

19.2 SAMPLE TABLE AND DATA

In the remainder of this chapter we use the following sample table and data to illustrate the text search capabilities in DB2 (see Figure 19.1). You will see that it does not take magic to perform efficient XML full-text search in DB2.

CREATE TABLE orders (id INTEGER NOT NULL PRIMARY KEY, doc XML)

id  doc
--  ---------------------------------------------------------------------------
1   Wendy Witch
    Crystal Ball, Deluxe Edition   5   95.00   Customer requested extra wrapping.
    Magic Potion, 300ml flask      10  19.95   Await further shipping instructions.
2   William Wizard
    Magician's Hat, Black          1   75.00   Must be big enough for the rabbit.
    White Rabbit                   1   295.00  Extra soft fur and extra white.

Figure 19.1 Sample table and data


Note that the second document contains a single quote in the name of the first item. This quote is not a problem if you import or load the document, or insert with a parameter marker. However, if you execute an insert statement in the DB2 Command Line Processor (CLP) with a literal XML document in the statement, a single quote in an XML value conflicts with the single quotes that enclose the document string. Hence, the first of the three insert statements in Figure 19.2 fails. You can escape the single quote either by using two single quotes or by using the corresponding entity reference (&apos;).

--incorrect:
INSERT INTO orders VALUES(1, 'Magician's Hat');

--correct:
INSERT INTO orders VALUES(2, 'Magician''s Hat');
INSERT INTO orders VALUES(3, 'Magician&apos;s Hat');

Figure 19.2 Inserting XML data with quotes in the CLP
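If an application has to build such a literal itself, the two escaping options can be sketched in a few lines of Python. This is only an illustration: the helper names sql_literal and xml_escape_apos are hypothetical, and using a parameter marker, as mentioned above, avoids the problem altogether.

```python
from xml.sax.saxutils import escape

def sql_literal(text):
    """Quote text as an SQL string literal by doubling each single quote."""
    return "'" + text.replace("'", "''") + "'"

def xml_escape_apos(text):
    """Escape &, <, > and additionally replace ' with the &apos; entity."""
    return escape(text, {"'": "&apos;"})

name = "Magician's Hat"
print(sql_literal(name))                   # 'Magician''s Hat'
print(sql_literal(xml_escape_apos(name)))  # 'Magician&apos;s Hat'
```

The first variant escapes at the SQL level, the second at the XML level; either way the literal no longer contains an unescaped single quote.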

19.3 ENABLING A DATABASE FOR THE DB2 NET SEARCH EXTENDER

The DB2 Net Search Extender (NSE) requires a separate install in addition to the regular DB2 install. Appendix C, Further Reading, contains links to information about downloading and installing the NSE. After installation you can start and stop the Net Search Extender instance services much like you start and stop a DB2 server. You have to be the DB2 instance owner to issue the following commands at the OS prompt:

db2text start
db2text stop [force]

The optional keyword force can be used to forcibly stop the NSE even if there are processes still holding locks or if caching for an index is still activated. Be careful with the use of the force option. If you perform db2text stop force while an index update or reorg is in progress, the text index may get damaged and might have to be rebuilt entirely.

After starting the DB2 Net Search Extender instance services, the first step is to enable a database for text search. Execute the following command at the OS prompt to enable the database for text search:

db2text ENABLE DATABASE FOR TEXT CONNECT TO

As for the majority of the db2text commands, you can optionally provide a user name and password for authentication to the database:

db2text ENABLE DATABASE FOR TEXT CONNECT TO USER USING


The ENABLE DATABASE command creates UDFs, stored procedures, and the following tables and views in the default table space of the database:

• db2ext.dbdefaults: Contains default values for text search configuration parameters
• db2ext.textindexformats: Stores the list of supported index formats and the currently used document models
• db2ext.indexconfiguration: Stores index configuration parameters
• db2ext.textindexes: Keeps track of all text indexes

Similarly, you can disable the DB2 Net Search Extender for a database with the following command, which removes the NSE tables, views, and UDFs, and drops all NSE indexes for that database.

db2text DISABLE DATABASE FOR TEXT [force] CONNECT TO USER USING

19.4 MANAGING FULL-TEXT INDEXES WITH THE DB2 NET SEARCH EXTENDER

The DB2 Net Search Extender allows you to define one or multiple text indexes per column. It also allows you to index only a certain section of each document instead of indexing all elements and attributes in a document. Such partial indexing leads to fewer index entries per document, smaller text indexes, and better index update and search performance. The following sections illustrate the CREATE INDEX command and its various options for the DB2 Net Search Extender.

19.4.1 Creating Basic Text Indexes

Issued at the OS command prompt, the following command creates a text index with the name orderIdx on the column doc in the table orders in the database:

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc) CONNECT TO USER USING "

Depending on the operating system and configuration of your command shell, enclosing the command parameter for db2text in double quotes might be necessary, as shown in this example. Specifying a user name and a password for authentication to the database is optional. The table orders must have a primary key; otherwise, a text index cannot be created. The column doc must be of type XML or any character or binary column type, such as CHAR, VARCHAR, CLOB, BLOB, DBCLOB, GRAPHIC, or VARCHAR FOR BIT DATA.

Unlike relational indexes in DB2, the CREATE INDEX statement for a text index defines an index but does not actually build the text index. An UPDATE INDEX command is required after the CREATE INDEX statement to perform the initial index build (see section 19.4.6).


For each text index, the Net Search Extender creates a log table and an event table as well as triggers on the user table. Upon insert, delete, update, or import of data, the triggers fire and write change information into the log table, which is later used to update the index. The event table contains information about index updates and potential problems, such as invalid document formats.

If you use the DB2 LOAD utility to move documents into your table, the triggers don’t fire and incremental indexing of the loaded documents does not happen. Therefore, it is recommended to use the DB2 IMPORT utility, which activates the triggers. If you insist on using LOAD for performance reasons, then it is your own responsibility to fill the log table appropriately before issuing the next UPDATE INDEX command.

The names of the log table and event table are system-generated. DB2 also creates views on these tables to allow easy inspection of the information. Use the SQL statement in Figure 19.3 to obtain the schema and view names for the index called orderIdx.

SELECT eventviewschema, eventviewname, logviewschema, logviewname
FROM db2ext.textindexes
WHERE indname = 'ORDERIDX'

Figure 19.3 Obtaining names of the event and log views for a given text index

19.4.2 Creating Text Indexes with Specific Storage Paths

The previous examples used default locations for the text index and the index building work area. The work area is used to hold temporary files that are created when text indexes are built or updated. The default locations are defined in the table DB2EXT.DBDEFAULTS and are typically in /sqllib/db2ext/indexes. This default location is often not a good place for large text indexes. The command in Figure 19.4 specifies that the index is created in the file system /data/index while temporary NSE files are written to /data/temp. Additionally, the log and event tables are placed in the table space named nse_tspace instead of the default user table space.

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         INDEX DIRECTORY /data/index
         WORK DIRECTORY /data/temp
         ADMINISTRATION TABLES IN nse_tspace
         CONNECT TO "

Figure 19.4 Text index with non-default storage locations

The DB2 instance owner needs to have read, write, and execute permissions for the index and the work directory. In a DPF system these directories have to exist on every physical node. For best performance, the index and work directories should be allocated on RAID arrays that allow high I/O throughput.


PERFORMANCE TIP When a text index is created or updated, potentially large amounts of data might have to be moved from the work directory to the index directory. If the index directory and the work directory are located in different file systems, then this move is an expensive copy operation. If the index and work directory are located within the same file system, an inexpensive rename operation can be performed instead of a copy. Hence, for best performance it is highly recommended that the index and work directory share the same file system.
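One way to check this before creating an index is to compare the device IDs that the operating system reports for the two directories. The following Python sketch assumes a POSIX-style system where equal st_dev values indicate the same file system; the helper name same_file_system is hypothetical, and the temporary directories merely stand in for real locations such as /data/index and /data/temp.

```python
import os
import tempfile

def same_file_system(path_a, path_b):
    """Return True if both existing paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Demo: two sibling directories stand in for the index and work directory.
with tempfile.TemporaryDirectory() as base:
    index_dir = os.path.join(base, "index")
    work_dir = os.path.join(base, "temp")
    os.mkdir(index_dir)
    os.mkdir(work_dir)
    print(same_file_system(index_dir, work_dir))
```

If the function returns False for your intended directories, moving data from the work directory to the index directory would require a copy rather than a rename.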

The disk space required for an index depends on the amount and type of data that is being indexed and on the length of the primary key in the user table. Since the primary key is part of the index, short keys such as INTEGER or TIMESTAMP are preferable to long keys, such as CHAR(128). As a rule of thumb you should reserve at least 0.7 times as much space for the text index as the size of the data volume you want to index. The work area can require two to three times as much space as the raw data.
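The rule of thumb translates into simple arithmetic. The sketch below applies the factors stated above (0.7 for the index, 2 to 3 for the work area) to a hypothetical data volume of 100 GiB; the function name and the example size are assumptions, and actual requirements depend on your data and key length.

```python
def estimate_nse_space(data_bytes, index_factor=0.7, work_factors=(2.0, 3.0)):
    """Return (index_min, work_low, work_high) in bytes for a data volume."""
    index_min = data_bytes * index_factor
    work_low = data_bytes * work_factors[0]
    work_high = data_bytes * work_factors[1]
    return index_min, work_low, work_high

GIB = 1024 ** 3
index_min, work_low, work_high = estimate_nse_space(100 * GIB)
print(f"index: at least {index_min / GIB:.0f} GiB")
print(f"work area: {work_low / GIB:.0f} to {work_high / GIB:.0f} GiB")
```

For 100 GiB of raw data this suggests reserving at least about 70 GiB for the index and 200 to 300 GiB for the work area.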

19.4.3 Creating Text Indexes with a Periodic Update Schedule

By default a text index is not updated automatically. You have to use the explicit UPDATE INDEX command whenever you want to refresh the text index, or configure the index for regularly scheduled index updates. The CREATE INDEX statement in Figure 19.5 defines a text index that is automatically refreshed four times a day. The string D(*)H(0,6,12,18)M(30) means that the index is updated every day at 0:30, 6:30, 12:30, and 18:30 hours.

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         UPDATE FREQUENCY D(*)H(0,6,12,18)M(30)
         CONNECT TO "

Figure 19.5 Text index with automatic periodic updates

Alternatively, the string D(1,2,3,4,5)H(*)M(0,15,30,45) would mean that the index gets updated Monday through Friday every 15 minutes. You will see later that there is also an ALTER INDEX command in which you can use the UPDATE FREQUENCY clause to define or change automatic updates for existing indexes.

NOTE System load considerations and the time it takes for an index update to finish should be the guiding factors for choosing an appropriate update interval that is not too short. An update interval of one minute is almost always the wrong choice.
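To double-check what a given UPDATE FREQUENCY string means, it can help to expand it into concrete times. The following Python sketch parses the D(...)H(...)M(...) notation used in this section; it is a reading aid under stated assumptions, not the parser that the Net Search Extender uses. The function names are hypothetical, and the sketch accepts * in any field without claiming that the NSE does.

```python
import re

FIELD = re.compile(r"([DHM])\(([^)]*)\)")

def parse_update_frequency(spec):
    """Map 'D', 'H', 'M' to '*' or a sorted list of integers."""
    fields = {}
    for key, body in FIELD.findall(spec.replace(" ", "")):
        if body == "*":
            fields[key] = "*"
        else:
            fields[key] = sorted(int(v) for v in body.split(","))
    missing = {"D", "H", "M"} - set(fields)
    if missing:
        raise ValueError(f"missing field(s): {sorted(missing)}")
    return fields

def daily_times(spec):
    """List the H:MM times at which an update is due on a scheduled day."""
    f = parse_update_frequency(spec)
    hours = range(24) if f["H"] == "*" else f["H"]
    minutes = range(60) if f["M"] == "*" else f["M"]
    return [f"{h}:{m:02d}" for h in hours for m in minutes]

print(daily_times("D(*)H(0,6,12,18)M(30)"))
print(len(daily_times("D(1,2,3,4,5)H(*)M(0,15,30,45)")), "updates per scheduled day")
```

The first call lists the four times named in the text; the second shows that a 15-minute schedule amounts to 96 checks on each scheduled day, which is worth weighing against the duration of a single index update.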


Depending on your application, you might want to avoid index maintenance at the scheduled times if there was only an insignificant number of changes to your data since the last time the index was updated. Figure 19.6 creates an index that is updated every 30 minutes if there are at least 50 document changes queued up in the log table. If there are fewer than 50 changes in the log table, the index is not updated. After 30 minutes, the scheduler checks again whether 50 or more changes have accumulated.

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         UPDATE FREQUENCY D(*)H(*)M(0, 30) UPDATE MINIMUM 50
         CONNECT TO "

Figure 19.6 Text index with automatic updates when “enough” new rows are available
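The interplay of schedule and threshold can be illustrated with a small simulation. This Python sketch models only the decision described above (update when the number of pending changes reaches the minimum, otherwise wait for the next check); the function name and the workload numbers are hypothetical.

```python
def run_schedule(changes_per_tick, update_minimum=1):
    """Simulate scheduled checks against a log table.

    changes_per_tick: document changes logged before each scheduled check.
    Returns (ticks at which the index was updated, changes still pending).
    """
    pending = 0
    updated_at = []
    for tick, new_changes in enumerate(changes_per_tick):
        pending += new_changes
        if pending >= update_minimum:
            updated_at.append(tick)  # the update consumes the logged changes
            pending = 0
    return updated_at, pending

# Hypothetical workload: changes logged in each 30-minute interval.
print(run_schedule([10, 20, 25, 5, 60], update_minimum=50))
```

With this workload the index is updated at the third and fifth check only, because the pending changes first reach 50 after 10 + 20 + 25 = 55 changes and again after 5 + 60 = 65.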

Such a combination of UPDATE FREQUENCY and UPDATE MINIMUM allows you to define an index update schedule in which the index is updated more frequently when there are many changes in the base table and less frequently if there are fewer changes. If omitted, the default value for UPDATE MINIMUM is 1.

Instead of updating the index incrementally you can also choose to always re-create the index from scratch. Figure 19.7 defines an index that is re-created entirely every night at 2 a.m.

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc)
         UPDATE FREQUENCY D(*)H(2)M(0)
         RECREATE INDEX ON UPDATE
         CONNECT TO "

Figure 19.7 Text index with automatic re-create

If you define an index with the RECREATE option, no log table and no triggers are created for this index. Use this option with caution as rebuilding a large text index can take a long time.

Note that the DB2 Control Center allows you to administer the DB2 Net Search Extender and to configure the update behavior of text indexes. When you right-click on a database name you are presented with the option to enable the database for text search. A right-click on the index folder of a database lets you create regular relational indexes as well as text indexes. A multi-step wizard guides you through the text index definition and allows you to change default parameters such as index location and update characteristics. Figure 19.8 illustrates step 4 of the Create Text Index Wizard, where you can set the frequency of automatic updates. The settings selected in Figure 19.8 result in a CREATE INDEX statement with the clause UPDATE FREQUENCY D(1) H(3) M(30).

Figure 19.8 Create Text Index Wizard in the DB2 Control Center

19.4.4 Creating Text Indexes for Specific Parts of Each Document

When you define a text index on an XML column, the DB2 Net Search Extender creates index entries for all XML elements and attributes in the XML documents in the column. However, indexing all parts of the documents is not always necessary. Let's look at the sample document in Figure 19.1. If you manage many "order" documents of this nature, you might want to perform full-text search only on item names and comments. In that case, creating a full-text index on just these elements is sufficient and leads to a much smaller index than indexing all elements and attributes. A smaller index often allows better update and search performance. If you also need to perform queries with predicates on short data values (such as order date, customer name, item key, quantity, and price), you should use regular XML indexes for them.

With the Net Search Extender you can use document models to control which parts of the document structure are indexed and which are not, and by which name you can refer to these parts in search queries. A document model itself is a small XML document in the file system. This model file is passed as a parameter to the CREATE INDEX command and is read during index creation only. Later changes to the document model do not affect existing indexes.


Figure 19.9 shows a simple document model for documents like the ones in Figure 19.1. This document model declares that only item names and comments are indexed. Every XML document model starts with the element XMLModel, which includes one or more XMLFieldDefinition elements. Each XMLFieldDefinition assigns a name to a locator. The locator is a simple XPath expression that defines which elements, attributes, or subtrees to index. The locator can contain XPath wildcards (*), namespace prefixes, the XPath union operator (|), and the XPath descendant-or-self axis, which is also known as the "double slash" (//).

Figure 19.9

A simple document model
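The structure just described (an XMLModel root element containing XMLFieldDefinition elements that each pair a name with a locator) can be sketched as a small Python helper that generates such a model file. This is a hedged illustration, not a reproduction of Figure 19.9: the field names `itemname` and `comment` are hypothetical, the locators assume that item names and comments live under /order/item, and the attribute names `name` and `locator` are assumed from the description above.

```python
import xml.etree.ElementTree as ET

def build_document_model(fields):
    """Build a document model as an XML string.

    `fields` is a list of (name, locator) pairs. The element names come
    from the text; the attribute names are an assumption of this sketch.
    """
    root = ET.Element("XMLModel")
    for name, locator in fields:
        ET.SubElement(root, "XMLFieldDefinition", name=name, locator=locator)
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # Hypothetical field names for item names and comments
    model = build_document_model([
        ("itemname", "/order/item/name"),
        ("comment", "/order/item/comment"),
    ])
    print(model)
```

Saving the printed output to a file gives you a model file that can be passed to the CREATE INDEX command. Check the exact syntax against the Net Search Extender documentation before relying on it.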

If the document model is stored in the file itemModel.xml, then the following command defines a full-text index for item names and comments:

db2text "CREATE INDEX orderIdx FOR TEXT ON orders(doc) FORMAT XML DOCUMENTMODEL XMLModel IN itemModel.xml CONNECT TO "

Note that you might have to specify a full file system path to the model file. The document model in Figure 19.10 declares that all elements under /order/item are indexed, except for the item's quantity and price, which are explicitly excluded. Depending on the actual data in the XML column, and on the existence of other elements under /order/item, this document model can index more information than the previous one in Figure 19.9. However, for the sample documents in Figure 19.1, both document models index exactly the item name and comment. We will later use these document models in text search queries.
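The difference between the two document models described above can be made concrete with a short Python sketch that evaluates both selection strategies against a sample document. Everything here is illustrative: the sample order is a hypothetical stand-in for Figure 19.1, the models are plain (locator, exclude) pairs rather than real model files, and the rule that an exclusion overrides a wildcard inclusion is an assumption based on the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; not the actual Figure 19.1
SAMPLE_ORDER = """
<order>
  <customer>Hypothetical Customer</customer>
  <item>
    <name>Ice scraper</name>
    <quantity>2</quantity>
    <price>3.99</price>
    <comment>Please gift-wrap</comment>
  </item>
</order>
"""

# (locator, exclude) pairs standing in for the two document models
MODEL_INCLUDE = [("/order/item/name", False), ("/order/item/comment", False)]
MODEL_EXCLUDE = [("/order/item/*", False),
                 ("/order/item/quantity", True),
                 ("/order/item/price", True)]

def select(root, locator):
    """Evaluate a simple absolute path such as /order/item/* against root."""
    steps = locator.strip("/").split("/")
    if steps[0] not in (root.tag, "*"):
        return []
    if len(steps) == 1:
        return [root]
    # ElementTree paths are relative to the root element
    return root.findall("./" + "/".join(steps[1:]))

def indexed_tags(root, model):
    """Return the sorted tags of elements a model would select for indexing."""
    included, excluded = {}, set()
    for locator, exclude in model:
        for node in select(root, locator):
            if exclude:
                excluded.add(id(node))
            else:
                included[id(node)] = node
    return sorted(n.tag for key, n in included.items() if key not in excluded)

if __name__ == "__main__":
    order = ET.fromstring(SAMPLE_ORDER)
    print(indexed_tags(order, MODEL_INCLUDE))
    print(indexed_tags(order, MODEL_EXCLUDE))
```

For this sample both models select the same elements. If an item contained an additional child element, only the wildcard-based model would pick it up, which is why that model can index more information depending on the data.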