Smart Business Intelligence Solutions with Microsoft® SQL Server® 2008

Lynn Langit, Kevin S. Goff, Davide Mauri, Sahil Malik, and John Welch

Foreword by Donald Farmer, Principal Program Manager, US-SQL Analysis Services, Microsoft Corporation
PUBLISHED BY
Microsoft Press
A Division of Microsoft Corporation
One Microsoft Way
Redmond, Washington 98052-6399

Copyright © 2009 by Kevin Goff and Lynn Langit

All rights reserved. No part of the contents of this book may be reproduced or transmitted in any form or by any means without the written permission of the publisher.

Library of Congress Control Number: 2008940532

Printed and bound in the United States of America.

1 2 3 4 5 6 7 8 9 QWT 4 3 2 1 0 9

Distributed in Canada by H.B. Fenn and Company Ltd. A CIP catalogue record for this book is available from the British Library.

Microsoft Press books are available through booksellers and distributors worldwide. For further information about international editions, contact your local Microsoft Corporation office or contact Microsoft Press International directly at fax (425) 936-7329. Visit our Web site at www.microsoft.com/mspress. Send comments to [email protected].

Microsoft, Microsoft Press, Access, Active Directory, ActiveX, BizTalk, Excel, Hyper-V, IntelliSense, Microsoft Dynamics, MS, MSDN, PerformancePoint, PivotChart, PivotTable, PowerPoint, ProClarity, SharePoint, Silverlight, SQL Server, Visio, Visual Basic, Visual C#, Visual SourceSafe, Visual Studio, Win32, Windows, Windows PowerShell, Windows Server, and Windows Vista are either registered trademarks or trademarks of the Microsoft group of companies. Other product and company names mentioned herein may be the trademarks of their respective owners.

The example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious. No association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred.

This book expresses the author's views and opinions. The information contained in this book is provided without any express, statutory, or implied warranties. Neither the authors, Microsoft Corporation, nor its resellers, or distributors will be held liable for any damages caused or alleged to be caused either directly or indirectly by this book.

Acquisitions Editor: Ken Jones
Developmental Editor: Sally Stickney
Project Editor: Maureen Zimmerman
Editorial Production: Publishing.com
Technical Reviewer: John Welch; Technical Review services provided by Content Master, a member of CM Group, Ltd.
Cover: Tom Draper Design
Body Part No. X15-12284
For Mahnaz Javid and for her work with the Mona Foundation, which she leads
—Lynn Langit, author
Contents at a Glance

Part I   Business Intelligence for Business Decision Makers and Architects
  1   Business Intelligence Basics . . . 3
  2   Visualizing Business Intelligence Results . . . 27
  3   Building Effective Business Intelligence Processes . . . 61
  4   Physical Architecture in Business Intelligence Solutions . . . 85
  5   Logical OLAP Design Concepts for Architects . . . 115

Part II   Microsoft SQL Server 2008 Analysis Services for Developers
  6   Understanding SSAS in SSMS and SQL Server Profiler . . . 153
  7   Designing OLAP Cubes Using BIDS . . . 183
  8   Refining Cubes and Dimensions . . . 225
  9   Processing Cubes and Dimensions . . . 257
  10  Introduction to MDX . . . 293
  11  Advanced MDX . . . 329
  12  Understanding Data Mining Structures . . . 355
  13  Implementing Data Mining Structures . . . 399

Part III   Microsoft SQL Server 2008 Integration Services for Developers
  14  Architectural Components of Microsoft SQL Server 2008 Integration Services . . . 435
  15  Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio . . . 463
  16  Advanced Features in Microsoft SQL Server 2008 Integration Services . . . 497
  17  Microsoft SQL Server 2008 Integration Services Packages in Business Intelligence Solutions . . . 515
  18  Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services . . . 539
  19  Extending and Integrating SQL Server 2008 Integration Services . . . 567

Part IV   Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence
  20  Creating Reports in SQL Server 2008 Reporting Services . . . 603
  21  Building Reports for SQL Server 2008 Reporting Services . . . 627
  22  Advanced SQL Server 2008 Reporting Services . . . 647
  23  Using Microsoft Excel 2007 as an OLAP Cube Client . . . 671
  24  Microsoft Office 2007 as a Data Mining Client . . . 687
  25  SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007 . . . 723
Table of Contents

Foreword . . . xix
Acknowledgments . . . xxi
Introduction . . . xxiii
Part I   Business Intelligence for Business Decision Makers and Architects

1  Business Intelligence Basics . . . 3
   Business Intelligence and Data Modeling . . . 3
   OLTP and OLAP . . . 5
   Online Transactional Processing . . . 5
   Online Analytical Processing . . . 8
   Common BI Terminology . . . 11
   Data Warehouses . . . 11
   Data Marts . . . 13
   Cubes . . . 13
   Decision Support Systems . . . 13
   Data Mining Systems . . . 14
   Extract, Transform, and Load Systems . . . 14
   Report Processing Systems . . . 15
   Key Performance Indicators . . . 15
   Core Components of a Microsoft BI Solution . . . 16
   SQL Server 2008 Analysis Services . . . 16
   SQL Server 2008 Reporting Services . . . 19
   SQL Server 2008 . . . 19
   SQL Server 2008 Integration Services . . . 20
   Optional Components of a Microsoft BI Solution . . . 21
   Query Languages Used in BI Solutions . . . 23
   MDX . . . 23
   DMX . . . 24
   XMLA . . . 24
   RDL . . . 24
   Summary . . . 25

2  Visualizing Business Intelligence Results . . . 27
   Matching Business Cases to BI Solutions . . . 27
   Top 10 BI Scoping Questions . . . 30
   Components of BI Solutions . . . 31
   Understanding Business Intelligence from a User's Perspective . . . 34
   Demonstrating the Power of BI Using Excel 2007 . . . 36
   Understanding Data Mining via the Excel Add-ins . . . 45
   Viewing Data Mining Structures Using Excel 2007 . . . 47
   Elements of a Complete BI Solution . . . 50
   Reporting—Deciding Who Will Use the Solution . . . 51
   ETL—Getting the Solution Implemented . . . 52
   Data Mining—Don't Leave It Out . . . 53
   Common Business Challenges and BI Solutions . . . 54
   Measuring the ROI of BI Solutions . . . 56
   Summary . . . 58

3  Building Effective Business Intelligence Processes . . . 61
   Software Development Life Cycle for BI Projects . . . 61
   Microsoft Solutions Framework . . . 62
   Microsoft Solutions Framework for Agile Software Development . . . 63
   Applying MSF to BI Projects . . . 65
   Phases and Deliverables in the Microsoft Solutions Framework . . . 65
   Skills Necessary for BI Projects . . . 72
   Required Skills . . . 72
   Optional Skills . . . 74
   Forming Your Team . . . 76
   Roles and Responsibilities Needed When Working with MSF . . . 76
   Summary . . . 84

4  Physical Architecture in Business Intelligence Solutions . . . 85
   Planning for Physical Infrastructure Change . . . 85
   Creating Accurate Baseline Surveys . . . 85
   Assessing Current Service Level Agreements . . . 87
   Determining the Optimal Number and Placement of Servers . . . 89
   Considerations for Physical Servers . . . 91
   Considerations for Logical Servers and Services . . . 92
   Understanding Security Requirements . . . 95
   Security Requirements for BI Solutions . . . 95
   Backup and Restore . . . 106
   Backing Up SSAS . . . 106
   Backing Up SSIS . . . 107
   Backing Up SSRS . . . 108
   Auditing and Compliance . . . 108
   Auditing Features in SQL Server 2008 . . . 111
   Source Control . . . 111
   Summary . . . 114

5  Logical OLAP Design Concepts for Architects . . . 115
   Designing Basic OLAP Cubes . . . 115
   Star Schemas . . . 116
   Denormalization . . . 125
   Back to the Star . . . 125
   Other Design Tips . . . 132
   Modeling Snowflake Dimensions . . . 133
   More About Dimensional Modeling . . . 138
   Understanding Fact (Measure) Modeling . . . 146
   Other Considerations in BI Modeling . . . 148
   Summary . . . 150

Part II   Microsoft SQL Server 2008 Analysis Services for Developers

6  Understanding SSAS in SSMS and SQL Server Profiler . . . 153
   Core Tools in SQL Server Analysis Services . . . 153
   Baseline Service Configuration . . . 157
   SSAS in SSMS . . . 160
   How Do You Query SSAS Objects? . . . 170
   Using MDX Templates . . . 175
   Using DMX Templates . . . 179
   Using XMLA Templates . . . 180
   Closing Thoughts on SSMS . . . 181
   Summary . . . 182

7  Designing OLAP Cubes Using BIDS . . . 183
   Using BIDS . . . 183
   Offline and Online Modes . . . 184
   Working in Solution Explorer . . . 186
   Data Sources in Analysis Services . . . 188
   Data Source Views . . . 190
   Roles in Analysis Services . . . 195
   Using Compiled Assemblies with Analysis Services Objects . . . 196
   Building OLAP Cubes in BIDS . . . 198
   Examining the Sample Cube in Adventure Works . . . 201
   Understanding Dimensions . . . 204
   Attribute Hierarchies . . . 206
   Attribute Relationships . . . 207
   Translations . . . 209
   Using Dimensions . . . 210
   Measure Groups . . . 211
   Beyond Star Dimensions . . . 215
   Building Your First OLAP Cube . . . 218
   Selecting Measure Groups . . . 219
   Adding Dimensions . . . 220
   Summary . . . 223

8  Refining Cubes and Dimensions . . . 225
   Refining Your First OLAP Cube . . . 225
   Translations and Perspectives . . . 225
   Key Performance Indicators . . . 228
   Actions . . . 233
   Calculations (MDX Scripts or Calculated Members) . . . 239
   Using Cube and Dimension Properties . . . 243
   Time Intelligence . . . 245
   SCOPE Keyword . . . 246
   Account Intelligence and Unary Operator Definitions . . . 246
   Other Wizard Options . . . 250
   Currency Conversions . . . 251
   Advanced Cube and Dimension Properties . . . 254
   Summary . . . 255

9  Processing Cubes and Dimensions . . . 257
   Building, Processing, and Deploying OLAP Cubes . . . 257
   Differentiating Data and Metadata . . . 258
   Working in a Disconnected Environment . . . 259
   Working in a Connected Environment . . . 261
   Understanding Aggregations . . . 261
   Partitioning . . . 263
   Choosing Storage Modes: MOLAP, HOLAP, and ROLAP . . . 267
   OLTP Table Partitioning . . . 268
   Other OLAP Partition Configurations . . . 270
   Implementing Aggregations . . . 270
   Aggregation Design Wizard . . . 271
   Usage-Based Optimization Wizard . . . 274
   SQL Server Profiler . . . 275
   Aggregation Designer: Advanced View . . . 277
   Implementing Advanced Storage with MOLAP, HOLAP, or ROLAP . . . 278
   Proactive Caching . . . 279
   Notification Settings for Proactive Caching . . . 282
   Fine-Tuning Proactive Caching . . . 283
   ROLAP Dimensions . . . 284
   Linking . . . 285
   Writeback . . . 285
   Cube and Dimension Processing Options . . . 287
   Summary . . . 292

10  Introduction to MDX . . . 293
   The Importance of MDX . . . 293
   Writing Your First MDX Queries . . . 295
   MDX Object Names . . . 296
   Other Elements of MDX Syntax . . . 296
   MDX Core Functions . . . 299
   Filtering MDX Result Sets . . . 306
   Calculated Members and Named Sets . . . 307
   Creating Objects by Using Scripts . . . 309
   The TopCount Function . . . 310
   Rank Function and Combinations . . . 312
   Head and Tail Functions . . . 315
   Hierarchical Functions in MDX . . . 316
   Date Functions . . . 321
   Using Aggregation with Date Functions . . . 324
   About Query Optimization . . . 326
   Summary . . . 327

11  Advanced MDX . . . 329
   Querying Dimension Properties . . . 329
   Looking at Date Dimensions and MDX Seasonality . . . 332
   Creating Permanent Calculated Members . . . 333
   Creating Permanent Calculated Members in BIDS . . . 334
   Creating Calculated Members Using MDX Scripts . . . 335
   Using IIf . . . 337
   About Named Sets . . . 338
   About Scripts . . . 341
   Understanding SOLVE_ORDER . . . 343
   Creating Key Performance Indicators . . . 345
   Creating KPIs Programmatically . . . 348
   Additional Tips on KPIs . . . 348
   Using MDX with SSRS and PerformancePoint Server . . . 349
   Using MDX with SSRS 2008 . . . 349
   Using MDX with PerformancePoint Server 2007 . . . 352
   Summary . . . 354

12  Understanding Data Mining Structures . . . 355
   Reviewing Business Scenarios . . . 355
   Categories of Data Mining Algorithms . . . 358
   Working in the BIDS Data Mining Interface . . . 360
   Understanding Data Types and Content Types . . . 361
   Setting Advanced Data Properties . . . 363
   Choosing a Data Mining Model . . . 365
   Picking the Best Mining Model Viewer . . . 368
   Mining Accuracy Charts and Prediction . . . 373
   Data Mining Algorithms . . . 376
   Microsoft Naïve Bayes . . . 376
   Microsoft Decision Trees Algorithm . . . 381
   Microsoft Linear Regression Algorithm . . . 383
   Microsoft Time Series Algorithm . . . 383
   Microsoft Clustering Algorithm . . . 386
   Microsoft Sequence Clustering . . . 389
   Microsoft Association Algorithm . . . 391
   Microsoft Neural Network Algorithm . . . 394
   Microsoft Logistic Regression . . . 395
   The Art of Data Mining . . . 396
   Summary . . . 396

13  Implementing Data Mining Structures . . . 399
   Implementing the CRISP-DM Life Cycle Model . . . 399
   Building Data Mining Structures Using BIDS . . . 401
   Adding Data Mining Models Using BIDS . . . 404
   Processing Mining Models . . . 407
   Validating Mining Models . . . 409
   Lift Charts . . . 410
   Profit Charts . . . 413
   Classification Matrix . . . 415
   Cross Validation . . . 417
   Data Mining Prediction Queries . . . 419
   DMX Prediction Queries . . . 421
   DMX Prediction Functions . . . 423
   Data Mining and Integration Services . . . 426
   Data Mining Object Processing . . . 429
   Data Mining Clients . . . 431
   Summary . . . 432

Part III   Microsoft SQL Server 2008 Integration Services for Developers

14  Architectural Components of Microsoft SQL Server 2008 Integration Services . . . 435
   Overview of Integration Services Architecture . . . 436
   Integration Services Packages . . . 438
   Tools and Utilities for Developing, Deploying, and Executing Integration Services Packages . . . 438
   The Integration Services Object Model and Components . . . 442
   Control Flow . . . 442
   Data Flow . . . 444
   Variables . . . 445
   Expressions . . . 447
   Connection Managers . . . 448
   Event Handlers and Error Handling . . . 450
   The Integration Services Runtime . . . 452
   The Integration Services Data Flow Engine . . . 453
   Data Flow Buffers . . . 454
   Synchronous Data Flow Outputs . . . 458
   Asynchronous Data Flow Outputs . . . 459
   Log Providers . . . 459
   Deploying Integration Services Packages . . . 460
   Package Configurations . . . 461
   Package Deployment Options . . . 461
   Summary . . . 462

15  Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio . . . 463
   Integration Services in Visual Studio 2008 . . . 463
   Creating New SSIS Projects with the Integration Services Project Template . . . 464
   Viewing an SSIS Project in Solution Explorer . . . 466
   Using the SSIS Package Designers . . . 467
   Working with the SSIS Toolbox . . . 469
   Choosing from the SSIS Menu . . . 472
   Connection Managers . . . 473
   Standard Database Connection Managers . . . 473
   Other Types of Connection Managers . . . 474
   Control Flow . . . 474
   Control Flow Tasks . . . 476
   Control Flow Containers . . . 478
   Precedence Constraints . . . 480
   Data Flow . . . 482
   Data Flow Source Components . . . 483
   Destination Components . . . 485
   Transformation Components . . . 486
   Integration Services Data Viewers . . . 488
   Variables . . . 490
   Variables Window . . . 490
   Variable Properties . . . 491
   System Variables . . . 493
   Expressions . . . 493
   Variables and Default Values Within a Package . . . 494
   Summary . . . 495

16  Advanced Features in Microsoft SQL Server 2008 Integration Services . . . 497
   Error Handling in Integration Services . . . 497
   Events, Logs, Debugging, and Transactions in SSIS . . . 499
   Logging and Events . . . 501
   Debugging Integration Services Packages . . . 505
   Checkpoints and Transactions . . . 506
   Configuring Package Transactions . . . 507
   Best Practices for Designing Integration Services Packages . . . 509
   Data Profiling . . . 510
   Summary . . . 514

17  Microsoft SQL Server 2008 Integration Services Packages in Business Intelligence Solutions . . . 515
   ETL for Business Intelligence . . . 515
   Loading OLAP Cubes . . . 516
   Using Integration Services to Check Data Quality . . . 516
   Transforming Source Data . . . 519
   Using a Staging Server . . . 520
   Data Lineage . . . 524
   Moving to Star Schema Loading . . . 525
   Loading Dimension Tables . . . 525
   Loading Fact Tables . . . 527
   Updates . . . 530
   Fact Table Updates . . . 532
   Dimension Table Updates . . . 532
   ETL for Data Mining . . . 533
   Initial Loading . . . 533
   Model Training . . . 534
   Data Mining Queries . . . 535
   Summary . . . 538

18  Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services . . . 539
   Solution and Project Structures in Integration Services . . . 539
   Source Code Control . . . 540
   Using Visual SourceSafe . . . 541
   The Deployment Challenge . . . 546
   Package Configurations . . . 548
   Copy File Deployment . . . 552
   BIDS Deployment . . . 553
   Deployment with the Deployment Utility . . . 556
   SQL Server Agent and Integration Services . . . 558
   Introduction to SSIS Package Security . . . 559
   Handling Sensitive Data and Proxy Execution Accounts . . . 563
   Security: The Two Rules . . . 564
   The SSIS Service . . . 564
   Summary . . . 565

19  Extending and Integrating SQL Server 2008 Integration Services . . . 567
   Introduction to SSIS Scripting . . . 567
   Visual Studio Tools for Applications . . . 568
   The Script Task . . . 568
   The Dts Object . . . 571
   Debugging Script Tasks . . . 572
   The Script Component . . . 573
   The ComponentMetaData Property . . . 580
   Source, Transformation, and Destination . . . 582
   Debugging Script Components . . . 587
   Overview of Custom SSIS Task and Component Development . . . 587
   Control Flow Tasks . . . 591
   Data Flow Components . . . 593
   Other Components . . . 594
   Overview of SSIS Integration in Custom Applications . . . 596
   Summary . . . 600

Part IV   Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence

20  Creating Reports in SQL Server 2008 Reporting Services . . . 603
   Understanding the Architecture of Reporting Services . . . 603
   Installing and Configuring Reporting Services . . . 606
   HTTP Listener . . . 608
   Report Manager . . . 609
   Report Server Web Service . . . 609
   Authentication . . . 610
   Background Processing (Job Manager) . . . 612
   Creating Reports with BIDS . . . 612
   Other Types of Reports . . . 621
   Sample Reports . . . 622
   Deploying Reports . . . 623
   Summary . . . 625

21  Building Reports for SQL Server 2008 Reporting Services . . . 627
   Using the Query Designers for Analysis Services . . . 627
   MDX Query Designer . . . 628
   Setting Parameters in Your Query . . . 631
   DMX Query Designer . . . 633
   Working with the Report Designer in BIDS . . . 635
   Understanding Report Items . . . 638
   List and Rectangle Report Items . . . 639
   Tablix Data Region . . . 639
   Using Report Builder . . . 643
   Summary . . . 646

22  Advanced SQL Server 2008 Reporting Services . . . 647
   Adding Custom Code to SSRS Reports . . . 647
   Viewing Reports in Word or Excel 2007 . . . 649
   URL Access . . . 651
   Embedding Custom ReportViewer Controls . . . 652
   About Report Parameters . . . 656
   About Security Credentials . . . 657
   About the SOAP API . . . 658
   What Happened to Report Models? . . . 660
   Deployment—Scalability and Security . . . 662
   Performance and Scalability . . . 663
   Advanced Memory Management . . . 665
   Scaling Out . . . 666
   Administrative Scripting . . . 667
   Using WMI . . . 668
   Summary . . . 669

23  Using Microsoft Excel 2007 as an OLAP Cube Client . . . 671
   Using the Data Connection Wizard . . . 671
   Working with the Import Data Dialog Box . . . 674
   Understanding the PivotTable Interface . . . 675
   Creating a Sample PivotTable . . . 678
   Offline OLAP . . . 681
   Excel OLAP Functions . . . 683
   Extending Excel . . . 683
   Summary . . . 685

24  Microsoft Office 2007 as a Data Mining Client . . . 687
   Installing Data Mining Add-ins . . . 687
   Data Mining Integration with Excel 2007 . . . 689
   Using the Table Analysis Tools Group . . . 690
   Using the Data Mining Tab in Excel 2007 . . . 700
   Data Mining Integration in Visio 2007 . . . 714
   Client Visualization . . . 718
   Data Mining in the Cloud . . . 720
   Summary . . . 721

25  SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007 . . . 723
   Excel Services . . . 723
   Basic Architecture of Excel Services . . . 724
   Immutability of Excel Sheets . . . 726
   Introductory Sample Excel Services Worksheet . . . 726
   Publishing Parameterized Excel Sheets . . . 729
   Excel Services: The Web Services API . . . 732
   A Real-World Excel Services Example . . . 733
   SQL Server Reporting Services with Office SharePoint Server 2007 . . . 736
   Configuring SQL Server Reporting Services with Office SharePoint Server 2007 . . . 737
   Authoring and Deploying a Report . . . 738
   Using the Report in Office SharePoint Server 2007: Native Mode . . . 740
   Using the Report in Office SharePoint Server 2007: SharePoint Integrated Mode . . . 742
   Using the Report Center Templates . . . 744
   PerformancePoint Server . . . 745
   Summary . . . 745
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747
What do you think of this book? We want to hear from you! Microsoft is interested in hearing your feedback so we can continually improve our books and learning resources for you. To participate in a brief online survey, please visit:
www.microsoft.com/learning/booksurvey/
Foreword

When Lynn Langit's name appears in my inbox or RSS feeds, I never know what to expect—only that it will be interesting! She may be inviting me to share a technical webcast, passing pithy comments about a conference speaker, or recalling the sight of swimming elephants in Zambia, where Lynn tirelessly promotes information technology as a force for improving health care. On this occasion, it was an invitation to write a foreword for this, her latest book, Smart Business Intelligence Solutions with Microsoft SQL Server 2008. As so often, when Lynn asks, the only possible response is, "Of course—I'd be happy to!"

When it comes to business intelligence, Lynn is a compulsive communicator. As a Developer Evangelist at Microsoft, this is part of her job, but Lynn's enthusiasm for the technologies and their implications goes way beyond that. Her commitment is clear in her presentations and webcasts, in her personal engagements with customers across continents, and in her writing. Thinking of this, I am more than pleased to see this new book, especially to see that it tackles the SQL Server business intelligence (BI) technologies in their broad scope.

Business intelligence is never about one technology solving one problem. In fact, a good BI solution can address many problems at many levels—tactical, strategic, and even operational. Part I, "Business Intelligence for Business Decision Makers and Architects," explores these business scenarios.

To solve these problems, you will find that your raw data is rarely sufficient. The BI developer must apply business logic to enrich the data with analytical insights for business users. Without this additional business logic, your system may only tell the users what they already know. Part II, "Microsoft SQL Server 2008 Analysis Services for Developers," takes a deep look at using Analysis Services to create OLAP cubes and data mining models.

By their nature, these problems often require you to integrate data from across your business. SQL Server 2008 Integration Services is the platform for this work, and in Part III, "Microsoft SQL Server 2008 Integration Services for Developers," Lynn tackles this technology. She not only covers the details of building single workloads, but also sets this work in its important architectural context, covering management and deployment of the integration solutions.

Finally, in Part IV, "Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence," there is a detailed exploration of the options for designing and publishing reports. This section also covers other popular "clients"—the applications through which business users interact with your BI solution. So, even if you are a Microsoft Office Excel user, there is valuable information here.

When all of these elements—integration, analysis, and reporting—come together, you know you are implementing a "smart solution," the essence of this most helpful book.
I know from my own work at Microsoft, presenting and writing about BI, how difficult it is to find good symmetry between technology and the business case. I also know how important it is. Architects may build smart technology solutions, but enterprise decision makers put the business into BI. For these readers, Lynn makes very few assumptions. She quickly, yet quite thoroughly, takes the reader through a basic taxonomy of the moving parts of a BI solution.

However, this book is more than a basic introduction—it gets down to the details you need to build effective solutions. Even experienced users will find useful insights and information here. For example, all OLAP developers work with Analysis Services data source views. However, many of them do not even know about the useful data preview feature. In Chapter 7, "Designing OLAP Cubes Using BIDS," Lynn not only describes the feature, but also includes a good example of its use for simple validation and profiling. It is, for me, a good measure of a book that it finds new things to say even about the most familiar features.

For scenarios that may be less familiar to you, such as data mining, Lynn carefully sets out the business cases, the practical steps to take, and the traps to avoid. Having spent many hours teaching and evangelizing about data mining myself, I really admire how Lynn navigates through the subject. In one chapter, she starts from the highest level ("Why would I use data mining?") to the most detailed ("What is the CLUSTERING_METHOD parameter for?"), retaining a pleasant and easy logical flow.

It is a privilege to work at Microsoft with Lynn. She clearly loves working with her customers and the community. This book captures much of her enthusiasm and knowledge in print. You will enjoy it, and I will not be surprised if you keep it close at hand on your desk whenever you work with SQL Server 2008.

Donald Farmer
Principal Program Manager, US-SQL Analysis Services, Microsoft Corporation
Acknowledgments

Many people contributed to making this book. The authors would like to acknowledge those people and the people who support them.
Lynn Langit

Thanks to all those who supported my efforts on this book. First I'd like to thank my inspiration and the one who kept me going during the many months of writing this book—Mahnaz Javid—and the work of her Mona Foundation. Please prioritize caring for the world's needy children and take the time to contribute to organizations that do a good job with this important work. A portion of the proceeds of this book will be donated to the Mona Foundation. For more information, go to http://www.monafoundation.org.

Thanks to my colleagues at Microsoft Press: Ken Jones, Sally Stickney, Maureen Zimmerman; to my Microsoft colleagues: Woody Pewitt, Glen Gordon, Mithun Dhar, Bruno Terkaly, Joey Snow, Greg Visscher, and Scott Kerfoot; and to the SQL Team: Donald Farmer, Francois Ajenstadt, and Zack Owens.

Thanks to my co-writers and community reviewers: Davide Mauri, Sahil Malik, Kevin Goff, Kim Schmidt, Mathew Roche, Ted Malone, and Karen Henderson. Thanks especially to my technical reviewer, John Welch. John, I wish I hadn't made you work so hard!

Thanks to my friends and family for understanding the demands that writing makes on my time and sanity: Lynn C, Teri, Chrys, Esther, Asli, Anton, and, most especially, to my mom and my daughter.
Davide Mauri
A very big thanks to my wife Olga, who always supports me in everything I do; and to Gianluca, Andrea, and Fernando, who allowed me to realize one of my many dreams!
Sahil Malik
Instead of an acknowledgment, I'd like to pray for peace, harmony, wisdom, and inner happiness for everyone.
Introduction
So, why write? What is it that makes typing in a cramped airline seat on an 11-hour flight over Africa so desirable? It's probably because of a love of reading in general, and of learning in particular. It's not by chance that my professional blog at http://blogs.msdn.com/SoCalDevGal is titled "Contagious Curiosity."
To understand why we wrote this particular book, you must start with a look at the current landscape of business intelligence (BI) using Microsoft SQL Server 2008. Business intelligence itself really isn't new. Business intelligence—or data warehousing, as it has been traditionally called—has been used in particular industries, such as banking and retailing, for many years. What is new is the accessibility of BI solutions to a broader audience. Microsoft is leading this widening of the BI market by providing a set of world-class tools with SQL Server 2008. SQL Server 2008 includes the fourth generation of these tools in the box (for no additional fee) and their capabilities are truly impressive. As customers learn about the possibilities of BI, we see ever-greater critical mass adoption. We believe that within the next few years, it will be standard practice to implement both OLTP and (BI) OLAP/data mining solutions for nearly every installation of SQL Server 2008.
One of the most significant hindrances to previous adoption of BI technologies has not been the quality of technologies and tools available in SQL Server 2008 or its predecessors. Rather, what we have found (from our real-world experience) is that a general lack of understanding of BI capabilities is preventing wider adoption. We find that developers, in particular, lack understanding of BI core concepts such as OLAP (or dimensional) modeling and data mining algorithms. This knowledge gap also includes lack of understanding about the capabilities of the BI components and tools included with SQL Server 2008—SQL Server Analysis Services (SSAS), SQL Server Integration Services (SSIS), and SQL Server Reporting Services (SSRS). The gap is so significant, in fact, that it was one of the primary motivators for writing this book.
Far too many times, we've seen customers who lack understanding of core BI concepts struggle to create BI solutions. Ironically, the BI tools included in SQL Server 2008 are in some ways too easy to use. As with many Microsoft products, a right-click in the right place nearly always starts a convenient wizard. So customers quickly succeed in building OLAP cubes and data mining structures; unfortunately, sometimes they have no idea what they've actually created. Often these solutions do not reveal their flawed underlying design until after they've been deployed and are being run with production levels of data. Because the SQL Server 2008 BI tools are designed to be intuitive, BI project implementation is pleasantly simple, as long as what you build properly implements standard BI concepts.
If we've met our writing goals, you'll have enough of both conceptual and procedural knowledge after reading this book that you can successfully envision, design, develop, and deploy a BI project built using SQL Server 2008.
Who This Book Is For
This book has more than one audience. The primary audience is professional developers who want to start work on a BI project using SSAS, SSIS, and SSRS. Our approach is one of inclusiveness—we have provided content targeted at both beginning and intermediate BI developers. We have also included information for business decision makers who wish to understand the capabilities of the technology and Microsoft's associated tools. Because we believe that appropriate architecture is the underpinning of all successful projects, we've also included information for that audience.
We assume that our readers have production experience with a relational database. We also assume that they understand relational database queries, tables, normalization and joins, and other terms and concepts common to relational database implementations. Although we've included some core information about administration of BI solutions, we consider IT pros (or BI administrators) to be a secondary audience for this book.
What This Book Is About
This book starts by helping the reader develop an intuitive understanding of the complexity and capabilities of BI as implemented using SQL Server 2008, and then it moves to a more formal understanding of the concepts, architecture, and modeling. Next, it presents a more process-focused discussion of the implementation of BI objects, such as OLAP cubes and data mining structures, using the tools included in SQL Server 2008.
Unlike many other data warehousing books we've seen on the market, we've attempted to attain an appropriate balance between theory and practical implementation. Another difference between our book and others is that we feel that data mining is a core part of a BI solution. Because of this we've interwoven information about data mining throughout the book and have provided three chapters dedicated to its implementation.
Part I, "Business Intelligence for Business Decision Makers and Architects"
The goal of this part of the book is to answer these questions:
■■ Why use BI?
■■ What can BI do?
■■ How do I get started?
In this first part, we address the business case for BI. We also introduce BI tools, methods, skills, and techniques. This section is written for developers, business decision makers, and
architects. Another way to look at our goal for this section is that we’ve tried to include all of the information you’ll need to understand before you start developing BI solutions using SQL Server Analysis Services in the Business Intelligence Development Studio (BIDS) toolset. Chapter 1, “Business Intelligence Basics” In this chapter, we provide a practical definition of exactly what BI is as implemented in SQL Server 2008. Here we define concepts such as OLAP, dimensional modeling, and more. Also, we discuss tools and terms such as BIDS, MDX, and more. Our aim is to provide you with a foundation for learning more advanced concepts. Chapter 2, “Visualizing Business Intelligence Results” In this chapter, we look at BI from an end user’s perspective using built-in BI client functionality in Microsoft Office Excel 2007. Here we attempt to help you visualize the results of BI projects—namely, OLAP cubes and data mining models. Chapter 3, “Building Effective Business Intelligence Processes” In this chapter, we examine software development life-cycle processes that we use when envisioning, designing, developing, and deploying BI projects. Here we take a closer look at Microsoft Solutions Framework (and other software development life cycles) as applied to BI projects. Chapter 4, “Physical Architecture in Business Intelligence Solutions” In this chapter, we examine best practices for establishing baselines in your intended production BI environment. We cover tools, such as SQL Server Profiler and more, that can help you prepare to begin a BI project. We also talk about physical servers—especially, number and placement. We include an introduction to security concepts. We close by discussing considerations for setting up a BI development environment. Chapter 5, “Logical OLAP Design Concepts for Architects” In this chapter, we take a close look at core OLAP modeling concepts—namely, dimensional modeling. Here we take a look at star schemas, fact tables, dimensional hierarchy modeling, and more.
Part II, “Microsoft SQL Server 2008 Analysis Services for Developers” This part provides you with detailed information about how to use SSAS to build OLAP cubes and data mining models. Most of this section is focused on using BIDS by working on a detailed drill-down of all the features included. As we’ll do with each part of the book, the initial chapters look at architecture and a simple implementation. Subsequent chapters are where we drill into intermediate and, occasionally, advanced concepts. Chapter 6, “Understanding SSAS in SSMS and SQL Server Profiler” In this chapter, we look at OLAP cubes in SQL Server Management Studio and in SQL Server Profiler. We start here because we want you to understand how to script, maintain, and move objects that you’ve created for your BI solution. Also, SQL Server Profiler is a key tool to help you understand underlying MDX or DMX queries from client applications to SSAS structures.
Chapter 7, “Designing OLAP Cubes Using BIDS” In this chapter, we begin the work of developing an OLAP cube. Here we start working with BIDS, beginning with the sample SSAS database Adventure Works 2008 DW. Chapter 8, “Refining Cubes and Dimensions” In this chapter, we dig deeper into the details of building OLAP cubes and dimensions using BIDS. Topics include dimensional hierarchies, key performance indicators (KPIs), MDX calculations, and cube actions. We explore both the cube and dimension designers in BIDS in great detail in this chapter. Chapter 9, “Processing Cubes and Dimensions” In this chapter, we take a look at cube metadata and data storage modes. Here we discuss multidimensional OLAP (MOLAP), hybrid OLAP (HOLAP), and relational OLAP (ROLAP). We also look at the aggregation designer and discuss aggregation strategies in general. We also examine proactive caching. Chapter 10, “Introduction to MDX” In this chapter, we depart from using BIDS and present a tutorial on querying by using MDX. We present core language features and teach via many code examples in this chapter. Chapter 11, “Advanced MDX” In this chapter, we move beyond core language features to MDX queries to cover more advanced language features. We also take a look at how the MDX language is used throughout the BI suite in SQL Server 2008—that is, in BIDS for SSAS and SSRS. Chapter 12, “Understanding Data Mining Structures” In this chapter, we take a look at the data mining algorithms that are included in SSAS. We examine each algorithm in detail, including presenting configurable properties, so that you can gain an understanding of what is possible with SQL Server 2008 data mining. Chapter 13, “Implementing Data Mining Structures” In this chapter, we focus on practical implementation of data mining models using SSAS in BIDS. We work through each tab of the data mining model designer, following data mining implementation from planning to development, testing, and deployment.
Part III, “Microsoft SQL Server 2008 Integration Services for Developers” The goal of this part is to give you detailed information about how to use SSIS to develop extract, transform, and load (ETL) packages. You’ll use these packages to load your OLAP cubes and data mining structures. Again, we’ll focus on using BIDS while working on a detailed drill-down of all the features included. As with each part of the book, the initial chapters look at architecture and start with a simple implementation. Subsequent chapters are where we drill into intermediate and, occasionally, advanced concepts.
Chapter 14, “Architectural Components of Microsoft SQL Server 2008 Integration Services” In this chapter, we examine the architecture of SSIS. Here we take a look at the data flow pipeline and more. Chapter 15, “Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio” In this chapter, we explain the mechanics of package creation using BIDS. Here we present the control flow tasks and then continue by explaining data flow sources, destinations, and transformations. We continue working through the BIDS interface by covering variables, expressions, and the rest of the BIDS interface. Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration Services” In this chapter, we begin by taking a look at the error handling, logging, and auditing features in SSIS. Next we look at some common techniques for assessing data quality, including using the new Data Profiling control flow task. Chapter 17, “Microsoft SQL Server 2008 Integration Services Packages in Business Intelligence Solutions” In this chapter, we take a look at extract, transform, and load processes and best practices associated with SSIS when it’s used as a tool to create packages for data warehouse loading. We look at this using both OLAP cubes and data mining models. Chapter 18, “Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services” In this chapter, we drill into the details of SSIS package deployment and management. Here we look at using Visual SourceSafe (VSS) and other source control solutions to manage distributed package deployment. Chapter 19, “Extending and Integrating SQL Server 2008 Integration Services” In this chapter, we provide an explanation about the details of extending the functionality of SSIS packages using .NET-based scripts.
Part IV, “Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence” The goal of this part is to give you detailed information about how to select and implement client interfaces for OLAP cubes and data mining structures. We’ll look in great detail at SSRS. In addition, we’ll examine using Excel, Visio, or Office SharePoint Server 2007 as your BI client of choice. We’ll look at SSRS architecture, then at designing reports using BIDS and other tools. Then we’ll move to a detailed look at implementing other clients, including a discussion of the process for embedding results in a custom Windows Form or Web Form application. As we do with each part of the book, our first chapters look at architecture, after which we start with simple implementation. Subsequent chapters are where we drill into intermediate and, occasionally, advanced concepts.
Chapter 20, "Creating Reports in SQL Server 2008 Reporting Services" In this chapter, we present the architecture of SQL Server Reporting Services. We cover the various parts and pieces that you'll have to implement to make SSRS a part of your BI solution. Chapter 21, "Building Reports for SQL Server 2008 Reporting Services" In this chapter, we drill into the detail of building reports using BIDS. We take a look at the redesigned interface and then look at the details of designing reports for OLAP cubes and data mining models. Chapter 22, "Advanced SQL Server 2008 Reporting Services" In this chapter, we look at programmatically extending SSRS as well as other advanced uses of SSRS in a BI project. Here we look at using the ReportViewer control in custom SSRS clients. We also take a look at the new integration between SSRS and Excel and Word 2007. Chapter 23, "Using Microsoft Excel 2007 as an OLAP Cube Client" In this chapter, we walk through the capabilities included in Excel 2007 as an OLAP cube client. We take a detailed look at the PivotTable functionality and also examine the PivotChart as a client interface for OLAP cubes. Chapter 24, "Microsoft Office 2007 as a Data Mining Client" In this chapter, we look at using the SQL Server 2008 Data Mining Add-ins for Office 2007. These add-ins enable Excel 2007 to act as a client to SSAS data mining. We look at connecting to existing models on the server as well as creating temporary models in the Excel session using Excel source data. We also examine the new tools that appear on the Excel 2007 Ribbon after installing the add-ins. Chapter 25, "SQL Server 2008 Business Intelligence and Microsoft Office SharePoint Server 2007" In this chapter, we look at integration between SQL Server Reporting Services and SharePoint technologies. We focus on integration between SSRS and Office SharePoint Server 2007. Here we detail the integrated mode option for SSRS and Office SharePoint Server 2007. We also look at the Report Center template included in Office SharePoint Server 2007 and detail just how it integrates with SSRS. We have also included information about Excel Services.
Prerelease Software
This book was written and tested against the release to manufacturing (RTM) 2008 version of SQL Server Enterprise software. Microsoft released the final version of Microsoft SQL Server 2008 (build number 10.0.1600.22) in August 2008. We did review and test our examples against the final release of the software. However, you might find minor differences between the production release and the examples, text, and screen shots in this book. We made every attempt to update all of the samples shown to reflect the RTM; however, minor variances in screen shots or text between the community technology preview (CTP) samples and the RTM samples might still remain.
Hardware and Software Requirements
You'll need the following hardware and software to work with the information and examples provided in this book:
■■ Microsoft Windows Server 2003 Standard edition or later. Microsoft Windows Server 2008 Enterprise is preferred. The Enterprise edition of the operating system is required if you want to install the Enterprise edition of SQL Server 2008.
■■ Microsoft SQL Server 2008 Standard edition or later. Enterprise edition is required for using all features discussed in this book. Installed components needed are SQL Server Analysis Services, SQL Server Integration Services, and SQL Server Reporting Services.
■■ SQL Server 2008 Report Builder 2.0.
■■ Visual Studio 2008 (Team System is used to show the examples).
■■ Office SharePoint Server 2007 (Enterprise Edition or Windows SharePoint Services 3.0).
■■ Office 2007 Professional edition or better, including Excel 2007 and Visio 2007.
■■ SQL Server 2008 Data Mining Add-ins for Office 2007.
■■ 1.6 GHz Pentium III+ processor or faster.
■■ 1 GB of available physical RAM.
■■ 10 GB of hard disk space for SQL Server and all samples.
■■ Video (800 by 600 or higher resolution) monitor with at least 256 colors.
■■ CD-ROM or DVD-ROM drive.
■■ Microsoft mouse or compatible pointing device.
Find Additional Content Online
As new or updated material becomes available that complements this book, it will be posted online on the Microsoft Press Online Developer Tools Web site. The type of material you might find includes updates to book content, articles, links to companion content, errata, sample chapters, and more. This Web site is located at www.microsoft.com/learning/books/online/developer and is updated periodically.
Lynn Langit is recording a companion screencast series named "How Do I BI?" Find this series via her blog at http://blogs.msdn.com/SoCalDevGal.
Support for This Book
Every effort has been made to ensure the accuracy of this book. As corrections or changes are collected, they will be added to a Microsoft Knowledge Base article. Microsoft Press provides support for books at the following Web site: http://www.microsoft.com/learning/support/books/
Questions and Comments
If you have comments, questions, or ideas regarding the book, or questions that are not answered by visiting the site above, please send them to Microsoft Press via e-mail to
[email protected]
Or via postal mail to:
Microsoft Press
Attn: Smart Business Intelligence Solutions with Microsoft SQL Server 2008 Editor
One Microsoft Way
Redmond, WA 98052-6399
Please note that Microsoft software product support is not offered through the above addresses.
Part I
Business Intelligence for Business Decision Makers and Architects
Chapter 1
Business Intelligence Basics
Many real-world business intelligence (BI) implementations have been delayed or even derailed because key decision makers involved in the projects lacked even a general understanding of the potential of the product stack. In this chapter, we provide you with a conceptual foundation for understanding the broad potential of the BI technologies within Microsoft SQL Server 2008 so that you won't have to be in that position. We define some of the basic terminology of business intelligence, including OLTP and OLAP, and go over the components, both core and optional, of Microsoft BI solutions. We also introduce you to the development languages involved in BI projects, including MDX, DMX, XMLA, and RDL. If you already know these basic concepts, you can skip to Chapter 2, "Visualizing Business Intelligence Results," which talks about some of the common business problems that BI addresses.
Business Intelligence and Data Modeling
You'll see the term business intelligence defined in many different ways and in various contexts. Some vendors manufacture a definition that shows their tools in the best possible light. You'll sometimes hear BI summed up as "efficient reporting." With the BI tools included in SQL Server 2008, business intelligence is much more than an overhyped, supercharged reporting system. For the purposes of this book, we define business intelligence in the same way Microsoft does: Business intelligence solutions include effective storage and presentation of key enterprise data so that authorized users can quickly and easily access and interpret it.
The BI tools in SQL Server 2008 allow enterprises to manage their business at a new level, whether to understand why a particular venture got the results it did, to decide on courses of action based on past data, or to accurately forecast future results on the basis of historical data. You can customize the display of BI data so that it is appropriate for each type of user. For example, analysts can drill into detailed data, executives can see timely high-level summaries, and middle managers can request data presented at the level of detail they need to make good day-to-day business decisions.
Microsoft BI usually uses data structures (called cubes or data mining structures) that are optimized to provide fast, easy-to-query decision support. This BI data is presented to users via various types of reporting interfaces. These formats can include custom applications for Microsoft Windows, the Web, or mobile devices as well as Microsoft BI client tools, such as Microsoft Office Excel or SQL Server Reporting Services.
Figure 1-1 shows a conceptual view of a BI solution. In this figure, multiple types of source data are consolidated into a centralized data storage facility. For a formal implementation of a BI solution, the final destination container is most commonly called a cube. This consolidation can be physical—that is, all the source data is physically combined onto one or more servers—or logical, by using a type of view. We consider BI conceptual modeling in more detail in Chapter 5, "Logical OLAP Design Concepts for Architects."
Figure 1-1 Business intelligence solutions present a consolidated view of enterprise data. This view can be a physical or logical consolidation, or a combination of both.
Although it's possible to place all components of a BI solution on a single physical server, it's more typical to use multiple physical servers to implement a BI solution. Microsoft Windows Server 2008 includes tremendous improvements in virtualization, so the number of physical servers involved in a BI solution can be greatly reduced if you are running this version. We talk more about physical modeling for BI solutions in Chapter 4, "Physical Architecture in Business Intelligence Solutions." Before we examine other common BI terms and components, let's review two core concepts in data modeling: OLTP and OLAP.
OLTP and OLAP
You've probably heard the terms OLTP and OLAP in the context of data storage. When planning SQL Server 2008 BI solutions, you need to have a solid understanding of these systems as well as the implications of using them for your particular requirements.
Online Transactional Processing
OLTP stands for online transactional processing and is used to describe a relational data store that is designed and optimized for transactional activities. Transactional activities are defined as inserts, updates, and deletes to rows in tables. A typical design for this type of data storage system is to create a large number of normalized tables in a single source database.
Relational vs. Nonrelational Data
SQL Server 2008 BI solutions support both relational and nonrelational source data. Relational data usually originates from a relational database management system (RDBMS) such as SQL Server 2008 (or an earlier version of SQL Server) or an RDBMS built by a different vendor, such as Oracle or IBM. Relational databases generally consist of a collection of related tables. They can also contain other objects, such as views or stored procedures.
Nonrelational data can originate from a variety of sources, including Windows Communication Foundation (WCF) or Web services, mainframes, and file-based applications, such as Microsoft Office Word or Excel. Nonrelational data can be presented in many formats. Some of the more common formats are XML, TXT, CSV, and various binary formats.
Normalization in relational data stores is usually implemented by creating a primary-key-to-foreign-key relationship between the rows in one table (often called the parent table) and the rows in another table (often called the child table). Typically (though not always), the rows in the parent table have a one-to-many relationship with the rows in the child table. A common example of this relationship is a Customer table and one or more related [Customer] Orders tables. In the real world, examples are rarely this simple. Variations that include one-to-one or many-to-many relationships, for example, are possible. These relationships often involve many source tables. Figure 1-2 shows the many tables that can result when data stores modeled for OLTP are normalized. The tables are related by keys.
Figure 1-2 Sample from AdventureWorks OLTP database
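To make the parent/child pattern concrete, here is a minimal Transact-SQL sketch of a normalized customer table and a related orders table. The table and column names are simplified illustrations, not the actual AdventureWorks schema.

-- Simplified normalized (OLTP) pattern: one parent row per customer,
-- many child rows per customer in the related orders table.
CREATE TABLE dbo.Customer
(
    CustomerID   INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.SalesOrderHeader
(
    SalesOrderID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID   INT NOT NULL
        REFERENCES dbo.Customer (CustomerID), -- foreign key to the parent table
    OrderDate    DATETIME NOT NULL,
    TotalDue     MONEY    NOT NULL
);

-- A second order for an existing customer touches only the child table;
-- no customer information is duplicated.
INSERT INTO dbo.SalesOrderHeader (CustomerID, OrderDate, TotalDue)
VALUES (1, '20080315', 149.99);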
The primary reasons to model a data store in this way (that is, normalized) are to reduce the total amount of data that needs to be stored and to improve the efficiency of performing inserts, updates, and deletes by reducing the number of times the same data needs to be added, changed, or removed. Extending the example in Figure 1-2, if you inserted a second order for an existing customer and the customer’s information hadn’t changed, no new information would have to be inserted into the Customer table; instead, only one or more rows would have to be inserted into the related Orders tables, using the customer identifier (usually a key value), to associate the order information with a particular customer. Although this type of modeling is efficient for these activities (that is, inserting, updating, and deleting data), the challenge occurs when you need to perform extensive reading of these types of data stores. To retrieve meaningful information from the list of Customers and the particular Order information shown in Figure 1-2, you’d first have to select the rows meeting the report criteria from multiple tables and then sort and match (or join) those rows to create the information you need. Also, because a common business requirement is viewing aggregated information, you might want to see the total sales dollar amount purchased for each customer, for example. This requirement places additional load on the query processing engine of your OLTP data store. In addition to selecting, fetching, sorting, and matching the rows, the engine also has to aggregate the results. If the query you make involves only a few tables (for example, the Customer table and the related SalesOrderHeader tables shown in Figure 1-2), and if these tables contain a small
number of rows, the overhead incurred probably would be minimal. (In this context, the definition of a “small” number is relative to each implementation and is best determined by baseline performance testing during the development phase of the project.) You might be able to use a highly normalized OLTP data store to support both CRUD (create, retrieve, update, delete) and read-only (decision support or reporting) activities. The processing speed depends on your hardware resources and the configuration settings of your database server. You also need to consider the number of users who need to access the information simultaneously. These days, the OLTP data stores you are querying often contain hundreds or even thousands of source tables. The associated query processors must filter, sort, and aggregate millions of rows from the related tables. Your developers need to be fluent in the data store query language so that they can write efficient queries against such a complex structure, and they also need to know how to capture and translate every business requirement for reporting. You might need to take additional measures to improve query (and resulting report) performance, including rewriting queries to use an optimal query syntax, analyzing the query execution plan, providing hints to force certain execution paths, and adding indexes to the relational source tables. Although these strategies can be effective, implementing them requires significant skill and time on the part of your developers. Figure 1-3 shows an example of a typical reporting query—not the complexity of the statement, but the number of tables that can be involved.
Figure 1-3 Sample reporting query against a normalized data store
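The shape of such a reporting query can be sketched with the illustrative tables defined earlier; the real AdventureWorks version joins many more tables, but the select-join-aggregate pattern is the same.

-- Total sales amount per customer for one year: the engine must select,
-- join, sort/match, and aggregate the detail rows at query time.
SELECT   c.CustomerID,
         c.CustomerName,
         SUM(soh.TotalDue) AS GrossSalesAmount
FROM     dbo.Customer AS c
JOIN     dbo.SalesOrderHeader AS soh
         ON soh.CustomerID = c.CustomerID
WHERE    soh.OrderDate >= '20080101'
  AND    soh.OrderDate <  '20090101'
GROUP BY c.CustomerID, c.CustomerName
ORDER BY GrossSalesAmount DESC;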
SQL Server 2008 includes a powerful tool, the Database Engine Tuning Advisor, that can assist you in manual tuning efforts, though the amount of time needed to implement and maintain manual query tuning can become significant. Other costs are also involved with OLTP query optimization, the most significant of which is the need for additional storage space and maintenance tasks as indexes are added. A good way to think of the move from OLTP alone to a combined solution that includes both an OLTP store and an OLAP store is as a continuum: from the original relational OLTP store (a single RDBMS), to a denormalized relational copy built for reporting, to a nonrelational OLAP cube. In particular, if you're already making a copy of your OLTP source data to create a more efficient data structure from which to query for reporting and to reduce the load on your production OLTP servers, you're a prime candidate to move to a more formalized OLAP solution based on the dedicated BI tools included in SQL Server 2008.
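One of the manual measures mentioned above, adding an index, might look like the following hypothetical example. The covering index speeds up the reporting query at the cost of extra storage and extra work on every insert, update, and delete.

-- A covering index to support the reporting query shown earlier.
-- Faster reads, but additional storage and index maintenance overhead
-- for the transactional (OLTP) workload.
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_Customer_OrderDate
ON dbo.SalesOrderHeader (CustomerID, OrderDate)
INCLUDE (TotalDue);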
Online Analytical Processing
OLAP stands for online analytical processing and is used to describe a data structure that is designed and optimized for analytical activities. Analytical activities are defined as those that focus on the best use of data for the purpose of reading it rather than optimizing it so that changes can be made in the most efficient way. In fact, many OLAP data stores are implemented as read-only. Other common terms for data structures set up and optimized for OLAP are decision support systems, reporting databases, data warehouses, or cubes. As with OLTP, definitions of an OLAP data store vary depending on who you're talking to.
At a minimum, most professionals would agree that an OLAP data store is modeled in a denormalized way. When denormalizing, you use very wide relational tables (those containing many columns) with deliberately duplicated information. This approach reduces the number of tables that must be joined to provide query results and lets you add indexes. Reducing the size of the surface area that is queried results in faster query execution. Data stores that are modeled for OLAP are usually denormalized using a specific type of denormalization modeling called a star schema. We cover this technique extensively in Chapter 5, "Logical OLAP Design Concepts for Architects."
Although using a denormalized relational store addresses some of the challenges encountered when trying to query an OLTP (or normalized) data store, a denormalized relational store is still based on relational tables, so the bottlenecks are only partially mitigated. Developers still must write complex queries for each business reporting requirement and then manually denormalize the copy of the OLTP store, add indexes, and further tune queries as performance demands. The work involved in performing these tasks can become excessive.
Figure 1-4 shows a portion of a denormalized relational data store. We are working with the AdventureWorksDW sample database, which is freely available for download and is designed to help you understand database modeling for loading OLAP cubes. Notice the numerous columns in each table.
Figure 1-4 Portion of AdventureWorksDW
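As a rough sketch of what denormalization looks like in practice, the following hypothetical tables follow the star-style pattern used by AdventureWorksDW: a wide dimension table with deliberately repeated attributes, and a fact table keyed to it. The names are illustrative, not the actual sample schema.

-- A wide, deliberately denormalized dimension table: geography and other
-- attributes are repeated on every customer row instead of being split
-- into separate lookup tables.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey   INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName  NVARCHAR(100),
    City          NVARCHAR(50),
    StateProvince NVARCHAR(50),
    CountryRegion NVARCHAR(50),
    MaritalStatus NCHAR(1),
    HomeOwnerFlag BIT
);

-- The fact table holds the numeric measures, keyed to the dimensions.
CREATE TABLE dbo.FactSales
(
    CustomerKey   INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    DateKey       INT NOT NULL,   -- surrogate key into a date dimension
    SalesAmount   MONEY NOT NULL,
    OrderQuantity INT   NOT NULL
);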
Another way to implement OLAP is to use a cube rather than a group of tables. A cube is one large store that holds all associated data in a single structure. The structure contains not only source data but also pre-aggregated values. A cube is also called a multidimensional data store.
Aggregation
Aggregation is the application of some type of mathematical function to data. Although aggregation can be as simple as doing a SUM on numeric values, in BI solutions, it is often much more complex. In most relational data stores, the number of aggregate functions available is relatively small. For SQL Server 2008, for example, Transact-SQL contains 12 aggregate functions: AVG, MIN, CHECKSUM_AGG, SUM, COUNT, STDEV, COUNT_BIG, STDEVP, GROUPING, VAR, MAX, and VARP. Contrast this with the number of built-in functions available in the SQL Server cube store: over 150. The number and type of aggregate functions available in the cube store are more similar to those available in Excel than to those in SQL Server 2008.
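For reference, here are a few of the Transact-SQL aggregate functions named above, applied to the illustrative fact table from the previous example.

-- Several of the 12 Transact-SQL aggregate functions in one query.
SELECT   CustomerKey,
         COUNT(*)           AS OrderLines,
         SUM(SalesAmount)   AS TotalSales,
         AVG(SalesAmount)   AS AverageSale,
         MIN(SalesAmount)   AS SmallestSale,
         MAX(SalesAmount)   AS LargestSale,
         STDEV(SalesAmount) AS SalesStdDev
FROM     dbo.FactSales
GROUP BY CustomerKey;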
So, what exactly does a cube look like? It can be tricky to visualize a cube because most of us can imagine only in three dimensions. Cubes are n-dimensional structures because they can store data in an infinite number of dimensions. A conceptual rendering of a cube is shown in
Figure 1-5. Cubes contain two major aspects: facts and dimensions. Facts are often numeric and additive, although that isn’t a requirement. Facts are sometimes called measures. An example of a fact is “gross sales amount.” Dimensions give meaning to facts. For example, you might need to be able to examine gross sales amount by time, product, customer, and employee. All of the “by xxx” values are dimensions. Dimensional information is accessed via a hierarchy of information along each dimensional axis. You’ll also hear the term Unified Dimensional Model (UDM) to describe an OLAP cube because, in effect, such a cube “unifies” a group of dimensions. We discuss this type of modeling in much greater detail in Chapter 5.
Figure 1-5 Sample cube structure
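The "measure by dimension" idea behind Figure 1-5 can be approximated in relational terms with the illustrative fact table used earlier. The sketch below uses GROUPING SETS (new in SQL Server 2008) to produce the detail cells, subtotals, and grand total that a cube serves from pre-aggregated storage; a cube's native query language is MDX, introduced in Chapter 10.

-- "Gross sales amount" (the fact) by customer and by date (the dimensions),
-- including subtotals and a grand total, computed at query time.
SELECT   CustomerKey,
         DateKey,
         SUM(SalesAmount) AS GrossSalesAmount
FROM     dbo.FactSales
GROUP BY GROUPING SETS
(
    (CustomerKey, DateKey),  -- cell-level detail
    (CustomerKey),           -- subtotal by customer
    (DateKey),               -- subtotal by date
    ()                       -- grand total
);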
You might be wondering why a cube is preferable to a denormalized relational data store. The answer is efficiency—in terms of scalability, performance, and ease of use. In the case of using the business intelligence toolset in SQL Server 2008, you’ll also get a query processing engine that is specifically optimized to deliver fast queries, particularly for queries that involve aggregation. We’ll review the business case for using a dedicated OLAP solution in more detail in Chapter 2. Also, we’ll look at real-world examples throughout the book. The next phase of understanding an OLAP cube is translating that n-dimensional structure to a two-dimensional screen so that you can visualize what users will see when working with an OLAP cube. The standard viewer for a cube is a pivot table interface, a sample of which
is shown in Figure 1-6. The built-in viewers in the developer and administrative interfaces for SQL Server 2008 OLAP cubes both use a type of pivot table to allow developers and administrators to visualize the cubes they are working with. Excel 2007 PivotTables are also a common user interface for SQL Server 2008 cubes.
Figure 1-6 SQL Server 2008 cube presented in a PivotTable
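For readers more comfortable with relational syntax, the Transact-SQL PIVOT operator gives a rough, static flavor of what a pivot table viewer does interactively against a cube: one dimension's members are rotated into columns. The table, the chosen years, and the assumption that DateKey is stored as a YYYYMMDD integer are all illustrative.

-- Rotate the year members into columns, summing the sales measure.
SELECT  CustomerKey,
        [2008] AS Sales2008,
        [2009] AS Sales2009
FROM
(
    SELECT CustomerKey,
           DateKey / 10000 AS OrderYear,   -- assumes DateKey in YYYYMMDD form
           SalesAmount
    FROM   dbo.FactSales
) AS src
PIVOT
(
    SUM(SalesAmount) FOR OrderYear IN ([2008], [2009])
) AS pvt;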
Common BI Terminology
A number of other conceptual terms are important to understand when you're planning a BI solution. In this section, we'll talk about several of these: data warehouses; data marts; cubes; decision support systems; data mining systems; extract, transform, and load systems; report processing systems; and key performance indicators.
Data Warehouses
A data warehouse is a single structure that usually consists of one or more cubes. Figure 1-7 shows the various data sources that contribute to an OLAP cube. Data warehouses are used to hold an aggregated, or rolled-up, and most commonly read-only, view of the majority of an organization's data. This structure includes client query tools. When planning and implementing your company's data warehouse, you need to decide which data to include and at what level of detail (or granularity). We explore this concept in more detail in "Extract, Transform, and Load Systems" later in this chapter.
Figure 1-7 Conceptual OLAP cube
The terms OLAP and data warehousing are sometimes used interchangeably. However, this is a bit of an oversimplification because an OLAP store is modeled as a cube or multidimensionally, whereas a data warehouse can use either denormalized OLTP data or OLAP. OLAP and data warehousing are not new technologies. The first Microsoft OLAP tools were part of SQL Server 7. What is new in SQL Server 2008 is the inclusion of powerful tools that allow you to implement a data warehouse using an OLAP cube (or cubes). Implementing BI solutions built on OLAP is much easier because of improved tooling, performance, administration, and usability, which reduces total cost of ownership (TCO).
Pioneers of Data Warehousing
Data warehousing has been available, usually implemented via specialized tools, since the early 1980s. Two principal thought leaders of data warehousing theory are Ralph Kimball and Bill Inmon. Both have written many articles and books and have popular Web sites talking about their extensive experience with data warehousing solutions using products from many different vendors. To read more about Ralph Kimball's ideas on data warehouse design modeling, go to http://www.ralphkimball.com. I prefer the Kimball approach to modeling and have had good success implementing Kimball's methods in production BI projects. For a simple explanation of the Kimball approach, see http://en.wikipedia.org/wiki/Ralph_Kimball.
Data Marts
A data mart is a defined subset of enterprise data, often a single cube from a group of cubes, that is intended to be consolidated into a data warehouse. The single cube represents one business unit (for example, marketing) from a greater whole (for example, the entire company). Data marts were the basic units of organization in the OLAP tools that were included in earlier versions of SQL Server BI solutions because of restrictions in the tools themselves. The majority of these restrictions were removed in SQL Server 2005. Because of this, data warehouses built using the tools provided by SQL Server 2008 often consist of one huge cube. This is not the case with many competitive OLAP products. There are, of course, exceptions to this single-cube design. However, limits in the product stack are not what determine this type of design; rather, it is determined by OLAP modeler or developer preferences.
Cubes
As described earlier in the chapter, a BI cube is a data structure used by classic data warehousing products (including SQL Server 2008) in place of many relational tables. Rather than containing tables with rows and columns, cubes consist of dimensions and measures (or facts). Cubes can also contain data that is pre-aggregated (usually summed) rather than included as individual items (or rows). In some cases, cubes contain a complete copy of production data; in other cases, they contain subsets of source data. In SQL Server 2008, cubes are more scalable and perform better than in previous versions of SQL Server, so you can include data with much more detail than you could when using the OLAP tools in those previous versions, with many fewer adverse effects on scalability and performance. As in previous versions, when you are using SQL Server 2008, you will copy the source data from any number of disparate source systems to the destination OLAP cubes via extract, transform, and load (ETL) processes. (You'll find out more about ETL shortly.) We talk a lot about cubes in this book, from their physical and logical design in Chapters 5 and 6 to the use of the cube-building tools that come with SQL Server 2008 in Part II, "Microsoft SQL Server 2008 Analysis Services for Developers."
Decision Support Systems
The term decision support system can mean anything from a read-only copy of an OLTP data store to a group of OLAP cubes, or even a mixture of both. If the data source consists of only an OLTP data store, then this type of store can be limited in its effectiveness because of the challenges discussed earlier in this chapter, such as the difficulty of efficient querying and the overhead required for indexes. Another way to think about a decision support system is as some type of data structure (such as a table or a cube) that is being used as a basis for developing end-user reporting. End-user in this context means all types or categories of users. These
usually include business decision makers, middle managers, and general knowledge workers. It is critical that your solution be able to provide data summarized at a level of detail that is useful to these various end-user communities. The best BI solutions are intuitive for the various end-user communities to work with—little or no end-user training is needed. In this book, we focus on using the more efficient OLAP data store (or cube) as a source for a decision support system.
Data Mining Systems
Data mining can be understood as a complementary technique to OLAP. Whereas OLAP is used to provide decision support or the data to prove a particular hypothesis, data mining is used in situations in which you have no solid hypothesis about the data. For example, you could use an OLAP cube to verify that customers who purchased a certain product during a certain timeframe had certain characteristics. Specifically, you could prove that customers who purchased cars during December 2007 chose red-colored cars twice as often as they picked black-colored cars if those customers shopped at locations in postal codes 90201 to 90207.
You could use a data mining store to automatically correlate purchase factors into buckets, or groups, so that decision makers could explore correlations and then form more specific hypotheses based on their investigation. For example, they could decide to group or cluster all customers segmented into "car purchasers" and "non-car-purchasers" categories. They could further examine the clusters to find that "car purchasers" had the following traits most closely correlated, in order of priority: home owners (versus non–home owners), married (versus single), and so on.
Another scenario for which data mining is frequently used is one where your business requirements include the need to predict one or more future target values in a dataset. An example of this would be the rate of sale—that is, the number of items predicted to be sold over a certain period of time. We explore the data mining support included in SQL Server 2008 in greater detail in Chapter 6, "Understanding SSAS in SSMS and SQL Server Profiler," in the context of logical modeling. We also cover it in several subsequent chapters dedicated to implementing solutions that use data mining.
Extract, Transform, and Load Systems
Commonly expressed as ETL, extract, transform, and load refers to a set of services that facilitate the extraction, transformation, and loading of the various types of source data (for example, relational, semi-structured, and unstructured) into OLAP cubes or data mining structures. SQL Server 2008 includes a sophisticated set of tools to accomplish the ETL processes associated with the initial loading of data into cubes as well as to process subsequent
incremental inserts of data into cubes, updates to data in cubes, and deletions of data from cubes. ETL is explored in detail in Part III, “Microsoft SQL Server 2008 Integration Services for Developers.” A common error made in BI solutions is underestimating the effort that will be involved in the ETL processes for both the initial OLAP cube and the data mining structure loads as well as the effort involved in ongoing maintenance, which mostly consists of inserting new data but can also include updating and deleting data. It is not an exaggeration to say that up to 75 percent of the project time for the initial work on a BI project can be attributed to the ETL portion of the project. The “dirtiness,” complexity, and general incomprehensibility of the data originating from various source systems are factors that are often overlooked in the planning phase. By “dirtiness,” we mean issues such as invalid data. This can include data of an incorrect type, format, length, and so on.
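In practice this kind of validation is built into SSIS data flows (covered in Part III), but the idea behind detecting "dirty" rows can be sketched in Transact-SQL against a hypothetical staging table; the table and column names below are illustrative.

-- Flag staged rows whose raw text values will not load cleanly into the
-- target types: bad dates, non-numeric amounts, overlong codes.
SELECT  StageRowID,
        OrderDateText,
        AmountText,
        CustomerCode
FROM    dbo.StagingSales
WHERE   ISDATE(OrderDateText) = 0
   OR   ISNUMERIC(AmountText) = 0
   OR   LEN(CustomerCode) > 10;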
Report Processing Systems
Most BI solutions use more than one type of reporting client because of the different needs of the various users who need to interact with the cube data. An important part of planning any BI solution is to carefully consider all possible reporting tools. A common production mistake is to under-represent the various user populations or to clump them together when a more thorough segmentation would reveal very different reporting needs for each population. SQL Server 2008 includes a report processing system designed to support OLAP data sources. In Part IV, "Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence," we explore the included tools—such as SQL Server Reporting Services, Office SharePoint Server 2007, and PerformancePoint Server—as well as other products that are part of the Microsoft product suite for reporting.
Key Performance Indicators
Key performance indicators (KPIs) are generally used to indicate a goal that consists of several values—actual, target, variance to target, and trend. Most often, KPIs are expressed and displayed graphically—for example, as different colored traffic lights (red, yellow, or green). KPIs usually include drill-down capabilities that allow interested decision makers to review the data behind the KPI. KPIs can be implemented as part of an OLAP system, and they are often part of reporting systems, which are most typically found on reporting dashboards or scorecards. It is quite common for business requirements to include the latter as part of a centralized performance management strategy. Microsoft's BI tools include the ability to create and view KPIs. You can create KPIs from nearly any type of data source, such as OLAP cubes, Excel workbooks, or SharePoint lists.
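In SSAS, KPIs are defined with MDX expressions (see Chapter 8), but the values a KPI carries can be illustrated relationally. The target amount and status thresholds below are hypothetical.

-- Actual, target, variance to target, and a traffic-light status value
-- for a simple sales KPI, computed per customer.
SELECT   CustomerKey,
         SUM(SalesAmount)            AS ActualSales,
         50000.00                    AS TargetSales,
         SUM(SalesAmount) - 50000.00 AS VarianceToTarget,
         CASE
             WHEN SUM(SalesAmount) >= 50000.00       THEN 'Green'
             WHEN SUM(SalesAmount) >= 0.9 * 50000.00 THEN 'Yellow'
             ELSE 'Red'
         END                         AS KpiStatus
FROM     dbo.FactSales
GROUP BY CustomerKey;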
Core Components of a Microsoft BI Solution
The core components of a Microsoft BI solution are SQL Server Analysis Services (SSAS), SQL Server Reporting Services (SSRS), and SQL Server 2008 itself. SQL Server is often used as a data source or an intermediate repository for data as it is being validated in preparation for loading into an OLAP cube. The ETL toolset, called SQL Server Integration Services (SSIS), also requires a SQL Server 2008 license. Each of these core services comes with its own set of management interface tools. The SSAS development interface is the Business Intelligence Development Studio (BIDS); SSIS and SSRS use the same development interface. The administrative interface all three use is SQL Server Management Studio (SSMS).
We typically use SQL Server Reporting Services in our solutions. Sometimes a tool other than SSRS will be used as the report engine for data stores developed using SSAS. However, you will typically use, at a minimum, the components just listed for a BI solution built using SQL Server 2008. For example, you could purchase or build a custom application that doesn't use SSRS to produce reports. In that case, the application would include data queries as well as calls to the SSAS structures via the application programming interfaces (APIs) included for it.
SQL Server 2008 Analysis Services
SSAS is the core service in a Microsoft BI solution. It provides the storage and query mechanisms for the data used in OLAP cubes for the data warehouse. It also includes sophisticated OLAP cube developer and administrative interfaces. SSAS is usually installed on at least one dedicated physical server. You can install both SQL Server 2008 and SSAS on the same physical server, but this is done mostly in test environments.
Figure 1-8 shows the primary tool you'll use to develop cubes for SSAS, the Business Intelligence Development Studio. BIDS opens in a Microsoft Visual Studio environment. You don't need a full Visual Studio installation to develop cubes for SSAS. If Visual Studio is not on your development machine, BIDS installs as a stand-alone component when you install SSAS. If Visual Studio is on your development machine, BIDS installs as a component (really, a set of templates) in your existing Visual Studio instance.
Figure 1-8 AdventureWorksDW in the OLAP cube view within BIDS
Note If you’re running a full version of Visual Studio 2008 on the same machine where you intend to work with SSAS, you must install Service Pack 1 (SP1) for Visual Studio 2008.
AdventureWorksDW
AdventureWorksDW is the sample data and metadata that you can use while learning about the tools and capabilities of the SQL Server 2008 BI tools. We provide more information about how to work with this sample later in this chapter. All screen shots in this book show this sample being used. The samples include metadata and data so that you can build OLAP cubes and mining structures, SSIS packages, and SSRS reports. These samples are also available on Microsoft's public, shared-source Web site: CodePlex at http://codeplex.com/SqlServerSamples. Here you'll find the specific locations from which you can download these samples. Be sure to download samples that match your version (for example, 2008 or 2005) and platform (x86 or x64). When running the samples, be sure to use the sample for your edition of SQL Server.
Data Mining with Analysis Services 2008
SSAS also includes a component that allows you to create data mining structures that include data mining models. Data mining models are objects that contain source data (either relational or multidimensional) that has been processed using one or more data mining algorithms. These algorithms either classify (group) data or classify and predict one or more column values. Although data mining has been available since SSAS 2000, its capabilities have been significantly enhanced in the SQL Server 2008 release. Performance is improved, and additional configuration capabilities are available.
Figure 1-9 shows a data mining model visualizer that comes with SQL Server 2008. Data mining visualizers are included in the data mining development environment (BIDS), as well as in some client tools, such as Excel. Chapter 12, "Understanding Data Mining Structures," and Chapter 13, "Implementing Data Mining Structures," cover the data mining capabilities in SSAS in more detail.
Figure 1-9 Business Intelligence Development Studio (BIDS) data mining visualizer
SQL Server 2008 Reporting Services
Another key component in many BI solutions is SQL Server Reporting Services (SSRS). When working with SQL Server 2008 to perform SSRS administrative tasks, you can use a variety of included tools such as SSMS, a reporting Web site, or a command-line tool. The enhancements made in SQL Server 2008 Reporting Services make it an attractive part of any BI solution. The SSRS report designer in BIDS includes a visual query designer for SSAS cubes, which facilitates rapid report creation by reducing the need to write manual queries against OLAP cube data. SSRS includes another report-creation component, Report Builder, which is intended to be used by analysts, rather than developers, to design reports. SSRS also includes several client tools: a Web interface (illustrated in Figure 1-10), Web Parts for Microsoft Office SharePoint Server, and client components for Windows Forms applications. We discuss all flavors of reporting clients in Part IV.
Figure 1-10 SSRS reports can be displayed by using the default Web site interface.
SQL Server 2008
In addition to being a preferred staging source for BI data, SQL Server 2008 RDBMS data is often a portion of the source data for BI solutions. As we mentioned earlier in this chapter, data can be and often is retrieved from a variety of relational source data stores (for example, Oracle, DB2, and so forth). To be clear, data from any source for which there is a provider can be used as source data for an OLAP (SSAS) cube, which means data from all versions of SQL Server along with data from other RDBMS systems.
A SQL Server 2008 installation isn't strictly required to implement a BI solution; however, because of the integration of some key toolsets that are part of nearly all BI solutions, such as SQL Server Integration Services—which is usually used to perform the ETL processes for the OLAP cubes and data mining structures—most BI solutions should include at least one SQL Server 2008 installation. As we said earlier, although the SQL Server 2008 installation can be on the same physical server where SSAS is installed, it is more common to use a dedicated server. You use the SQL Server Management Studio to administer OLTP databases, SSAS (OLAP) cubes and data mining models, and SSIS packages. The SSMS interface showing only the Object Explorer is shown in Figure 1-11.
Figure 1-11 SSMS in SQL Server 2008
SQL Server 2008 Integration Services
SSIS is a key component in most BI solutions. This toolset is used to import, cleanse, and validate data prior to making the data available to SSAS for OLAP cubes or data mining structures. It is typical to use data from many disparate sources (for example, relational, flat file, XML, and so on) as source data to a data warehouse. For this reason, a sophisticated toolset such as SSIS facilitates the complex data loads (ETL) that are common to BI solutions. The units of execution in SSIS are called packages. Packages are XML files that you can think of as sets of instructions, designed using the visual tools in BIDS. We discuss planning, implementation, and many other considerations for SSIS packages in Part III. You'll use BIDS to design, develop, execute, and debug SSIS packages. The BIDS SSIS package design environment is shown in Figure 1-12.
Figure 1-12 BIDS SSIS package designer
Optional Components of a Microsoft BI Solution
In addition to the components that are included with SQL Server 2008, a number of other Microsoft products can be used as part of your BI solution. Most of these products allow you to deliver the reports generated from Analysis Services OLAP cubes and data mining structures in formats customized for different audiences, such as complex reports for business analysts and summary reports for executives. Here is a partial list of Microsoft products that include integration with Analysis Services OLAP cubes and data mining models: ■■
Microsoft Office Excel 2007 Many companies already own Office 2007, so using Excel as a BI client is often attractive for its low cost and relatively low training curve. In addition to being a client for SSAS OLAP cubes through the use of the PivotTable feature or the Data Mining Add-ins for SQL Server 2008, Excel can also be a client for data mining structures. (Note that connecting to an OLAP data source from Excel 2003 only does require that MS Query be installed. MS Query is listed under optional components on the Office installation DVD.)
■■ Microsoft Word 2007 SSRS reports can be exported as Word documents that are compatible with Office 2003 or 2007.
■■ Microsoft Visio 2007 Using the Data Mining Add-ins for SQL Server 2008, you can create customized views of data mining structures.
■■ Microsoft Office SharePoint Server 2007 Office SharePoint Server 2007 contains both specialized Web site templates designed to show reports (the most common one being the Report Center) and Web Parts that can be used to display individual reports on Office SharePoint Server 2007 Web pages. A Web Part is a pluggable user interface (UI) element that shows some bit of content. It is installed globally in the SharePoint
Portal Server Web site and can be added to an Office SharePoint Server 2007 portal Web page by any user with appropriate permissions.
■■ Microsoft PerformancePoint Server PerformancePoint Server allows you to quickly create a centralized Web site with all of your company's performance metrics. The environment is designed to allow business analysts to create sophisticated dashboards that are hosted in a SharePoint environment. These dashboards can contain SSRS reports and visualizations of data from OLAP cubes as well as other data sources. It also has a strong set of products that support business forecasting. PerformancePoint Server includes the functionality of the Business Scorecard Manager and ProClarity Analytics Server. Its purpose is to facilitate the design and hosting of enterprise-level scorecards via rich data-visualization options such as charts and reports available in Reporting Services, Excel, and Visio. PerformancePoint Server also includes some custom visualizers, such as the Strategy Map.
Note We are sometimes asked what happened to ProClarity, a company that had provided a specialized client for OLAP cubes. Its target customer was the business analyst. Microsoft acquired ProClarity in 2006 and has folded features of its products into PerformancePoint Server. Microsoft also offers other products—such as Dynamics, Project Server, and BizTalk Server—that use the Analysis Services storage mechanism and query engine. In addition to Microsoft products that are designed to integrate with the BI tools available in SQL Server 2008, you might elect to use some other developer products to improve productivity if your project's requirements call for .NET coding. Recall that the primary development tool is BIDS and that BIDS does not require a Visual Studio installation. Microsoft has found that BI developers are frequently also .NET developers, so most of them already have Visual Studio 2008. As was mentioned, in this situation, installing SSAS, SSIS, or SSRS installs the associated developer templates into the default Visual Studio installation. Another consideration is the management of source code for large or distributed BI development teams. In this situation, you can also elect to add Visual Studio Team System (VSTS) for source control, automated testing, and architectural planning. The data that you integrate into your BI solution might originate from relational sources. These sources can, of course, include SQL Server 2008. They can also include nearly any type of relational data—SQL Server (all versions), Oracle, DB2, Informix, and so forth. It is also common to include nonrelational data in BI solutions. Sources for this data can include Microsoft Access databases, Excel spreadsheets, and so forth. It is also common to include text data (often from mainframes). This data is sometimes made available as XML. This XML might or might not include schema and mapping information. If complex XML processing is part of your requirements, you can elect to use BizTalk Server to facilitate flexible mapping and loading of this XML data.
You might be thinking at this point, “Wow, that’s a big list! Am I required to buy (or upgrade to) all of those Microsoft products in order to implement a BI solution for my company?” The answer is no. The only service that is required for an OLAP BI solution is SSAS. Also, many companies provide tools that can be used as part of a Microsoft BI solution. Although we occasionally refer to some third-party products in this book, we’ll focus primarily on using Microsoft’s products and tools to build a BI solution.
Query Languages Used in BI Solutions When working with BI solutions built on SSAS cubes and data mining structures, you use several query languages. The primary query language for OLAP cubes is MDX. SSAS also includes the ability to build data mining structures. To query the data in these structures, you use DMX. XMLA is a specialized administrative scripting language used with SSAS objects (OLAP databases, cubes, and data mining structures). Finally, RDL is the XML dialect behind SSRS reports. In the following sections, we briefly describe each language and provide a sample.
MDX MDX, which stands for Multidimensional Expressions, is the language used to query OLAP cubes. Although MDX is officially an open standard and some vendors outside of Microsoft have adopted parts of it in their BI products, the reality is that comparatively few working .NET developers are proficient in MDX. A mitigating factor is that the amount of MDX you must write by hand in a BI solution is often relatively small, far less than the amount of Transact-SQL you would typically write for an OLTP database. However, retaining developers who have at least a basic knowledge of MDX is an important consideration in planning a BI project. We review core techniques as well as best practices for working with MDX in Chapter 10, "Introduction to MDX," and Chapter 11, "Advanced MDX." The MDX query language is used to retrieve data from SSAS cubes. A simple MDX query is shown in Figure 1-13. Although MDX has a SQL-like structure, it is far more difficult to master because of the complexity of the SSAS source data structures—which are multidimensional OLAP cubes.
Figure 1-13 A sample MDX query
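Because the figure itself is not reproduced here, the following minimal query gives a sense of the syntax. It is written against the Adventure Works sample cube used throughout this book, so the cube, measure, and hierarchy names come from that sample and will differ in your own cubes:

SELECT
    [Measures].[Internet Sales Amount] ON COLUMNS,
    [Product].[Product Categories].[Category].MEMBERS ON ROWS
FROM [Adventure Works]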
DMX Data Mining Extensions (DMX) is used to query Analysis Services data mining structures. (We devote several future chapters to the design and implementation of SSAS data mining structures.) Although this language is based loosely on Transact-SQL, it contains many elements that are unique to the world of data mining. As with MDX, very few working .NET developers are proficient in DMX. However, the need for DMX in BI solutions is relatively small because the SSAS data mining designer in BIDS provides tools and wizards that automatically generate DMX when you create data mining structures. Depending on the scope of your solution, retaining developers who have at least a basic knowledge of DMX might be an important consideration in planning a BI project that includes a large amount of data mining. A simple DMX query is shown in Figure 1-14.
Figure 1-14 Sample DMX query
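To give you a flavor of the syntax, here is a minimal singleton prediction query. The mining model name and column names are assumptions based on the Targeted Mailing sample structure described in Chapter 2, so adjust them to match your own mining models:

SELECT
    Predict([Bike Buyer]) AS [Predicted Bike Buyer]
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age], 80000 AS [Yearly Income]) AS t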
XMLA XML for Analysis (XMLA) is used to perform administrative tasks in Analysis Services. It is an XML dialect. Examples of XMLA tasks include viewing metadata, copying databases, backing up databases, and so on. As with MDX and DMX, this language is officially an open standard, and some vendors outside of Microsoft have chosen to adopt parts of it in their BI products. Again, the reality is that very few developers are proficient in XMLA. However, you will seldom author any XMLA from scratch; rather, you'll use the tools and wizards inside SQL Server 2008 to generate this metadata. In SSMS, when connected to SSAS, you can right-click on any SSAS object and generate XMLA scripts using the graphical user interface (GUI). XMLA is also used to define SSAS OLAP cubes and data mining structures.
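As a brief illustration, the following is a minimal XMLA Backup command of the kind that SSMS can generate for you; the database ID and backup file name are illustrative:

<Backup xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>Adventure Works DW 2008</DatabaseID>
  </Object>
  <File>Adventure Works DW 2008.abf</File>
</Backup>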
RDL RDL, or the Report Definition Language, is another XML dialect that is used to create Reporting Services reports. As with the other BI languages, RDL is officially an open standard,
and some vendors outside of Microsoft have chosen to adopt parts of it in their BI products. You rarely need to manually write RDL in a BI solution because it is generated for you automatically when you design a report using the visual tools in BIDS. We’ll review core techniques as well as best practices for working with RDL in future chapters.
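For reference, here is an abridged sketch of what the markup looks like; a report designed in BIDS contains many more elements (data sources, datasets, page settings, and so on), and the text box content shown is illustrative:

<?xml version="1.0" encoding="utf-8"?>
<Report xmlns="http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition">
  <Body>
    <ReportItems>
      <Textbox Name="ReportTitle">
        <Paragraphs>
          <Paragraph>
            <TextRuns>
              <TextRun>
                <Value>Internet Sales Summary</Value>
              </TextRun>
            </TextRuns>
          </Paragraph>
        </Paragraphs>
      </Textbox>
    </ReportItems>
    <Height>2in</Height>
  </Body>
  <Width>6.5in</Width>
  <Page />
</Report>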
Summary In this chapter, we covered basic data warehousing terms and concepts, including BI, OLTP, OLAP, dimensions, and facts (or measures). We defined each term so that you can better understand the possibilities you should consider when planning a BI solution for your company. We then introduced the BI tools included with SQL Server 2008: SSAS, SSIS, SSRS, and the Data Mining Add-ins. For each of these BI tools, we described which parts of a BI solution's functionality that tool can provide. Next, we discussed other Microsoft products that are designed to be integrated with BI solutions built using SSAS OLAP cubes or data mining structures. These included Excel, Word, and Office SharePoint Server 2007. We also touched on the integration of SSAS into PerformancePoint Server. We concluded our conceptual discussion with a list and description of the languages involved in BI projects. In Chapter 2, we work with AdventureWorksDW2008, the sample database for SQL Server 2008 (available as a separate download), so that you get a quick prototype SSAS OLAP cube and data mining structure up and running. This is a great way to begin turning the conceptual knowledge you've gained from reading this chapter into practical understanding.
Chapter 2
Visualizing Business Intelligence Results As you learn more about the business intelligence (BI) tools available in Microsoft SQL Server 2008 and other Microsoft products that integrate BI capabilities (such as Microsoft Office Excel 2007), you can begin to match features available in these products to current challenges your organization faces in its BI implementation and in future BI solutions you’re planning. In this chapter, we summarize the most common business challenges and typical BI solutions using components available in SQL Server 2008 or BI-enabled Microsoft products. We also preview the most common way we help clients visualize the results of their BI projects. If you’re new to business intelligence and are still digesting the components and concepts covered in Chapter 1, “Business Intelligence Basics,” you should definitely read this chapter. If you have a good foundation of business intelligence knowledge but are wondering how to best translate your technical and conceptual knowledge into language that business decision makers can understand so that they will support your project vision, you should also find value in this chapter.
Matching Business Cases to BI Solutions Now that you've begun to think about the broad types of business challenges that BI solutions can address, the next step is to understand specific business case scenarios that can benefit from SQL Server 2008 BI solutions. A good starting place is http://microsoft.com/casestudies. There you can find numerous Microsoft case studies from industries as varied as health care, finance, education, manufacturing, and so on. As you are furthering your conceptual understanding of Microsoft's BI tools, we suggest that you search for and carefully review any case studies that are available for your particular field or business model. Although this approach might seem nontechnical, even fluffy, to some BI developers and architects, we find it to be a beneficial step in our own real-world projects. We've also found that business decision makers are able to better understand the possibilities of a BI solution based on SQL Server 2008 after they have reviewed vertically aligned case studies. Finding a specific reference case that aligns with your business goals can save you and your developers a lot of time and effort. For example, we've architected some BI solutions with SQL Server 2005 and 2008 in the health care industry, and we've used the case study for Clalit Health Care HMO (Israel) several
times with health care clients. This case study is found at http://www.microsoft.com/industry/healthcare/casestudylibrary.mspx?casestudyid=4000002453. A good place to start when digesting case studies is with the largest case study that Microsoft has published. It's called Project Real and is based on the implementation of the complete SQL Server 2005 BI toolset at Barnes and Noble Bookstores. Microsoft is currently in the process of documenting the upgrade to SQL Server 2008 BI in the public case study documentation. This case study is a typical retail BI study in that it includes standard retail metrics, such as inventory management and turnover, as well as standard profit and loss metrics. It also covers commonly used retail calendars, such as standard, fiscal, and 4-5-4. You can find the case study and more information about it at http://www.microsoft.com/technet/prodtechnol/sql/2005/projreal.mspx. This case study is also interesting because Microsoft has worked with Barnes and Noble on this enterprise-level BI project through three releases of SQL Server (2000, 2005, and 2008). There is a case study for each version. It's also interesting because the scale of the project is huge—in the multiterabyte range. Microsoft has also written a number of drill-down white papers based on this particular case study. We've used information from these white papers to help us plan the physical and logical storage for customers that we've worked with who had cubes in the 50-GB or greater range. Also, several teams from Microsoft Consulting Services have developed a BI software life cycle using the Barnes and Noble Bookstores example. We have used the life cycle approach described in the Project Real case study in several of our own projects and have found it to be quite useful. We've found it so useful, in fact, that we'll detail in Chapter 3, "Building Effective Business Intelligence Processes," exactly how we've implemented this life cycle approach throughout the phases of a BI project. Another good case study to review is the one describing the use of the BI features of SQL Server 2008 within Microsoft internally. This case study can be found at http://www.microsoft.com/casestudies/casestudy.aspx?casestudyid=4000001180. In addition to providing a proof-of-concept model for a very large data warehouse (which is estimated to grow to 10 TB in the first year of implementation), this case study also shows how new features—such as backup compression, change data capture, and more—were important for an implementation of this size to be effective. Your goal in reviewing these reference implementations is to begin to think about the scope of the first version of your BI project implementation. By scope, we mean not only the types of data and the total amount of data you'll include in your project, but also which services and which features of those services you'll use.
We'll also briefly describe some real-world projects we've been involved with so that you can get a sense of the ways to apply BI. Our experience ranges from working with very small organizations to working with medium-sized organizations. We've noticed that the cost to get started with BI using SQL Server 2008 can be relatively modest (using the Standard edition of SQL Server 2008) for smaller organizations jumping into the BI pool. We've also worked with public (that is, government) and private sector businesses, as well as with nonprofit organizations. In the nonprofit arena, we've built cubes for organizations that want to gain insight into their contributor bases and the effectiveness of their marketing campaigns. A trend we've seen emerging over the past couple of years is the need to use BI for businesses that operate in multiple countries, and we've worked with one project where we had to localize both metadata (cube, dimension, and so on) names and, importantly, measures (the most complicated of which was currency). Of course, depending on the project, localization of measures can also include converting U.S. standard measures to metric. We worked with a manufacturing company whose initial implementation involved U.S. measurement standards only; however, the company asked that our plan include the capability to localize measures for cubes in its next version. Another interesting area in which we've done work is in the law enforcement field. Although we're not at liberty to name specific clients, we've done work with local police to create more efficient, usable offender information for officers in the field. We've also done work at the state and federal government levels, again assisting clients to develop more efficient access to information about criminal offenders or potential offenders. In somewhat similar projects, we worked with social services organizations to assist them in identifying ineffective or even abusive foster care providers. Sales and marketing is, of course, a mainstay of BI. We've done quite a bit of work with clients who were interested in customer profiling for the purpose of increasing revenues. Interestingly, because of our broad experience in BI, we've sometimes helped our bricks-and-mortar clients to understand that BI can also be used to improve operational efficiencies, which can include operational elements such as time-to-process, time-to-ship, loss prevention information, and much more. Our point in sharing our range of experience with you is to make clear to you that BI truly is for every type of business. Because BI is so powerful and so broad, explaining it in a way that stakeholders can understand is critical. Another tool that we've evolved over time is what we call our Top 10 BI Scoping Questions. Before we move into the visualization preview, we'll list these questions to get you thinking about which questions you'll want to have answered at the beginning of your project.
Top 10 BI Scoping Questions Without any further fanfare, here is the Top 10 BI Scoping Questions list:
1. What are our current pain points with regard to reporting? That is, is the process slow, is data missing, is the process too rigid, and so on?
2. Which data sources are we currently unable to get information from?
3. Who in our organization needs access to which data?
4. What type of growth—that is, merger, new business, and so on—will affect our reporting needs over the next 12 months?
5. What client tools are currently being used for reporting? How effective are these tools?
6. How could our forecasting process be improved? For example, do we need more information, more flexibility, or more people having access?
7. What information do we have that doesn't seem to be used at all?
8. Which end-user groups currently have no access to key datasets or limited access to them?
9. How satisfied are we with our ability to execute "What if" and other types of forecasting scenarios?
10. Are we using our data proactively?
We're aware that we haven't provided you with enough information yet to design and build solutions to answer these questions. The foundation of all great BI solutions is formed by asking the right questions at the beginning of the project so that you can build the solution needed by your organization. Because we're laying much of the groundwork for your project in this chapter, we'll mention here the question you'll invariably face early in your project cycle: "If BI is so great, why isn't everybody using it?" We specifically addressed this question in the introduction to this book. If you're at a loss to answer it, you might want to review our more complete answer there. Here's the short answer: "BI was too difficult, complicated, and expensive to be practical for all but the largest businesses prior to Microsoft's offerings, starting with SQL Server 2005." We have one last section to present before we take a closer look at BI visualization. It introduces some of the pieces we'll be exploring in this book. It can be quite tricky to correctly position (or align) BI capabilities into an organization's existing IT infrastructure at the beginning of a project, particularly when Microsoft's BI is new to a company. Although it's true BI is a dream to use after it has been set up correctly, it's also true that there are many parts that
you, the developer, must understand and set up correctly to get the desired result. This can result in the classic dilemma of overpromising and underdelivering. Although we think it’s critical for you to present BI to your business decision makers (BDMs) in terms of end-user visualizations so that those BDMs can understand what they are getting, it’s equally important for you to understand just what you’re getting yourself into! To that end, we’ve included the next section to briefly show you the pieces of a BI solution.
Components of BI Solutions Figure 2-1 shows that a BI solution built on all possible components available in SQL Server 2008 can have many pieces. We’ll spend the rest of this book examining the architectural components shown in Figure 2-1. In addition to introducing each major component and discussing how the components relate to each other, in Part I, “Business Intelligence for Business Decision Makers and Architects,” we discuss the design goals of the BI platform and how its architecture realizes these goals. We do this in the context of the major component parts—that is, SQL Server Analysis Services (SSAS), SQL Server Integration Services (SSIS), and SQL Server Reporting Services (SSRS). Each component has an entire section of the book dedicated to its implementation details. To combat the complexity of implementing solutions built using BI concepts (OLAP cubes or data mining models), Microsoft has placed great emphasis on usability for BI developers and administrators in SQL Server 2008. Although SSAS, SSIS, and SSRS already included a large number of tools and wizards to facilitate intelligent and quick implementation of each in SQL Server 2005, nearly all of these tools and wizards have been reviewed, reworked, and just generally improved in SQL Server 2008. In addition, there are new tools and utilities to improve usability for developers. One important example is the Analysis Management Objects (AMO) warnings. This feature (shown in Figure 2-2) displays warnings in the Business Intelligence Development Studio (BIDS) Error List window when developers implement OLAP cube designs that are contrary to best practices. The tool is configurable so that you can turn off warning types that you want to ignore. As we continue on our journey of understanding BI, we’ll not yet get into an architectural discussion of the BI suite; rather, we’ll use an approach that we’ve had quite a bit of success with in the real world. That is, we’ll examine the results of a BI solution through the eyes of a typical end user. You might be surprised by this approach, thinking, “Hey, wait. I’m a developer!” Bear with us on this one; seeing BI from an end-user perspective is the most effective way for anyone involved in a BI project to learn it.
Figure 2-1 BI component architecture from SQL Server Books Online
Figure 2-2 AMO warnings for an OLAP cube in BIDS
What we mean by this is that whether you’re a developer who is new to BI, or even one who has had some experience with BI, we find over and over that the more you understand about the end-user perspective, the better you can propose and scope your envisioned BI project, and ultimately gain appropriate levels of management support for it. Also, like it or not, part of your work in a BI project will be to teach BDMs, stakeholders, and other end users just what BI can do for them. We have found this need to be universal to every BI project we have implemented to date. Because of the importance of understanding current visualization capabilities in Microsoft’s BI products, we’re going to devote the remainder of this chapter to this topic. If you’re looking for a more detailed explanation of exactly how you’ll implement these (and other) client solutions, hold tight. We devote an entire section of this book to that topic. First you have to see the BI world from the end-user’s perspective so that you can serve as an effective “translator” at the beginning of your project. Tip For each BI component you want to include in your particular BI project, select or design at least one method for envisioning the resulting information for the particular type of end-user group involved—that is, analysts, executives, and so on. Components include OLAP cubes, data mining structures, and more.
Understanding Business Intelligence from a User's Perspective As we mentioned already, visualizing the results of BI solutions is one of the more difficult aspects of designing and building BI solutions. One of the reasons for this is that we really don't have the ability to "see" multidimensional structures (such as OLAP cubes) that we build with SSAS. Figure 2-3 is somewhat helpful because it shows data arranged in a cube. However, understand that this visualization reflects only a small subset of what is possible. The figure shows an OLAP cube with only three dimensions (or aspects) and only one measure (or fact). That is, this cube provides information about the following question: "How many packages were shipped via what route, to which location, at what time?" Real-world cubes are far more complex than this. They often contain dozens or hundreds of dimensions and facts. This is why OLAP cubes are called n-dimensional structures. Also, OLAP cubes are not really cube-shaped because they include the efficiency of allocating no storage space for null intersection points. That is, if there is no value at a certain intersection of dimensional values—for example, no packages were shipped to Africa via air on June 3, 1999—no storage space is needed for that intersection. This, in effect, condenses cubes to more of a blob-like structure. Another way to think about it is to understand that if data from a relational source is stored in the native cube storage format (called MOLAP), it is condensed to about one-third of the original space. We'll talk more about storage in Chapter 9, "Processing Cubes and Dimensions." So how do you visualize an n-dimensional structure? Currently, most viewers implement some form of a pivot-table-like interface. We'll show you a sample of what you get in Excel shortly. Note that although a pivot-table-like interface might suffice for your project, both Microsoft and independent software vendors (ISVs) are putting a lot of effort into improving BI visualization. Although there are some enhanced visualization components already available in the market, we do expect major improvements in this critical area over the next two to three years. An example of a commercial component is CubeSlice. As with most, but not all, commercial components, CubeSlice can be integrated into Excel. If your BI project involves huge, complex cubes, it will be particularly important for you to continue to monitor the market to take advantage of newly released solutions for BI visualization. One area we pay particular attention to is the work that is publicly available from Microsoft Research (MSR). MSR includes one dedicated group—Visualization and Interaction for Business and Entertainment (VIBE)—whose purpose is to invent or improve data visualization tools in general. This group sometimes releases its viewers for public consumption on its Web site at http://research.microsoft.com/vibe. Another source from which we get insight into the future of BI data visualization is the annual TED (Technology, Entertainment, Design) conference. One particular talk we recommend is that of Hans Rosling, which shows global health information in an interesting viewer. You can watch his speech at http://www.ted.com/index.php/talks/hans_rosling_reveals_new_insights_on_poverty.html.
Figure 2-3 Conceptual picture of an OLAP cube (from SQL Server Books Online)
As with OLAP cubes, data mining structures are difficult to visualize. It's so difficult to visualize these structures that SQL Server Books Online provides no conceptual picture. Data mining structures also have a "shape" that is not relational. We'll take a closer look at data mining mechanics (including storage) in Chapter 12, "Understanding Data Mining Structures," and Chapter 13, "Implementing Data Mining Structures." As we mentioned, BIDS itself provides viewers for both OLAP cubes and data mining structures. You might wonder why these viewers are provided when, after all, only you, the developer, will look at them. There are a couple of reasons for this. The first reason is to help you, the developer, reduce the visualization problems associated with building both types of structures. Figure 2-4 provides an example of one of these visualizers, which shows the results of viewing a sample data mining model (part of a structure) using one of the built-in tools in BIDS. What you see in the figure is the result of a data mining algorithm that performs time series forecasting—that is, it predicts what quantity of something (in this case, a bicycle) will be sold over a period of time. The result comes from applying the time series algorithm to the existing source data and then generating a prediction for a specified number of future time periods.
Figure 2-4 A visualizer in BIDS for a data mining time series algorithm
The second reason BIDS provides these visualizers is because you, the developer, can directly use them in some client implementations. As you’ll see shortly, the viewers included in BIDS and SSMS are nearly identical to those included in Excel 2007. These viewers are also available as embeddable controls for custom Windows Forms applications. Another way to deepen your understanding of BI in SQL Server 2008 is for you to load the sample applications and to review all the built-in visualizers for both OLAP cubes and data mining structures that are included with BIDS. We’re going to take this concept a bit further in the next section, showing you what to look at in the samples in the most commonly used end-user interface for SSAS, Excel 2007.
Demonstrating the Power of BI Using Excel 2007 As you learn about the power and potential of BI, a logical early step is to install the samples and then take a look at what is shown in BIDS. Following that, we recommend that you
connect to both OLAP cubes and data mining models using Excel 2007. The reason you should view things in this order is to get some sense of what client tools can or should look like for your particular project. We’re not saying that all Microsoft BI solutions must use Excel as a client. In our experience, most solutions will use Excel. However, that might not be appropriate for your particular business needs. We always find using Excel to support quick prototyping and scope checking to be valuable. Even if you elect to write your own client application—that is, by using Windows Forms or Web Forms—you should familiarize yourself with the Excel viewers, because most of these controls can be used as embeddable controls in your custom applications. We’ll focus on getting this sample set up as quickly as possible in the next section. To that end, we will not be providing a detailed explanation of why you’re performing particular steps—that will come later in the book. Next, we’ll give you the steps to get the included samples up and running. At this point, we’re going to focus simply on clicks—that is, we’ll limit the explanation to “Click here to do this” and similar phrasing. The remaining chapters explain in detail what all this clicking actually does and why you click where you’re clicking.
Building the First Sample—Using AdventureWorksDW2008 To use the SQL Server 2008 AdventureWorksDW2008 sample database as the basis for building an SSAS OLAP cube, you need to have at least one machine with SQL Server 2008 and SSAS installed on it. While installing these applications, make note of the edition of SQL Server you're using (you can use the Developer, Standard, or Enterprise edition) because you'll need to know the particular edition when you install the sample cube files. All screens and directions in this chapter apply to the Enterprise (or Developer) edition of SQL Server 2008. The Enterprise and Developer editions have equivalent features. As we dig deeper into each of the component parts of data warehousing (that is, SSAS, SSIS, and SSRS), we'll discuss feature differences by edition (that is, Standard, Enterprise, and so on). Tip In the real world, we've often set up a test configuration using Virtual PC. A key reason we chose to use Virtual PC is for its handy Undo feature. We find being able to demonstrate a test configuration for a client and then roll back to a clean install state by simply closing Virtual PC and selecting Do Not Save Changes is very useful and saves time. Virtual PC is a free download from this URL: http://www.microsoft.com/downloads/details.aspx?familyid=04D26402319948A3AFA22DC0B40A73B6&displaylang=en. The Do Not Save Changes feature requires that you install the virtual machine additions, which are part of the free download but are not installed by default. If you're installing SQL Server, remember that the sample databases are not installed by default. A quick way to get the latest samples is to download (and install) them from the
CodePlex Web site. Be sure to download the samples for the correct version and edition of SQL Server. What is CodePlex? Located at http://www.codeplex.com, the CodePlex Web site is a code repository for many types of projects. The SQL Server samples are one of thousands of open-source projects that Microsoft hosts via this site. The site itself uses a Team Foundation Server to store and manage project source code. The site includes a Web-based user interface. Use of CodePlex is free, for both downloading and posting open-source code. Note that you'll need two .msi files for building the sample SSAS project, which can be downloaded from http://www.codeplex.com/MSFTDBProdSamples/Release/ProjectReleases.aspx?ReleaseId=16040. You download two files for whatever hardware you're working with (that is, x86, IA64, or x64 installations). The names of the files are SQL2008.AdventureWorks DW BI v2008.xNN.msi and SQL2008.AdventureWorks All DB Scripts.xNN.msi. (Replace the letters NN with the type of hardware you're using; that is, replace it with 86 if you're using x86, with 64 if you're using x64, and so on.) After you install these two files, you'll get a folder that contains the files for the AdventureWorks Analysis Services Project. The sample files for SSAS are located at C:\Program Files\Microsoft SQL Server\100\Tools\Samples\AdventureWorks 2008 Analysis Services Project\Enterprise.
The file you’ll use is the Visual Studio solution file called Adventure Works.sln. Before you double-click it to open BIDS, you must perform one more setup step. The sample database you’ll use for building the sample project is AdventureWorksDW2008. You’ll use AdventureWorksDW2008, rather than AdventureWorks2008, as the source database for your first SSAS OLAP cube because it’s modeled in a way that’s most conducive to creating cubes easily. In Chapter 5, “Logical OLAP Design Concepts for Architects,” and Chapter 6, “Understanding SSAS in SSMS and SQL Server Profiler,” we’ll discuss in detail what best-practice modeling for SSAS cubes consists of and how you can apply these modeling techniques to your own data. To set up AdventureWorksDW2008, double-click on this .msi file: SQL2008.AdventureWorks All DB Scripts.xNN.msi. This unpacks the scripts for both the OLTP sample database called AdventureWorks and the dimensionally modeled sample database called AdventureWorksDW to your computer. After this has succeeded, run BuildAdventureWorks.cmd (located in C:\Program Files\Microsoft SQL Server\100\Tools\Samples) from the command prompt to restore the databases. This will run a series of scripts through SSMS.
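For example, at a command prompt the steps after running the .msi might look like the following; the path shown is the default install location mentioned above, so adjust it if you unpacked the scripts elsewhere:

cd "C:\Program Files\Microsoft SQL Server\100\Tools\Samples"
BuildAdventureWorks.cmd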
Note You can optionally choose to install another sample database, called AdventureWorksLT, for other testing purposes. This requires a separate download from CodePlex. AdventureWorksLT is a lightweight (or simple) OLTP database and is generally not used for data warehouse testing and modeling. Verify that AdventureWorksDW2008 has installed correctly by opening SQL Server Management Studio and connecting to the SQL Server instance where you've installed it. You'll see this database listed in the tree view list of objects, called Object Explorer, as shown in Figure 2-5. You might notice that the table names have a particular naming convention—most names start with the prefixes "Dim" or "Fact." If you guessed that these names have to do with dimensions and facts, you'd be correct! We'll explore details of OLAP data source modeling such as these later in the book.
Figure 2-5 AdventureWorksDW2008 in SSMS
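If you prefer to verify the installation from a query window rather than Object Explorer, a quick Transact-SQL check such as the following sketch (which assumes the default sample database name) lists the dimension and fact tables just described:

USE AdventureWorksDW2008;
GO
-- List tables that follow the Dim/Fact naming convention
SELECT name
FROM sys.tables
WHERE name LIKE N'Dim%' OR name LIKE N'Fact%'
ORDER BY name;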
Opening the Sample in BIDS Now you’re ready to open the sample in BIDS. As mentioned previously, at this point in our discussion we are not looking to explain the BIDS interface. An entire section of this book is devoted to that. For now, we want to simply open the sample, process (or build) the OLAP cube and data mining structures, and then connect to what we’ve built using the included tools in Excel 2007.
The steps to do this are remarkably quick and easy. Simply navigate to the sample folder location listed earlier in this section, and then double-click on the Visual Studio solution file called Adventure Works.sln. This opens BIDS and loads the OLAP cube and data mining structure metadata into the Solution Explorer window in BIDS. You’ll see a number of files in this window. These files show the objects included in the sample. The most important of these files are two sample OLAP cubes and five sample data mining structures, which are shown in Figure 2-6. The screen shot shows the sample named Adventure Works DW 2008, which was the name of the sample for all beta or community technology preview (CTP) versions of SQL Server 2008. The final, or release-to-manufacturing (RTM), sample name is Adventure Works 2008. We’ve updated screen shots to reflect the final sample in this book. However, where there were no changes between the CTP and RTM samples, we used the original screen shots, so you can disregard the project name discrepancies.
Figure 2-6 Cubes and data mining structures in BIDS
Note You might be wondering whether BIDS is the same thing as Visual Studio and, if so, does that mean that data warehouse development requires an installation of full-blown Visual Studio on each developer's machine? The answer is no—BIDS is a subset of Visual Studio. If Visual Studio is already installed, BIDS installs itself as a series of templates into the Visual Studio instance. If Visual Studio is not installed, the mini version of Visual Studio that includes only BIDS functionality is installed onto the developer's computer. After you've opened the project in BIDS, the only other step you need to perform to make these samples available for client applications is to deploy these objects to the SSAS server. Although you can do this in BIDS by simply right-clicking the solution name (Adventure Works DW) in the Solution Explorer window and then choosing Deploy from the menu, there is quite a bit of complexity behind this command. If you expand the Deployment Progress window in BIDS, you'll get a sense of what is happening when you deploy a BIDS project. This window shows detailed information about each step in the build and deploy process for an OLAP cube. Roughly, metadata for the objects is validated against a schema, with dimensions being processed first. Next, cubes and mining models are processed, and then data from the source system (in this case, the AdventureWorksDW database) is loaded into the structures that have been created on SSAS. Tip If you get a "deployment failed" warning in the Deployment Progress window, check your connection strings (known as data sources in BIDS). The sample connects to an instance of SQL Server at localhost with credentials as configured by the default string. You might need to update these connection string values to reflect your testing installation of SSAS. You might also need to update the project settings to point to your instance of SSAS by right-clicking on the project in Solution Explorer and choosing Properties. In the resulting dialog box, choose the Deployment option and verify that the Server option points to the correct instance of SSAS. Now you're ready to validate the sample OLAP cube and data mining structures using the built-in browsers in BIDS. The browser for the OLAP cube looks much like a pivot table. As we mentioned, this browser is included so that you, as a cube developer, can review your work prior to allowing end users to connect to the cube using client BI tools. Most client tools contain some type of pivot table component, so the included browsers in BIDS are useful tools for you. To be able to view the sample cube using the built-in cube browser in BIDS, double-click on the Adventure Works cube object in the Cubes folder of the Solution Explorer window of BIDS. To take a look at the cube using the browser in BIDS, click on the Browser tab on the cube designer surface of BIDS. This opens the OLAP cube browser. Depending on the hardware capabilities of your test machine, this step might take a couple of seconds to complete. The designer surface looks much like a pivot table. To start, select the measures and dimensions you'd like displayed on the designer surface. You do this by selecting them and dragging
them to the appropriate position on the designer surface. The interface guides you, issuing a warning if you drag a type of item to an area that cannot display it. Just to get you started, we’ve included Figure 2-7, which shows you the results of dragging the Internet Average Sales Amount measure from the Internet Sales measure group (folder) to the center of the pivot table design area. Next we dragged the Customer Geography hierarchy from the Customer dimension to the rows axis (left side) position. Finally, we dragged the Product Categories hierarchy from the Product dimension to the columns axis (top side) position. The results are shown in Figure 2-7.
Figure 2-7 BIDS OLAP cube browser
Tip To remove an item from the browser, simply click on the column or row header of the item and then drag it back over the tree listing of available objects. You’ll see the cursor change to an x to indicate that this value will be deleted from the browser. Now that you’ve got this set up, you might want to explore a bit more, dragging and dropping values from the tree listing on the left side of the cube browser design area to see what’s possible. We encourage you to do this. What is happening when you drag and drop each object from the tree view on the left side of BIDS to the design area is that an MDX query is being generated by your activity and is automatically executed against the SSAS OLAP cube. Note also that you can drag row headers to the columns area and vice versa—this is called pivoting the cube. Also, spend some time examining the Filter Expression section at the top of the browser. In case you were wondering, when it comes time for you to look at and understand the MDX queries being generated, there are many ways to view those queries. At this point in our explanation, however, we are not yet ready to look at the query syntax. Note that many meaningful report-type result sets can be generated by simply clicking and dragging, rather than by you or another developer manually writing the query syntax for each query. And this is exactly the point we want to make at this juncture of teaching you about OLAP cubes.
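Although we are not yet explaining the syntax (that comes in Chapter 10), an MDX statement roughly equivalent to the browser layout in Figure 2-7 might look like the following sketch; the measure and hierarchy names come from the Adventure Works sample, and the level names are assumptions you may need to adjust:

SELECT
    [Product].[Product Categories].[Category].MEMBERS ON COLUMNS,
    [Customer].[Customer Geography].[Country].MEMBERS ON ROWS
FROM [Adventure Works]
WHERE ([Measures].[Internet Average Sales Amount])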
Note In case you were wondering whether you can view sample data mining structures in BIDS, the answer is yes. The AdventureWorks samples include data mining structures. Each structure contains one or more data mining models. Each mining model has one or more viewers available in BIDS. Later in this chapter, we’ll talk a bit more about how to see mining model results using the data mining viewers. For now, however, we’ll continue our exploration of OLAP cubes, moving on to the topic of using Excel 2007 as a client tool.
Connecting to the Sample Cube Using Excel 2007 Now that you’ve set up and deployed the sample cubes, it’s important for you to be able to see things from an end user’s perspective. An easy way to do this is by using a PivotTable view in Excel 2007. Open Excel 2007 and set up a connection to an SSAS OLAP cube using the Data tab of the Ribbon. In the Get External Data group, click From Other Sources (shown in Figure 2-8), and then click the associated From Analysis Services button.
Figure 2-8 The Get External Data group on the Data tab of the Excel 2007 Ribbon
After you click the From Analysis Services button, a multistep wizard opens. Enter the name of your SSAS instance, your connection credentials, and the location on the worksheet where you want to place the resulting PivotTable and the data from the OLAP cube. After you click Finish, the wizard closes and the newly designed PivotTable view opens on the Excel workbook page. Note The PivotTable view has been redesigned in Excel 2007. This redesign was based on Microsoft’s observations of users of PivotTable views in previous versions of Excel. The focus in Excel 2007 is on ease of use and automatic discoverability. This design improvement is particularly compelling if you’re considering using an Excel PivotTable view as one of the client interfaces for your BI solution, because usability is a key part of end-user acceptance criteria. Because you’ve already spent some time exploring the sample OLAP cubes in the built-in cube browser included in BIDS, you’ll probably find a great deal of similarity with the items in the PivotTable Field List in Excel. This similarity is by design. To get started, set up the same cube view that you saw earlier in the BIDS browser. To do this, filter the available fields to show only fields related to Internet Sales by selecting that value in the drop-down list in the PivotTable Field List area. Next select the fields of
interest—in this case, the Internet Sales Amount item, Product Categories item (expand the Product group to see this value), and Sales Territory item (expand the Sales Territory group). Note that when you click in the check boxes to make your selections, the measure field (Internet Sales Amount) is automatically added to the Values section of the PivotTable view, the other two items are automatically added to the Row Labels section, and both are placed on the rows axis of the PivotTable view. This is shown in Figure 2-9.
Figure 2-9 PivotTable Field List in Excel 2007
To pivot the view of the data, you have several options. Probably the easiest way to accomplish this is to drag whichever fields you want to pivot between the different areas in the bottom section of the PivotTable Field List. In this case, if you wanted to alter the layout we created earlier using the BIDS cube browser, you could simply drag the Product Categories button from the Row Labels section to the Column Labels section and then that change would be reflected on the PivotTable designer surface. You might also want to create a PivotChart view. Some people simply prefer to get information via graphs or charts rather than by rows and columns of numbers. As you begin to design your BI solution, you must consider the needs of all the different types of users of your solution. To create a PivotChart view, simply click anywhere on the PivotTable view. Then click on the Options tab of the Ribbon, and finally, click on the PivotChart button. Select a chart type in the resulting Insert Chart window, and click OK to insert it. A possible result is shown in Figure 2-10.
Figure 2-10 A PivotChart in Excel 2007
As we mentioned earlier (when discussing using the OLAP sample cube with the built-in viewer in the BIDS developer interface), we encourage you to play around with PivotTable and PivotChart capabilities to view and manipulate SSAS OLAP cube data using Excel 2007. The better you, as a BI solution architect and developer, understand and are able to visualize what is possible with OLAP cubes (and relate that to BDMs and end users), the more effectively you’ll be able to interpret business requirements and translate those requirements into an effective BI solution. The next section of the chapter covers another important area: visualizing the results of data mining structures.
Understanding Data Mining via the Excel Add-ins The next part of your investigation of the included samples is to familiarize yourself with the experience that end users have with the data mining structures that ship as part of the AdventureWorksDW2008 sample. In our real-world experience, we find that less than 5 percent of our BI clients understand the potential of SSAS data mining. We believe that the path to providing the best solutions to clients is for us to first educate you, the developer, about the possibilities and then to arm you with the techniques to teach your BI clients about what is possible. To that end, here’s some basic information to get you started. Each data mining structure includes one or more data mining models. Each data mining model is based on a particular data mining algorithm. Each algorithm performs a particular type of activity on the source data, such as grouping, clustering, predicting, or a combination of these. We’ll cover data mining implementation for you in greater detail in Chapters 12 and 13.
You can start by opening and reviewing the samples in BIDS. BIDS conveniently includes one or more viewers for each data mining model. Note the following five structures:
■■ Targeted Mailing
■■ Market Basket
■■ Sequence Clustering
■■ Forecasting
■■ Customer Mining
Each model is an implementation of one of the data mining algorithms included in SSAS. Figure 2-11 shows the data mining structures included in Solution Explorer in BIDS. Note also that mining models are grouped into a data mining structure (based on a common data source).
Figure 2-11 AdventureWorksDW2008 data mining structures
As you did with the OLAP cube browser in BIDS, you should spend a bit of time using the included mining model viewers in BIDS to look at each of the models. To do this, simply click on the data mining structure you want to explore from the list shown in Solution Explorer. Doing so opens that particular mining structure in the BIDS designer. You then double-click the Mining Model Viewer tab to use the included data mining model viewers. Note that you can often further refine your view by configuring available parameters. Also note that the type of viewer or visualizer changes depending on which algorithm was used to build the data mining model. Realizing this helps you understand the capabilities of the included data mining algorithms. A key difference between Microsoft's implementation of data mining and those of its competitors is that Microsoft's focus is on making data mining accessible to a broader segment of developers than would traditionally use data mining. The authors of this book can be included in this target group. None of us has received formal training in data mining; however, we've all been able to successfully implement data mining in production BI projects because of the approach Microsoft has taken in its tooling in BIDS and Excel 2007.
As with the OLAP cube viewer, the data mining viewers in BIDS are not meant to be used by end users; rather, they are included for developer use. Figure 2-12 shows the Dependency Network view for the Targeted Mailing sample mining structure (looking at the first included mining model, which was built using the Microsoft Decision Trees algorithm). We’ve adjusted the view by dragging the slider underneath the All Links section on the left to the center position so that you’re looking at factors more closely correlated to the targeted value, which in this case is whether or not a customer has purchased a bicycle. We particularly like this view because it presents a complex mathematical algorithm in an effective visual manner—one that we’ve been able to translate to nearly every client we’ve ever worked with, whether they were developers, analysts, BDMs, or other types of end users.
Figure 2-12 BIDS mining model Dependency Network view
Viewing Data Mining Structures Using Excel 2007 To view these sample models using Excel 2007, you must first download and install the SQL Server 2008 Data Mining Add-ins for Office 2007 from http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=102&Id=374. The add-ins include three types of support for viewing data mining models in Excel and Visio 2007. Specifically, after being installed, the add-ins
make modifications to the Ribbons in both Excel and Visio. In Excel, two new tabs are added: Data Mining and Table Tools Analyze. In Visio, one new template is added: Data Mining. After downloading the add-ins, install them by following the steps described in the next paragraph to get the samples up and running. Run SQLServer2008_DMAddin.msi to install the Data Mining Add-ins. After the add-ins have successfully installed, navigate to the Microsoft SQL Server 2008 Data Mining Add-Ins menu and click the Server Configuration Utility item to start the wizard. Here you’ll configure the connection to the Analysis Services instance by listing the name of your SSAS instance (probably localhost) and then specifying whether you’ll allow the creation of temporary mining models. Then specify the name of the database that will hold metadata. When you enable the creation of temporary (or session) mining models, authorized users can create session mining models from within Excel 2007. These models will be available only for that particular user’s session—they won’t be saved onto the SSAS server instance. Click Finish to complete the configuration. Next open the included Excel sample workbook located at C:\Program Files\Microsoft SQL Server 2008 DM Add-Ins. It’s called DMAddins_SampleData.xlsx. After successfully installing the add-ins, you’ll notice that a Data Mining tab has been added to the Excel Ribbon, as shown in Figure 2-13.
Figure 2-13 The Data Mining tab on the Excel 2007 Ribbon
As we did when discussing the OLAP cubes using Excel 2007's PivotTable view, we encourage you to explore the data mining samples in depth. Let's review what you're looking at here. For starters, if you click the sample workbook page called Associate, you can quickly and easily build a temporary mining model to give you an idea of what the end user's experience could be like. To do this, click any cell that contains data on the Associate page, and then click the Associate button in the Data Modeling group shown in Figure 2-14. This launches the Association Wizard. If you just accept all the defaults on the wizard, you'll build a permanent data mining model using one of the included Microsoft data mining algorithms. A permanent model is one that is physically created on the SSAS server instance. You also have the option to create temporary mining models, if you selected that option during setup. Temporary mining models are created in the Excel session only.

So what's the best way to see exactly what you've done? And, more importantly, how do you learn what can be done? We find that the best way is to use the included mining model viewers, which (as we mentioned) mirror the views that are included in BIDS. Figure 2-15 shows one of the views for this particular model—the Dependency Network for Microsoft
Association. Did you notice the one small difference between this view in BIDS and in Excel? In Excel, the view includes a small Copy To Excel button in its lower left corner.
Figure 2-14 Associate button in the Data Modeling group
Figure 2-15 Dependency Network view for Microsoft Association in Excel 2007
If you find yourself fascinated with the possibilities of the Data Mining tab on the Excel Ribbon, you're in good company! Exploring this functionality is not only fun, it's also a great way for you as a BI developer to really understand what data mining can do for your company. We devote two entire chapters (Chapter 23, "Using Microsoft Excel 2007 as an OLAP Cube Client," and Chapter 24, "Microsoft Office 2007 as a Data Mining Client") to covering in detail all the functionality available when using Excel as both an OLAP cube and data mining model client.
Building a Sample with Your Own Data

A common result after running through the initial exercise of viewing the included samples is for you to immediately want to get started building a prototype cube or mining model with your own data. This is a good thing! Very shortly (in Chapters 5 and 6), we'll explore how to build physical and logical cubes and mining models so that you can do just that. You'll be pleasantly surprised at how quickly and easily you can build prototypes—SSAS has many features included to support this type of building.

Remember, however, that quick prototype building is just that—prototyping. We've seen quite a few quick prototypes deployed to production—with very bad results. We spend quite a bit of time in later chapters differentiating between quick modeling and building production structures so that you can avoid the costly error of treating prototypes as production-ready models. You might be able to build a quick cube or mining model in a day or less, depending on the complexity and quality of your source data. It's quite a bit of fun to do this and then show off what would be possible if your organization were to envision, design, build, test, and deploy production cubes and mining models.

Now that we've taken a look at what an end user sees, let's return to our discussion of why you'd want to use SQL Server 2008 BI solutions. Specifically, we'll look at component implementation considerations and return on investment (ROI) considerations. Then we'll finish with a summary that we've found useful when we've made the pitch for BI to clients.
Elements of a Complete BI Solution

In the world of BI solutions, a complete solution consists of much more than the OLAP cubes and data mining structures built using SSAS. We would go so far as to say that to the ultimate consumers of the BI project—the various end-user communities—a well-designed solution should have data store sources that are nearly invisible. This allows those users to work in a familiar and natural way with their enterprise data. This means, of course, that the selection of client tools, or reporting tools, is absolutely critical to the successful adoption of the results of your BI project.
There are other considerations as well, such as data load, preparation, and more. Let’s talk a bit more about reporting first.
Reporting—Deciding Who Will Use the Solution

A key aspect of thinking about end-user client tools (or reporting interfaces) for SSAS OLAP cubes and data mining structures is to review and determine what types of audiences you propose to include in your solution. For example, you might select a more sophisticated reporting tool for a dedicated segment of your staff, such as financial analysts, while you might choose a simpler interface for another segment, such as help desk analysts. We've found that if a dedicated BI solution is entirely new to your enterprise, it's important to focus on simplicity and appropriateness for particular end-user audiences.

Tip We've had good success profiling our end-user types during the early phases of our project. To do this, we interview representative users from each target group, as well as their supervisors. We also take a look at what type of tools these users work with already so that we can get a sense of the type of environment they're comfortable working in. We document these end-user types and have subject matter experts validate our findings. We then propose reporting solutions that are tailored to each end-user group type—that is, we implement an Excel PivotTable view for district managers, implement Microsoft Office SharePoint Server 2007 dashboards for regional directors, and so on.

Because of the importance of developing appropriate client interfaces, we've devoted an entire section of this book to a comprehensive discussion of reporting clients. This discussion includes looking at using Excel, Office SharePoint Server 2007, and PerformancePoint Server. It also includes examining the concerns about implementing or developing custom clients, such as Windows Forms or Web Forms applications. It's our experience that most enterprise BI solutions use a selection of reporting interfaces. Usually, there are two or more different types of interfaces. We've even worked with some clients for which there was a business need for more than five different types of client reporting interfaces.

Just to give you a taste of what is to come in our discussion about client reporting interfaces, we've included an architectural diagram from SQL Server Books Online that details connecting to an OLAP cube or data mining structure via a browser (a thin-client scenario). This is just one of many possible scenarios. Notice in Figure 2-16 that there are many choices for how to establish this type of connection. This diagram is meant to get you thinking about the many possibilities of client reporting interfaces for BI solutions.
Figure 2-16 Client architecture for IIS from SQL Server Books Online (the diagram shows browsers and other thin clients connecting through Internet Information Services—ASP and ASP.NET pages—to Win32, COM-based, and .NET client applications for OLAP and/or data mining, which use ADO MD, OLE DB for OLAP, or ADO MD.NET and, like any application for OLAP and data mining, communicate with the instance of SQL Server 2008 Analysis Services via XMLA over TCP/IP)
ETL—Getting the Solution Implemented

One often underestimated concern in a BI project is the effort needed to consolidate, validate, and clean all the source data that will be used in the SSAS OLAP cubes and data mining structures. Of course, it's not an absolute requirement to use SSIS for ETL (extract, transform, and load) processes. However, in our experience, not only have we used this powerful tool for
100 percent of the solutions we've designed, but we've also used it extensively in those solutions. As mentioned in Chapter 1, often 50 to 75 percent of the time spent on initial project implementation can revolve around the ETL planning and implementation. Because of the importance of SSIS in the initial load phase of your BI project, as well as in the ongoing maintenance of OLAP cubes and data mining structures, we've devoted an entire section of this book to understanding and using SSIS effectively. It has been our experience that a general lack of understanding of SSIS combined with a tendency for database administrators (DBAs) and developers to underestimate the dirtiness (or data quality and complexity) of the source data has led to many a BI project delay.

Tip For many of our projects, we elected to hire one or more SSIS experts to quickly perform the heavy lifting in projects that included many data sources, data that was particularly dirty, or both. We have found this to be money well spent, and it has helped us to deliver projects on time and on budget.

In our entire section (six chapters) on SSIS, we guide you through tool usage, as well as provide you with many best practices and lessons learned from our production experience with this very important and powerful tool.
Data Mining—Don't Leave It Out

As we saw by taking a closer look at the included samples earlier in this chapter, data mining functionality is a core part of Microsoft's BI offering. Including it should be part of the majority of your BI solutions. This might require that you educate the development team—from BDMs to developers. We recommend bringing in outside experts to assist in this educational process for some BI projects. All internal parties will benefit significantly if they have the opportunity to review reference implementations that are verticals (or industry-specific implementations) if at all possible. Microsoft is also working to provide samples to meet this need for reference information; to that end, you should keep your eye on its case study Web site for more case studies that deal with data mining solutions.

We have not seen broad implementation of data mining in production BI solutions built using SQL Server 2005 or 2008. We attribute this to a lack of understanding of the business value of this tool set. Generally, you can think of appropriate use of data mining as being a proactive use of your organization's data—that is, allowing SSAS to discover patterns and trends in your data, and to predict important values for you. You can then act on those predictions in a proactive manner. This use is in contrast to the typical use of OLAP cubes, which is decision support. Another way to understand decision support is to think of cubes as being used to validate a hypothesis, whereas mining structures are used in situations where you have data but have not yet formed a hypothesis to test.
We’ve made a conscious effort to include data mining throughout this book—with subjects such as physical modeling, logical modeling, design, development phases, and so on—rather than to give it the common treatment that most BI books use, which is to devote a chapter to it at the end of the book. We feel that the data mining functionality is a core part of the SSAS offering and exploring (and using) it is part of the majority of BI solutions built using SQL Server 2008.
Common Business Challenges and BI Solutions

Let's summarize business challenges and BI solution strengths and then discuss translating these abilities into ROI for your company. You might want to refer back to the Top 10 Questions List earlier in this chapter, as you now are beginning to get enough information to be able to pull the answers to those questions together with the capabilities of the BI product suite. As you examine business needs and product capabilities, you'll move toward envisioning the scope of your particular implementation. Here are some ways to meet challenges you might face as you envision and develop BI solutions:
■■ Slow-to-execute queries: Use OLAP cubes rather than OLTP (normalized) data sources as sources for reports. OLAP cubes built using SQL Server Analysis Services are optimized for read-only queries and can be 1000 percent faster in returning query results than OLTP databases. This performance improvement comes from the efficiency of the SSAS engine and storage mechanisms. This is a particularly effective solution if a large amount of data aggregation (or consolidation) is required.
■■ General OLTP source system slowdowns: Query against OLAP cubes or data mining models rather than original OLTP databases. This approach greatly reduces locking overhead from OLTP source systems. (OLAP systems don't use locks except during processing.) Also, OLAP cubes remove overhead from OLTP production source systems by moving the reporting mechanism to a different data store and query engine. In other words, the Analysis Services engine is processing reporting queries, so the query processing load on the OLTP source systems is reduced.
■■ Manual query writing: Allow end users to click to query (such as by using click and drag on pivot tables or by using other types of end-user client tools). Providing this functionality eliminates the wait time associated with traditional OLTP reporting. Typically, new or custom queries against OLTP databases require that end users request the particular reports, which then results in developers needing to manually write queries against source OLTP systems. An example is the need to manually write complex Transact-SQL queries if an RDBMS relational data source were being queried. Also, these Transact-SQL queries often need to be manually tuned by developers or administrators because of the processing load they might add to the production data stores. This tuning can involve query rewriting to improve performance and also can involve
index creation, which adds overhead to the source system and can take significant administrative effort to implement and maintain.
■■ Disparate data sources: Combine data into central repositories (OLAP cubes) using ETL packages created with SQL Server Integration Services. These packages can be automated to run on a regular basis. Prior to implementing BI, we've often seen end users, particularly analysts, spending large amounts of time manually combining information. Analysis Services cubes and mining structures used in combination with Integration Services packages can automate these processes.
■■ Invalid or inconsistent report data: Create ETL packages via SSIS to clean and validate data (prior to loading cubes or mining structures). Cubes provide a consistent and unified view of data across the enterprise. As mentioned, we've often noted that a large amount of knowledge workers' time is spent finding and then manually cleansing disparate or abnormal data prior to the implementation of a dedicated BI solution. Inconsistent results can be embarrassing at the least and sometimes even costly to businesses, as these types of issues can result in product or service quality problems. We've even seen legal action taken as a result of incorrect data use in business situations.
■■ Data is not available to all users: BI repositories—OLAP cubes and data mining structures—are designed to be accessed by all business users. Unlike many other vendors' BI products, Microsoft has integrated BI repository support into many of its end-user products. These include Microsoft Office Word, Excel, and Visio 2007; Office SharePoint Server 2007 (via the Report Center template); and many others. This inclusion extends the reach of a BI solution to more users in your business. It's exciting to consider how effective BI project implementation can make more data available to more of your employees.
■■ Too much data: Data mining is particularly suited to addressing this business issue, as the included algorithms automatically find patterns in huge amounts of data. SSAS data mining contains nine types of data mining algorithms that group (or cluster) and (optionally) correlate and predict data values. It's most common to implement data mining when your company has data that is not currently being used for business analysis because of sheer volume, complexity, or both.
■■ Lack of common top-level metrics: Key performance indicators (KPIs) are particularly effective in helping you define the most important metrics for your particular business. SSAS OLAP cubes support the definition and inclusion of KPIs via wizards that generate the MDX code. MDX code can also be written manually to create KPIs (a short query that reads a KPI appears after this list). Across the BI suite of tools and products, the notion of KPIs is supported. This is because it's a common requirement to have a dashboard-style view of the most important business metrics.
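As a minimal sketch of the KPI support just mentioned, the following MDX query reads a KPI's value, goal, and status by using the built-in KpiValue, KpiGoal, and KpiStatus functions. The KPI name (Revenue) and cube name (Adventure Works) are assumptions based on the Adventure Works sample and would be replaced with your own object names.

```mdx
-- Read the value, goal, and status of a KPI in a single query.
-- "Revenue" and [Adventure Works] are assumed names from the Adventure Works
-- sample; substitute the KPI and cube names defined in your own solution.
SELECT
  { KpiValue("Revenue"), KpiGoal("Revenue"), KpiStatus("Revenue") } ON COLUMNS
FROM [Adventure Works]
```

Client tools such as Excel and PerformancePoint Server can surface KPIs through their own interfaces, so end users rarely need to write this kind of query; it is shown here only to make the KPI discussion concrete.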
Now let’s translate these common business problems into a general statement about the capabilities of BI and tie those capabilities to ROI for businesses that implement BI solutions.
Measuring the ROI of BI Solutions

BI is comprehensive and flexible. A single, correctly designed cube can actually contain all of an organization's data, and importantly, this cube will present that data to end users consistently. The ability to be comprehensive is best expressed by the various enhancements in SSAS OLAP cubes related to scalability. These include storage and backup compression, query optimization, and many more. These days, it's common to see multiterabyte cubes in production. This will result in reduced data storage and maintenance costs, as well as more accurate and timely business information. It will also result in improved information worker productivity, as end users spend less time getting the right or needed information.

To better understand the concept of flexibility, think about the Adventure Works sample OLAP cube as displayed using the Excel PivotTable view. One example of flexibility in this sample is that multiple types of measures (both Internet and Reseller Sales) have been combined into one structure. Most dimensions apply to both groups of measures, but not all do. For example, there is no relationship between the Employee dimensions and any of the measures in the Internet Sales group—because there are no employees involved in these types of sales. Cube modeling is now flexible enough to allow you to reflect business reality in a single cube. In previous versions of SSAS, and in other vendors' products, you would've been forced to make compromises—creating multiple cubes or being limited by structural requirements. This lack of flexibility in the past often translated into limitations and complexity in the client tools as well.

The enhanced flexibility in Microsoft BI applications will result in improved ROI from reporting systems being more agile because of the "click to query" model. Rather than requesting a new type of report—sending the report request to an administrator for approval and then to a developer to code the database query and possibly to a database administrator to tune the query—by using OLAP, the end user can instantly perform a drag and drop operation to query or view the information in whatever format is most useful.

BI is accessible (that is, intuitive for all end users to view and manipulate). To better understand this aspect of BI, we suggest that you try demonstrating the pivot table based on the SSAS sample cube to others in your organization. They will usually quickly understand and be quite impressed (some will even get excited!) as they begin to see the potential reach for BI solutions in your company. Pivot table interfaces reflect the way many users think about data—that is, "What are the measures (or numbers), and what attributes (or factors) created these numbers?"
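To make the flexibility point above concrete, here is a minimal MDX query (a sketch only, not something end users would normally write) that places a measure from the Internet Sales measure group next to one from the Reseller Sales measure group in a single query against a single cube. The measure, hierarchy, and cube names are assumptions taken from the Adventure Works 2008 sample and would differ in your own solution.

```mdx
-- Two measures from two different measure groups, broken down by one shared
-- dimension, retrieved from a single cube. All object names are assumed from
-- the Adventure Works 2008 sample and should be replaced with your own.
SELECT
  { [Measures].[Internet Sales Amount],
    [Measures].[Reseller Sales Amount] } ON COLUMNS,
  [Date].[Calendar Year].MEMBERS ON ROWS
FROM [Adventure Works]
```

This is roughly the kind of statement a client such as Excel generates automatically when a user drags those two measures and the Calendar Year hierarchy onto a PivotTable, which is why the "click to query" model requires no hand-written query code from end users.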
Some users might request a simpler interface than a pivot table (that is, a type of canned or prefab report). Microsoft provides client tools—for example, SSRS—that facilitate that type of implementation. It's important for you to balance the benefits of honoring this type of request, which entails manual report writing by you, against the benefits available to end users who can use pivot tables. It has been our experience that most BI solutions include a pivot table training component for end users who haven't worked much with pivot tables. This training results in improved ROI because more information will be useful for more end users, and that will result in better decision making at your business.

BI is fast to query. After the initial setup is done, queries can easily run 1000 percent faster in an OLAP database than in an OLTP database. Your sample won't necessarily demonstrate the speed of the query itself. However, it's helpful to understand that the SSAS server is highly optimized to provide a query experience that is far superior to, say, a typical relational database. It's superior because the SSAS engine itself is designed to quickly fetch or calculate aggregated values. This will result in improved ROI because end users will spend less time waiting for reports to process. We'll dive into the details of this topic in Chapters 20, 21, and 22.

BI is simple to query. End users simply drag items into and around the PivotTable area, and developers write very little query code manually. It's important to understand that SSAS clients (such as Excel) automatically generate MDX queries when users drag and drop dimensions and measures onto the designer surfaces. This is a tremendous advantage compared to traditional OLTP reporting solutions, where Transact-SQL developers must manually write all the queries. This simplicity will result in improved ROI because of the ability to execute dynamic reporting and the increased agility on the part of the end users. They will be better able to ask the right questions of the data in a timely way as market conditions change.

BI provides accurate, near real-time, summarized information. This functionality will improve the quality of business decisions. Also, with some of the new features available in SSAS, particularly proactive caching, cubes can have latency that is only a number of minutes or even seconds. This will result in improved ROI because the usefulness of the results will improve as they are delivered in a more timely way. We'll discuss configuring real-time cubes in Chapter 9. Also, by drilling down into the information, users who need to see the detail—that is, the numbers behind the numbers—can do so. The ability to drill down is, of course, implemented in pivot tables via the simple "+" interface that is available for all (summed) aggregations in the Adventure Works sample cube. This drill-down functionality results in improved ROI by making results more actionable and by enabling users to quickly get the level of detail they need to make decisions and to take action.

BI includes data mining. Data mining allows you to turn huge amounts of information into actionable knowledge by applying the included data mining algorithms. These can group (or cluster) related information together. Some of the algorithms can group and predict one or more values in the data that you're examining.
This will result in improved ROI because end users are presented with patterns, groupings, and predictions that they might not have anticipated, which will enable them to make better decisions faster.

For the many reasons just mentioned, BI solutions built using SQL Server 2008, if implemented intelligently, will result in significant ROI gains for your company. Most companies have all the information they need—the core problem is that the information is not accessible in formats that are useful for the people in those companies to use as a basis for decision making in a timely way. It's really just that straightforward: OLAP and data mining solutions simply give businesses a significant competitive advantage by making more data available to more end users so that those users can make better decisions in a more timely way.

What's so exciting about BI is that Microsoft has made it possible for many companies that couldn't previously afford to implement any type of BI solution to participate in this space. Microsoft has done this by including with SQL Server 2008 all the core BI tools and technologies needed to implement cubes. Although it's possible to implement both SQL Server and SSAS on the same physical server (and this is a common approach for development environments), in production situations we generally see at least one physical server (if not more) dedicated to SSAS. Also, it's important to understand which BI features require the Enterprise edition of SQL Server or SSAS. We'll review feature differences by edition in detail throughout this book.

In addition to broadening BI's reach by including some BI features in both the Standard and Enterprise editions of SQL Server, Microsoft is also providing some much-needed competition at the enterprise level. They have done this by including some extremely powerful BI features in the Enterprise editions of SQL Server and SSAS. We'll talk more about these features as they apply to the specific components—that is, SSAS, SSIS, and SSRS—in the respective sections where we drill down into the implementation details of each of those components.
Summary

In this chapter, we took a closer look at OLAP and data mining concepts by considering the most common business problems that the SQL Server 2008 BI toolset can alleviate and by exploring the included samples contained in the AdventureWorksDW2008 database available on the CodePlex Web site. We next took a look at these samples from an end user's perspective. We did this so that we could experience OLAP cubes and data mining models using the commonly used client tools available in Excel 2007.

There are, of course, many other options in terms of client interfaces. We'll explore many of the clients in Part III, "Microsoft SQL Server 2008 Integration Services for Developers." You should carefully consider the various client interfaces at the beginning
of your BI project. Acceptance and usage depend on the suitability of these tools to the particular end-user group types. We completed our discussion by recapping common ROI areas associated with BI projects. Securing executive stakeholder support (and, of course, funding!) is critical to the success of all BI projects.

In the next chapter, we'll take a look at the softer side of BI projects. This includes practical advice regarding software development life cycle methodologies that we have found to work (and not to work). Also, we'll discuss the composition of your BI project team and the skills needed on the team. We've seen many projects that were delayed or even completely derailed because of a lack of attention to these soft areas, so we'll pass along our tips from the real world in the next chapter.
Chapter 3
Building Effective Business Intelligence Processes

You might be wondering why we're devoting an entire chapter to what is referred to as the softer side of business intelligence (BI) project implementations—the business processes and staffing issues at the heart of any BI solution. Most BI projects are complex, involving many people, numerous processes, and a lot of data. To ignore the potential process and people challenges inherent in BI projects is to risk jeopardizing your entire project. A lack of understanding and planning around these issues can lead to delays, cost overruns, and even project cancellation.

In this chapter, we share lessons we've learned and best practices we follow when dealing with the process and people issues that inevitably crop up in BI projects. We've had many years of real-world experience implementing BI projects and have found that using known and proven processes as you envision, plan, build, stabilize, and deploy your projects reduces their complexity and lessens your overall risk. We start by explaining the standard software development life cycle for business intelligence, including some of the formal models you can use to implement it: Microsoft Solutions Framework (MSF) and MSF for Agile Software Development. We then examine what it takes to build an effective project team: the skills various team members need and the options you have for organizing the team.

Project leaders might need to educate team members so that they understand that these processes aren't recommended just to add bureaucratic overhead to an already complex undertaking. In fact, our guiding principle is "as simple as is practical." We all want to deliver solutions that are per specification, on time, and on budget. The processes we describe in this chapter are the ones we use on every BI project to ensure that we can deliver consistently excellent results.
Software Development Life Cycle for BI Projects

BI projects usually involve building one or more back-end data stores (OLAP cubes, data mining models, or both). These cubes and mining models often have to be designed from scratch, which can be especially difficult when these types of data stores are new to the developers and to the enterprise. Next, appropriate user interfaces must be selected, configured, and sometimes developed. In addition to that, concerns about access levels (security in general), auditing, performance, scalability, and availability further complicate implementation. Also, the process of locating and validating disparate source data and combining it into the new data models is fraught with complexity. Last but not least, for many organizations,
the BI toolset—SQL Server Analysis Services (SSAS), SQL Server Integration Services (SSIS), and SQL Server Reporting Services (SSRS)—is new and has to be learned and mastered. So how can you reduce the complexity of your project and improve the chances it will be delivered per the specification, on time, and on budget? Our answer is simple: follow—or use as a framework—a proven software development life cycle model, such as the Microsoft Solutions Framework or its Agile version.

Note What if you don't want to use MSF? We offer MSF (in both its classic and newer Agile forms) as a sample software development life cycle for BI projects. It is a method that has worked for us, in the sense that the framework proved flexible enough to be useful, yet also structured enough to add value in terms of predictability and, ultimately, the on-time and on-budget delivery of results. If you have a different software development life cycle that you and your team are more comfortable with, by all means, use it. We simply want to emphasize that, given the complexity and scale of most BI projects, we feel it's quite important to use some sort of structured process to improve the outcome of your project.
Microsoft Solutions Framework

Microsoft Solutions Framework (MSF) is a flexible software development life cycle that we've successfully applied to various BI projects. It consists of known phases and milestones, which are characteristic of the waterfall method, but it also has the iterations (or versions) found in a spiral method. The combination of structure and flexibility in the MSF makes it well suited to BI projects. Such projects are usually mission critical, so milestones are expected, but they are often iterative as well because of the scope of changes required as data is discovered and cleaned in the extract, transform, and load (ETL) processes. Figure 3-1 shows the MSF software development life cycle model.

Tip Another aspect of BI projects is the level of knowledge of stakeholders as to what can be built with the BI suite. We often find that as stakeholders understand the applications of BI to their particular business, they will increase the scope of the project. One example of this is the decision to add data mining to an existing BI project based on a demonstration of a pilot project. Another example of iteration is a request for a new OLAP cube—often the second cube we build is for the finance department, with the first one usually being for the sales or marketing department.
As shown in Figure 3-1, MSF has distinct project phases (envision, plan, build, stabilize, and deploy) and project milestones (vision/scope approved, project plans approved, scope complete, release readiness approved, and deployment complete). MSF further advocates for particular roles and responsibilities in the software development life cycle, a topic we cover in detail later in this chapter.
Figure 3-1 Phases and milestones in Microsoft Solutions Framework (the envision, plan, build, stabilize, and deploy phases form a cycle for each release, with the vision/scope approved, project plans approved, scope complete, release readiness approved, and deployment complete milestones marking the transitions between them)
For more information about MSF in general, go to http://www.microsoft.com/technet/solutionaccelerators/msf/default.mspx. This site includes detailed explanations about the MSF process phases, deliverables, roles, and more. It also includes case studies and sample deliverable templates. All of this information is generic by design (one of the strengths of MSF), so it can easily be adapted to any type of project. We'll talk a bit about how we've done just that in the next section.
Microsoft Solutions Framework for Agile Software Development

Microsoft Solutions Framework for Agile Software Development, or MSF version 4, is even more generic and flexible (or agile) than the original version of MSF. From the MSF Agile process guidance, here is the definition:

Microsoft Solutions Framework (MSF) for Agile Software Development is a scenario-driven, context-based, agile software development process for building .NET and other object-oriented applications. MSF for Agile Software Development directly incorporates practices for handling quality of service requirements such as performance and security. It is also context-based and uses a context-driven approach to determine how to operate the project. This approach helps create an adaptive process that overcomes the boundary conditions of most agile software development processes while achieving the objectives set out in the vision of the project.
The scenario-driven (or use-case driven) and context-based approaches of MSF Agile are particularly suited to BI projects for three reasons:
■■ Because of the scoping challenges we mentioned in the preceding section
■■ Because of the need to provide the context-specific (or vertical-specific) approach for the stakeholders so that they can see what BI can do for their particular type of business
■■ Because of their inherent agility
Every BI project we've worked on has been iterative—that is, multiversioned—and the final scope of each version of the project varied significantly from the original specification. The spiral method can be particularly tricky to grasp when you're beginning to work with MSF Agile. This quote from the MSF will help you understand this complex but important concept:

The smooth integration of MSF for Agile Software Development in Visual Studio Team System supports rapid iterative development with continuous learning and refinement. Product definition, development, and testing occur in overlapping iterations resulting in incremental completion of the project. Different iterations have different focus as the project approaches release. Small iterations allow you to reduce the margin of error in your estimates and provide fast feedback about the accuracy of your project plans. Each iteration should result in a stable portion of the overall system.

Figure 3-2 shows how iterations combine with project-specific implementation phases—such as setup, planning, and so on.
Figure 3-2 Cycles and iterations in MSF Agile (iteration 0 covers project setup and planning; iteration 1 through iteration n each repeat a plan, develop and test, and feedback cycle as needed, with the final iteration releasing the product)
We have used a variant of the MSF software development life cycle, either MSF 3.0 (standard) or MSF 4.0 (agile), for every BI project we've implemented. We firmly believe that MSF is a solidly useful set of guidance that leads to more predictable results.

Note MSF Agile is built into Microsoft Visual Studio Team System. If your organization uses Visual Studio Team System, you can use the free MSF Agile templates and guidance available at http://www.microsoft.com/downloads/details.aspx?familyid=EA75784E-3A3F-48FB-824E828BF593C34D&displaylang=en. Using Visual Studio Team System makes it easier to address a number of important project issues—such as code source control, work item assignments, and monitoring—but it isn't required if you choose to use MSF as a software development life cycle methodology for your BI project.
Applying MSF to BI Projects

As you saw in Figure 3-1, MSF has five distinct project phases (envision, plan, build, stabilize, and deploy) and five project milestones (vision/scope approved, project plans approved, scope complete, release readiness approved, and deployment complete). Now let's drill down and apply MSF to specific situations you will encounter in BI projects.

We again want to remind you that the failure to adopt some kind of software development life cycle usually has negative results for BI projects. Although in more classic types of software projects, such as an ASP.NET Web site, you might be able to get away with development on the fly, with BI projects you're asking for trouble if you skip right to building. We've seen BI projects delayed for months and halted and restarted midstream, pitfalls that could have been avoided if a more formal software development life cycle model had been followed. To get started, we'll dive into the specific phases of the MSF as applied to BI projects.
Phases and Deliverables in the Microsoft Solutions Framework

To get started, we'll walk through the phases and deliverables. Our approach refers to both MSF for Capability Maturity Model Integration (CMMI) and MSF for Agile Software Development. Generally, you can assume that the type of MSF software development life cycle you use will vary depending on the type of organization you work in and its culture. MSF can be more formal (that is, it can have phases and iterations plus more public milestones). Another way to define this more formal implementation is as "using the CMMI methodology." Or MSF can be more informal (or agile, meaning fewer formal milestones and more iterations). We'll start by taking a deeper look at the general framework (that is, phases and so on), and then later in this chapter we'll drill into applying this framework to BI projects.
Envisioning

The best guidance we can give related to BI-specific projects is this: at the envisioning phase, your team should be thinking big and broad. For example, we suggest that you automatically
include data mining in every project. Another way to think of this approach is the "If in doubt, leave it in" theory of design. In particular, make sure to include all source data and plan for some type of access to some subset of data for everyone in the company. As we said in Chapter 1, "Business Intelligence Basics," the BI toolset in SQL Server 2008 was designed to support this idea of BI for everyone. We've often seen fallout from the "complex data for the analysts only" mindset, which unnecessarily limits the scope of a BI project. At a later point in your BI project, your team will trim the big and broad ideas of this phase into an appropriate scope for the particular iteration of your project.

During this phase, be sure to include at least one representative (preferably more) from all possible user group types in the envisioning discussions. This group might include executives, analysts, help desk employees, IT staff, front-line workers, and so on. Allotting enough time and resources to accurately discover the current situation is also important. Make sure your team finds out what reports don't exist or aren't used (or are thought impossible) because of current data limitations. Another critical discovery step is to find and document the locations of all possible source data, including relational data stores such as Microsoft SQL Server, Oracle, and DB2, and nonrelational data such as .CSV files from mainframes, XML, Microsoft Office Excel, Microsoft Access, and so on.

Tip When you're hunting for data, be sure to check people's local machines. We've found many a key Excel spreadsheet in a My Documents folder!
After you've gathered all the information, you can begin to prioritize the business problems you're trying to affect with this project and then match those problems to product features that will solve them. (Refer to Chapter 1 and Chapter 2, "Visualizing Business Intelligence Results," for a detailed discussion of matching particular business problems to the BI features of SQL Server 2008.) An example of this is a situation where your company has recently purchased a competitor and you now have the challenge of quickly integrating a huge amount of disparate data. Typical BI solutions for this type of problem include using SSIS extensively to clean, validate, and consolidate the data; creating data mining structures to explore the data for patterns; and sometimes also building an OLAP cube to make sense of the data via pivot table reports against that cube data.

We see quite a few mistakes made in the envisioning phase. Most often they are because of a simple lack of resources allocated for discovery and planning. Avoid the "rush to build (or code)" mentality that plagues so many software projects. With that said, however, we do not mean that you should refrain from building anything during envisioning. Quite the contrary, one of the great strengths of SSAS is that you can easily create quick prototype OLAP cubes and data mining structures. We saw in Chapter 2 that a few clicks are all you need to enable the AdventureWorks samples. In some ways, the developer user interface is almost too easy because inexperienced BI developers can easily create quick data structure prototypes using the built-in GUI wizards and designers, but the
structures they build will be unacceptably inefficient when loaded with real-world quantities of data. Our caution here is twofold:
■■ Developers of OLAP cubes, data mining models, SSIS packages, and SSRS reports can and should create quick prototypes to help explain BI concepts to team members and business decision makers (BDMs).
■■ Developers should understand that these prototypes should not be used in production situations, because during the rapid prototyping process they are not optimized for performance (scalability, availability), usability (securability, ease of use), and so on.
The goal of the envisioning phase is to move the team toward a common vision and scope document. The level of detail and size of this document will vary depending on the size of the organization and the overall complexity of the project. This document will be used to set the tone for the project. It will help to keep the team focused on the key business goals, and it will serve as a basis for agreement with the stakeholders. The document will also be useful as a tool to trace business needs to specific product features in later phases of the software development life cycle process.
Planning

The goal of the planning phase is to create as detailed a design plan as is possible and appropriate for your particular BI project. Also, during this phase the development and test environments should be set up. Activities during this phase include the following:
■■ Selecting a modeling tool (or tools) that works for you and your team: This tool will be used to model the OLAP cubes and mining models. You can use a traditional data modeling tool, such as Visio or ERwin, for this process, or you can use SSAS itself. We'll discuss the latter option in detail in Chapter 5, "Logical OLAP Design Concepts for Architects."
■■ Documenting taxonomies as they are actually used (rather than accepting them as they currently exist): This process is similar to creating a data dictionary. In other words, the result is a "this-means-that" list, with notes and definitions included as appropriate. You can use something as simple as an Excel spreadsheet, or you can use a database or any other tool you're comfortable with. Your taxonomy capture should include the following information:
❏ Natural language of the business: Capture this information by conducting interviews with representatives from the various role groups—that is, executives, analysts, and so on. Your questions should take a format similar to this: "How do you refer to x (that is, your customer, client, and so on) in your organization?"
❏ Data source structure names: This information includes translating database table and column names into a common taxonomy. It also includes translating unstructured metadata, such as XML element or attribute names.
■■ Capturing the current state of all source data: This process includes capturing server names, server locations, IP addresses, and database names. It also includes capturing accessibility windows for source data, which means identifying the times when source data can be extracted on a regular basis. This information should include the frequency, load windows (or windows of time for extracting data), and credentials to be used to access the source data. It can also include a discussion of the security requirements for the data, data obfuscation or encryption, and possibly auditing of data access. A common mistake that you should avoid is focusing too narrowly when completing this step. We've seen many companies simply document a single relational database and insist that was the sum total of the important data, only later to return with data from a large (and disparate) number of sources. These late additions have included Excel workbooks, Access databases, mainframe data, Windows Communication Foundation (WCF) or Web service data, XML data, and more.
Prototypes built during this phase usually include sample OLAP cubes and data mining models, and they also often include report mockups. These prototypes can be hand drawn or mocked up using whatever easy-to-use software you have available. We’ve used Microsoft Office PowerPoint, Excel, Word, and SSRS itself to create quick report mockups. You can also choose to create report mockups using Microsoft Office SharePoint Server 2007 or PerformancePoint Server. The result of the planning phase is a specification document. This document should include designs for the OLAP cubes, data mining models, or both. It should also include some kind of data map to be used in the development of SSIS packages, report environments (that is, SSRS, Office SharePoint Server, a custom application), and report requirements. This document also should contain a list of resources—people, servers, and software. From this resource list, a budget and work schedule can be drawn up. This document also can serve as a contract if the BI team is external to the company.
Building

In the building phase of a BI project, there is quite a bit of complexity to managing the schedule. It's common that multiple developers will be involved because it's rare for a single developer to understand all the technologies involved in BI projects—OLAP, data mining, SSIS, SSRS, SharePoint, and so on. Data structures can be built in one of two ways—either by starting with them empty and then filling them with data from source structures, or by building and loading them at the same time. We've found that either approach will work. The one you choose for your project will depend on the number of people you have working on the project and the dirtiness (that is, the amount of invalid types, lengths, characters, and so on) of the source data.
A common mistake is to underestimate the amount of resources that need to be allocated to cleaning, validating, and consolidating source data for the initial cube and mining structure loads. This work, done during SSIS package creation, often accounts for 50 to 75 percent of the initial project time and cost. Although this might seem like a prohibitive cost, consider the alternative—basing key business decisions on incorrect data!

Figure 3-3 shows the software development life cycle for data mining solutions. You can envision a similar software development life cycle for building OLAP cubes. The process arrows show the iterative nature of preparing and exploring data and then building and validating models. This illustration captures a key challenge in implementing BI solutions—that is, you can't proceed until you know what you have (in the source data), and you can't completely understand the source data all at once. There's simply too much of it. So SSIS is used to progressively clean the data, and then, in this case, data mining is used to understand the cleansed data. This process is repeated until you get a meaningful result. Data mining model validation is built into the product so that you can more easily assess the results of your model building.
Figure 3-3 Software development life cycle for BI data mining projects (from SQL Server Books Online). The cycle runs from defining the problem to preparing data (Integration Services), exploring data (the data source view), building and validating models (the Data Mining Designer), and deploying and updating models (Integration Services), with iteration between the steps.
This process results in SSIS developers and SSAS developers needing to work together very closely. One analogy is the traditional concept of pair programming. We’ve found that tools such as instant messaging and Live Meeting are quite helpful in this type of working environment. Another variation of the traditional developer role in the world of SQL Server 2008 BI is that the majority of the development you perform takes place in a GUI environment, rather than in a code-writing window. Although many languages, as previously presented, are involved
under the hood in BI, the most effective BI developers are masters of their particular GUI. For SSAS, that means thoroughly understanding the OLAP cube and data mining structure designers in BIDS. For this reason, we devote several future chapters to this rich and complex integrated development environment. For SSIS, this means mastering the SSIS package designer in BIDS. We devote an entire section of this book to exploring every nook and cranny of that rich environment. For SSRS, this means mastering the report designer in BIDS, although other report-hosting environments can also be used, including Office SharePoint Server 2007, PerformancePoint Server, custom Windows Forms, Web Forms, or mobile forms, and more.

For maximum productivity during the development phase, consider these tips:
■■ Ensure great communication between the developer leads for SSIS, SSAS, and SSRS. It's common to have quick, daily recaps at a minimum. These can be as short as 15 minutes. If developers are geographically dispersed, encourage use of IM tools.
■■ Validate developers' skills with BIDS prior to beginning the development phase. Expecting .NET developers to simply pick up the interface to BIDS is not realistic. Also, traditional coders tend to try to use more manual code than is needed, and this can slow project progress.
■■ Establish a daily communication cycle, which optimally includes a daily build.
■■ Establish source control mechanisms, tools, and vehicles.
Note Pay attention to security requirements during the build phase. A common mistake we've seen is that BI developers work with some set of production data—sometimes a complete copy—with little or no security in place in the development environment. Best practice is to follow production security requirements during all phases of the BI project. If this includes obscuring, encrypting, or securing the data in production, those processes should be in place at the beginning of the development phase of your BI project. SQL Server 2008 contains some great features, such as transparent encryption, that can make compliance with security requirements much less arduous. For more information about native SQL Server 2008 data encryption, go to http://edge.technet.com/Media/580.

The goal of the build phase is to produce OLAP cubes and data mining structures that contain production data (often only a subset) per the specification document produced in the planning phase. Also, reporting structures—which can include SSRS, Office SharePoint Server 2007, and more—should be complete and available for testing.
Stabilizing

We've seen many BI project results rushed into production with little or no testing. This is a mistake you must avoid! As mentioned earlier, by using the wizards in BIDS, you can create
OLAP cubes, data mining structures, SSIS packages, and SSRS reports very quickly. This is fantastic for prototyping and iterative design; however, we've seen it have disastrous results if these quick structures are deployed into production without adequate testing (and subsequent tuning by the developers). This tuning most often involves skillful use of advanced properties in the various BIDS designers, but it can also include manual code tweaking and other steps. We'll devote several future chapters to understanding these fine but important details. Suffice it to say at this point that your testing plan should take into account the following considerations:
■■ It's critical to test your cubes and mining models with predicted, production-level loads (from at least the first year). One of the most common mistakes we've seen is developing these structures with very small amounts of data and doing no production-level testing. We call this the exploding problem of data warehousing.
■■ Your plan should include testing for usability (which should include end-user documentation testing). Usability testing should include query response time for cubes and mining models loaded with production-level volumes of data.
■■ You should also test security for all access levels. If your requirements include security auditing, that should also be tested.
The goal of the testing phase is to gain approval for deployment into production from all stakeholders in the project and to obtain sign-off from those who will be conducting the deployment. This sign-off certifies that specification goals have been met or exceeded during stabilizing and that the production environment has been prepared and is ready to be used.
Deploying

As your solution is moved into production, there should be a plan to create realistic service level agreements (SLAs) for security, performance (response time), and availability. Also, a plan for archiving data on a rolling basis should be implemented. SQL Server 2008 has many features that make archiving easy to implement. We'll talk more about archiving in Chapter 9, "Processing Cubes and Dimensions." Deployment includes monitoring for compliance with SLA terms over time.

The most common challenge we've encountered in the deployment phase is that network administrators are unfamiliar with BI artifacts such as cubes, mining models, packages, and so on. The most effective way to mitigate this is to assess the knowledge and skills of this group prior to deployment and to provide appropriate training. Microsoft has a large number of resources available for BI administrators at http://www.microsoft.com/sqlserver/2008/en/us/eventswebcasts.aspx#BusinessIntelligenceWebcast.
Skills Necessary for BI Projects In this section, we discuss both the required and optional (but nice to have) skills the members of your BI team need. Required skills include concepts such as understanding the use of the BI development tools in SQL Server 2008—not only SSAS and SSRS, but also SSIS. This section also includes a brief discussion of skills that relate to some of the new features of SQL Server 2008. The reason we include this information is that you might choose to use SQL Server 2008 as a data repository for source data based on your understanding of these new features. Following the discussion of required skills, we cover optional skills for BI teams.
Required Skills The following is a set of skills that we have found to be mandatory for nearly all of our realworld BI projects. These skills are most typically spread across multiple people because of the breadth and depth of knowledge required in each of the skill areas. As you are gathering resources for your project, a critical decision point is how you choose to source resources with these skills. Do you have them on staff already? This is unusual. Do you have developers who are capable of and interested in learning these skills? How will you train them? Do you prefer to hire outside consultants who have expertise in these skill areas? One of the reasons we provide you with this information here is because if you want or need to hire outside consultants for your project, you can use the information in this section as a template for establishing hiring criteria.
Building the Data Storage Containers The first consideration is who will build the foundation (the data storage containers), which can consist of OLAP cubes, data mining structures, or both. A required qualification for performing either of these tasks is a high level of proficiency with the developer interfaces in BIDS for SSAS (both OLAP cubes and data mining structures), as detailed in the following list: ■■
For OLAP cubes We find that understanding best practices for OLAP cube modeling—that is, star schema (sometimes called dimensional modeling), dimensional hierarchies, aggregation, and storage design—is needed here. We cover these topics in Chapter 5. As mentioned earlier, also required is an understanding of the appropriate use of BIDS to build SSAS OLAP cubes. Also needed is an understanding of how to use SQL Server Management Studio (SSMS) to manage cubes. We also find that a basic understanding of MDX syntax and XMLA scripting is valuable.
■■
For data mining structures We find that understanding best practices for data mining structure modeling—that is, DM modeling concepts, including a basic understanding of the capabilities of the data mining algorithms and of the functions of input and predictable columns—is needed for building a foundation using data mining structures. It’s also helpful if the person in this role understands the function of nested tables in modeling. We cover all these topics in future chapters in much more detail.
Creating the User Interface The second consideration is related to the building of the user interface (UI). Sensitivity to UI issues is important here, as is having a full understanding of the current client environment— that is Excel, Office SharePoint Server 2007, and so on—so that the appropriate client UI can be selected and implemented. This includes understanding Excel interface capabilities, PivotTables, and mining model viewers. You should also understand Office SharePoint Server 2007 Report Center and other SharePoint objects such as Excel Services. Excel Services facilitates hosting Web-based workbooks via Office SharePoint Server 2007. As mentioned in the life cycle section earlier in the chapter, sensitivity for and appropriate use of natural business taxonomies in UI implementation can be significant in aiding end-user adoption. This is particularly true for BI projects because the volumes of data available for presentation are so massive. Here are considerations to keep in mind when creating the user interface: ■■
For reporting in SSRS The first requirement is that team members understand the appropriate use of BIDS to author SSRS reports based on OLAP cubes and data mining structures. This understanding should include an ability to use report query wizards, which generate MDX or DMX queries, as well as knowing how to use the report designer surface to display and format reports properly. Usually report developers will also need to have at least a basic knowledge of both query languages (MDX and DMX). Also needed is a basic understanding of Report Definition Language (RDL). It’s also common for report creators to use the redesigned Report Builder tool. We devote Part IV, “Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence,” to the appropriate use of report designing tools and interfaces.
■■
For reporting in Excel It’s common to expect report designers to be able to implement Excel PivotTables for reports using OLAP cubes as data sources. Also required is for report designers to use the Data Mining Add-ins for Excel to create reports that use SSAS data mining models in Excel.
Understanding Extract, Transform, and Load Processes A critical skill set for your BI developers to have is an understanding of the powerful SSIS ETL package designer. As mentioned earlier, it’s not uncommon for up to 75 percent of the initial setup time of a BI project to be spent on SSIS work. This is because of the generally messy state of source data. It’s common to underestimate the magnitude and complexity of this task, so the more skilled the SSIS developer is, the more rapidly the initial development phases can progress. It has been our experience that hiring master SSIS developers if none are present in-house is a good strategy for most new BI projects. For data preparation or ETL, developers should fully understand the BIDS interface that is used to build SSIS packages. This interface is deceptively simple looking. As mentioned previously, we recommend using experienced SSIS developers for most BI projects. Another important skill is knowing how to use SSMS to manage SSIS packages.
Optimizing the Data Sources The skills needed for this last topic vary greatly. They depend on the quantity and quality of your data sources. For some projects, you’ll have only a single relational database as a source. If this is your situation, you probably don’t have much to worry about, especially if you already have a skilled database administrator (DBA). This person can optimize the data source for extracting, transforming, and loading (ETL) into the SSAS data structures. Although using SQL Server 2008 as a data source is actually not required, because you’ll be using SQL Server 2008 BI tools, you might want to upgrade your data sources. There are many new features built into SQL Server 2008, such as compression, encryption, easier archiving, and much more. These new features make using SQL Server 2008 very attractive for serving up huge amounts of data. We’ll cover these new features in greater depth in Chapter 9. Note Is SQL Server 2008 required as a data source for SSAS? The answer is no. You can use nearly any type of data source to feed data to SSAS. We mention SQL Server 2008 in this section because you might elect to upgrade an earlier version of SQL Server or migrate from a different vendor’s RDBMS at the beginning of your BI project. You might choose to upgrade or migrate because of the availability, scalability, and security enhancements in SQL Server 2008.
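To give a sense of how little code some of these features require, here is a hedged Transact-SQL sketch of two of them—page-level data compression and backup compression (both Enterprise edition features in SQL Server 2008). The table, database, and file names are placeholders, not objects from any sample database.

-- Estimate the space savings before committing to compression.
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'FactSales',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- Rebuild a large fact table with page compression.
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Compress backups to shorten maintenance windows and reduce storage.
BACKUP DATABASE SalesSourceDB
    TO DISK = 'E:\Backup\SalesSourceDB.bak'
    WITH COMPRESSION;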
Optional Skills The following is a set of next-level skills we’ve found to be useful for a development team to have for nearly all of our real-world BI projects. These skills, even more than the basic-level skills, are most typically spread across multiple people because of the breadth and depth of knowledge required in each of the skill areas. They’re not absolutely needed for every BI project, and, in many cases, these are the types of skills you might contract out for during various phases of your particular BI project.
Building the Foundation For all types of foundational stores, advanced knowledge of the BIDS design interface for OLAP cubes and data mining models is quite powerful. This includes knowing how to apply advanced property configuration options, as follows: ■■
For OLAP cubes This skill set includes advanced use of BIDS to model SSAS OLAP cubes and to configure advanced cube and mining model properties either via the GUI design or via the object model (ADOMD.NET API). It also includes MDX native query and expression authoring skills, as well as advanced XMLA scripting skills.
■■
For data mining structures As with the preceding item, this skill set includes advanced configuration of data mining structures and modeling using BIDS. It also includes DMX native query and expression authoring skills.
Creating the User Interface Advanced skills for UI creation usually revolve around the addition of more types of UI clients. While most every project we’ve worked on has used Excel and SSRS, it’s becoming more common to extend beyond these interfaces for reports, as summarized in the following list: ■■
For reporting in SSRS Advanced skills for SSRS include the use of Report Builder. (This also usually involves the ability to train business analysts or other similarly skilled users to use Report Builder to eventually create their own reports.) This skill set also includes advanced knowledge of the SSRS interface in BIDS, as well as the ability to use the advanced property configuration settings and perform manual generation of RDL.
■■
For reporting in Excel Advanced knowledge of Excel as an SSAS client includes understanding the native query functionality of OLAP cubes via extensions to the Excel query language. This skill set also includes a complete understanding of all the options available on the Data Mining tab, particularly those related to mining model validation and management from within Excel.
■■
For reporting in Office SharePoint Server 2007 Advanced OLAP skills in Office SharePoint Server 2007 entail understanding the capabilities of the built-in Report Center template and knowing the most effective ways to extend or create custom dashboards. Also required is an understanding of what type of data source to use to create dashboards with the greatest level of performance. Data sources to be used with this skill set can include OLAP cubes and Business Data Catalog (BDC)–connected data.
■■
For reporting in PerformancePoint Server If PerformancePoint Server is used as a client application for your BI solution, UI designers should be comfortable, at a minimum, with creating custom dashboards. Additionally, PerformancePoint Server is often used when financial forecasting is a requirement.
■■
For reporting in Microsoft Dynamics or other Customer Relationship Management (CRM) projects If you plan to customize the OLAP cubes or reports that are implemented by default in Microsoft Dynamics or other CRM products, you need to have a deep understanding of the source schemas for those cubes. Also, you need to carefully document any programmatic changes you make so that you can ensure a workable upgrade path when both SQL Server and Microsoft Dynamics release new product versions, service packs, and so on.
■■
For custom client reporting .NET programming experience is preferred if custom client development is part of your requirements. Skills needed include an understanding of the embeddable data mining controls as well as linked (or embedded) SSRS that use OLAP cubes as data sources.
Understanding Extract, Transform, and Load Processes As mentioned several times previously, we’ve found the work involved in locating, preparing, cleaning, validating, and combining disparate source data to be quite substantial for nearly every BI project we’ve been involved with. For this reason, advanced SSIS skills, more than advanced skills for any other part of the BI stack (that is, SSAS, SSRS), are most critical to ensuring that BI projects are implemented on time and on budget. For data preparation or ETL, in addition to mastering all the complexities of the SSIS package creation interface in BIDS (which includes configuration of advanced control flow and data flow task and package properties), advanced SSIS developers need to be comfortable implementing SSIS scripting, SSIS debugging, and SSIS logging. They should also understand the SSIS object model. In some cases, other products, such as BizTalk Server (orchestrations), might also be used to automate and facilitate complex data movement.
Optimizing the Data Sources Your advanced understanding of optimization of underlying data sources is completely dependent on those particular stores. We’ll cover only one case here—SQL Server 2008. BI developers who have advanced skills with SQL Server 2008 used as a BI data source will understand query tuning and indexing, and they’ll be able to optimize both for the most efficient initial data load or loads, as well as for incremental updates. They’ll also understand and use new security features, such as transparent encryption to improve the security of sensitive data. Additionally, they’ll know about and understand the enhancements available to increase maintenance windows; these enhancements can include using backup compression, partitioning, and other functionality.
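As a concrete illustration of the partitioning skill mentioned above, the following Transact-SQL sketch shows the general pattern for partitioning a fact table by a date key so that old data can later be switched out rather than deleted. All object names and boundary values are invented for the example, and a real design would typically place partitions on separate filegroups.

-- Partition by month on an integer date key (for example, 20090101 = January 1, 2009).
CREATE PARTITION FUNCTION pfMonthlyDateKey (int)
    AS RANGE RIGHT FOR VALUES (20090101, 20090201, 20090301);

CREATE PARTITION SCHEME psMonthlyDateKey
    AS PARTITION pfMonthlyDateKey ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales
(
    DateKey      int    NOT NULL,
    ProductKey   int    NOT NULL,
    SalesAmount  money  NOT NULL
) ON psMonthlyDateKey (DateKey);

-- Archiving the oldest month then becomes a fast metadata operation, for example:
-- ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO dbo.FactSales_Archive;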
Forming Your Team With this very long and imposing skills list, you might be wondering just how you accomplish all this. And if you are thinking that, we’ve proved our point—you can’t. Microsoft has worked hard to make BI accessible; however, the reality remains that BI projects require certain skills in order for them to be successful. It’s common to bring in specialized consultants during various phases of a project’s life cycle to complement the skills you and your team bring to the table. Another common response is to include a training plan for the internal team as part of a BI project.
Roles and Responsibilities Needed When Working with MSF Once again, the flexible MSF guidance can be helpful as you assemble your team. MSF includes the notion of roles and responsibilities, which are tied to milestones and deliverables from the MSF process model. Figure 3-4 shows the guidance for roles. Don’t be misled
into thinking that MSF advocates for a rigid number of people for each project—that is, one to fit each of the seven MSF roles shown in the diagram. Quite the opposite: the roles and responsibilities can be combined to scale to teams as small as three people, and they can be expanded to scale to teams with hundreds of members. It is important to note the connections between the roles, particularly the central node, which is labeled Team of Peers in the diagram in Figure 3-4. The important concept to consider here is that although all team members will contribute in different ways to the project using their various skill sets, each member of the team will make a critical contribution to the project. If any one member cannot perform at a high level, the project has a significant risk of not being delivered on time and on budget, and it might also not meet the quality bar established at the start of the project.
Figure 3-4 The seven major cluster groups for roles in MSF Agile (the diagram shows a central Team of Peers node connected to the Product Management, Program Management, Architecture, Development, Test, User Experience, and Release/Operations role clusters)
MSF Agile further associates roles with particular project responsibilities. Figure 3-4 applies MSF Agile rather than classic MSF. Note that MSF Agile separates the Program Management and Architecture roles. This option has worked well for us in BI projects. Usually, internal staff manages the program (or project)—controlling the schedule and budget—and external consultants (the authors of this book, for example) perform architecture tasks. Downscaling is rather obvious: simply combine roles according to the skills of team members. Upscaling requires a bit of explanation. MSF advocates for teams that report to leads for each role; those teams are then responsible for particular feature sets of the solutions. We’ll spend the rest of this chapter examining the basic definitions for MSF roles and relating those definitions to typical BI responsibilities.
Product Management The product manager initiates the project and communicates most project information to the external sponsors. These sponsors are external to the production team, but they’re usually part of the company in which the solution is being implemented. We’ve found exceptions to the latter statement in some larger enterprises, particularly government projects. These large enterprises often have stricter reporting requirements and also request more detailed envisioning and specification documents. In many organizations, the product manager is also the business analyst and acts as the communication bridge between the IT project and business teams. All BI projects need a product manager. In addition to initiating the project, which includes talking with the stakeholders to get their input before the project starts, the product manager is also responsible for getting the problem statement and then working with the team to formulate the vision/scope statement and the functional specification, budget estimates, and timeline estimates. The product manager is also the person who presents the vision/ scope and functional specification documents (which include budget and schedule) to the stakeholders for formal approval. The product manager is one of the primary external-facing team members. The product manager communicates the status of the project to the stakeholders and asks for more resources on the part of the team if the circumstances or scope of the project changes. The following list summarizes the duties of the product manager: ■■
Acts as an advocate for the stakeholders
■■
Drives the shared project vision and scope
■■
Manages the customer requirements definition
■■
Develops and maintains the business case
■■
Manages customer expectations
■■
Drives features versus schedule versus resources tradeoff decisions
■■
Manages marketing, evangelizing, and public relations
■■
Develops, maintains, and executes the communications plan
Architecture The duties of the BI project architect vary widely depending on the scope of the project. For smaller projects (particularly those using Agile software development life cycle methodologies), it’s common that the person who designs the project architecture is the same person who implements that architecture as a BI developer. Some of the skills that a qualified BI architect should have are the ability to translate business requirements into appropriate star schema and data mining structure models, the ability to model OLAP cubes (particularly dimensions), and the ability to model data for inclusion in one or more data mining models. For this last skill, an understanding of data mining algorithm requirements, such as supported data types, is useful.
Also, BI architects should be comfortable using one or more methods of visually documenting their work. Of course, in the simplest scenarios, a whiteboard works just fine. For most projects, our architects use a more formal data modeling tool, such as Visio or ERwin.
Program Management In some implementations of MSF, the program manager is also the project architect. The Program Manager role is responsible for keeping the project on time and on budget. Alternatively, in smaller projects, the program manager can also perform some traditional project manager duties, such as schedule and budget management. In larger projects, the program manager brings a person with project management skills onto the team to perform this specialized function. In the latter case, the program manager is more of a project architect. As previously mentioned, in larger BI projects, separate people hold the Program Manager and Architect roles. The program manager is the glue that holds the team together. It’s the primary responsibility of this person to translate the business needs into specific project features. The program manager doesn’t work in a vacuum; rather, she must get input from the rest of the team. During the envisioning and developing phases, input from the developer and user experience leads is particularly important. During later project phases, input from the test and deployment leads takes center stage. The program manager communicates with the team and reports the project’s status to the product manager. The program manager is also the final decision maker if the team can’t reach consensus on key issues. Here is a brief list of the program manager’s duties: ■■
Drives the development process to ship the product on time and within budget
■■
Manages product specifications and is the primary project architect
■■
Facilitates communication and negotiation within the team
■■
Maintains the project schedule, and reports project status
■■
Drives implementation of critical tradeoff decisions
■■
Develops, maintains, and executes the project master plan and schedule
■■
Drives and manages risk assessment and risk management
Development The developer manager is either the lead BI developer (in smaller projects) or acts as a more traditional manager to a group of BI developers. These developers usually consist of both internal staff and external consultants. As mentioned earlier, this combination is most often used because of the complexity and density of skills required for BI developers.
The role that the developer manager (sometimes called the dev lead) plays depends greatly on the size of the developer team. In smaller projects, there might be only a few developers. In those situations, the developer manager also performs development tasks. In larger teams, the developer manager supervises the work of the team’s developers. Here is a brief list of the developer manager’s duties: ■■
Specifies the features of physical design
■■
Estimates the time and effort to complete each feature
■■
Builds or supervises the building of features
■■
Prepares the product for deployment
■■
Provides technology subject matter expertise to the team
The most important consideration for your project team is determining the skills of your developers. We’ve seen a trend of companies underestimating the knowledge ramp for .NET developers to SSAS, SSIS, and SSRS competency. Although the primary tools used to work with these services are graphical—that is, BIDS and SSMS—the ideas underlying the GUIs are complex and often new to developers. Assuming that .NET developers can just pick up all the skills needed to master BI—particularly those needed for modeling the underlying data store structures of OLAP cubes and data mining models—is just not realistic. Also, the SSIS interface (and associated package-creation capabilities) is quite powerful, and few working, generalized .NET developers have had time to fully master this integrated development environment. Finally, in the area of reporting, the variety of client interfaces that are often associated with BI projects can demand that developers have a command of Excel, SSRS, Office SharePoint Server 2007, PerformancePoint Server, or even custom client creation. The skills a developer manager needs include the following: assess (current skills), create (a skills gap map), and assign (training of current developers or delegating tasks to other developers who have these skills). A skills gap map is a document that lists required project team member skills and then lists current team member skills. It then summarizes gaps between the skills the team currently has and those that the team needs. This type of document is often used to create a pre-project team training plan. Because BI skills are often new to traditional developers, it was a common part of our practice to design and implement team training prior to beginning work on a BI project. This training included BI concepts such as star schema modeling, as well as product-specific training, such as a step-by-step review of how to build an OLAP cube using SSAS and more. In our experience, every BI project needs at least three types of developers: an SSAS developer, an SSIS developer, and an SSRS developer. In smaller shops, it’s common to hire outside contractors for some or all of these skills because that is more cost-effective than training the current developer teams.
Who Is a BI Developer? It’s interesting to consider what types of prerequisite skills are needed to best be able to learn BI development. Much of the information that Microsoft releases about working with SQL Server 2008 BI is located on its TechNet Web site (http://technet.microsoft.com). This site is targeted at IT professionals (network and SQL administrators). We question whether this is actually the correct target audience. The key question to consider is this: Can an IT administrator effectively implement a BI solution? It has been our experience that traditional developers, particularly .NET developers, have the skills and approach to do this on a faster learning curve. We think this is because those developers bring an understanding of the BI development environment (BIDS), from having used Visual Studio. This gives them an advantage when learning the BI toolset. Also, we find that most of these developers have had experience with data modeling and data query. Of course their experience is usually with relational rather than multidimensional databases. However, we do find that teaching OLAP and data mining database development skills is quicker when the students bring some type of database experience to the table. Another tendency we’ve seen is for managers to assume that because the BIDS development environments are GUI driven (rather than code driven), nonprogrammers can implement BI solutions. Although this might be possible, we find that developers, rather than administrators, can gain a quicker understanding of core BI concepts and tools. Of course, to every guideline there are exceptions. We certainly have also worked with clients who ARE administrators, who have successfully implemented BI projects.
Note Although MSF is flexible in terms of role combinations for smaller teams, one best practice is worth noting here. MSF strongly states that the Developer and Tester roles should never be combined. Although this might seem like common sense, it’s surprising to see how often this basic best practice is violated in software projects. If developers were capable of finding their own bugs, they probably wouldn’t write them in the first place. Avoid this common problem!
Test The test manager is the lead tester in smaller projects and the test manager of a team in larger projects. The test manager has the following responsibilities: ■■
Ensure that all issues are known
■■
Develop testing strategy and plans
■■
Conduct testing
Like all effective testing, BI testing should be written into the functional specification in the form of acceptance criteria—that is, “Be able to perform xxx type of query to the cube and receive a response within xxx seconds.” As mentioned previously in this chapter, effective testing is comprehensive and includes not only testing the cubes and mining models for compliance with the specifications, but also end-user testing (using the client tools), security testing, and, most important of all for BI projects, performance testing using a projected real-world load. Failing to complete these tasks can lead to cubes that work great in development, yet fail to be usable after deployment.
User Experience The user experience manager’s primary role is developing the user interface. As with test and developer managers, the user experience manager performs particular tasks in smaller teams and supervises the work of team members in larger ones. The following list summarizes the responsibilities of the user experience manager: ■■
Acts as user advocate
■■
Manages the user requirements definition
■■
Designs and develops performance support systems
■■
Drives usability and user performance-enhancement tradeoff decisions
■■
Provides specifications for help features and files
■■
Develops and provides user training
The user experience manager needs to be aware of the skills of the designers working on the user interface. Many BI projects suffer from ineffective and unintuitive user interfaces for the data stores (OLAP cubes and mining models). The problem of ineffective visualization of BI results for users often hinders broader adoption. With the massive amounts of data available in OLAP cubes, old metaphors for reporting, such as tables and charts, aren’t enough and important details, such as the business taxonomy, are often left out. We’ve actually seen an OLAP cube accessed by an Excel PivotTable that included a legend to translate the field views into plain English. A successful user interface for any BI project must have the following characteristics: ■■
As simple as is possible, including the appropriate level of detail for each audience group
■■
Visually appealing
■■
Written in the language of the user
■■
Linked to appropriate levels of detail
■■
Can be manipulated by the users (for example, through ad hoc queries or pivot tables)
■■
Visually intuitive, particularly when related to data mining
In addition to trying to increase usability, you should include an element of fun in the interface. Unfortunately, all too often we see boring, overly detailed, and rigid interfaces. Rather than these interfaces being created by designers, they’re created by developers, who are usually much more skilled in implementing business logic than in creating a beautiful UI design. In Part IV, which is devoted to UI (reporting), we’ll include many examples that are effective. Note Microsoft Research contains an entire group, the data visualization group, whose main function is to discover, create, and incorporate advanced visualizations into Microsoft products. This group has made significant contributions to SQL Server 2008 (as well as to previous versions of SQL Server), for business intelligence in particular. For example, they created some of the core algorithms for data mining as well as some of the visualizers included for data mining models in BIDS. Go to http://research.microsoft.com/vibe/ to view some of their visualization tools.
Release Management The release/operations manager is responsible for a smooth deployment from the development environment into production and has the following responsibilities: ■■
Acts as an advocate for operations, support, and delivery channels
■■
Manages procurement
■■
Manages product deployment
■■
Drives manageability and supportability tradeoff decisions (also known as compromises)
■■
Manages operations, support, and delivery channel relationship
■■
Provides logistical support to the project team
Release/operations managers must be willing to learn specifics related to cube and mining model initial deployment, security, and, most important, maintenance. In our consulting work, we find that the maintenance phase of BI projects is often overlooked. Here are some questions whose answers will affect the maintenance strategy: ■■
What is the projected first-year size of the cubes and mining models?
■■
What is the archiving strategy?
■■
What is the physical deployment strategy?
■■
What is the security auditing strategy?
Summary In this chapter, we looked at two areas that we’ve seen trip up more than one well-intentioned BI project: managing the software development life cycle and building the BI team. We provided you with some guidance based on our experience in dealing with the complexity of BI project implementation. In doing that, we reviewed the software development life cycle that we use as a basis for our projects—MSF. After taking a look at generic MSF, we next applied MSF to BI projects and mentioned techniques and tips particular to the BI project space. We then looked at MSF’s guidance regarding team building. After reviewing the basic guidance, which includes roles and responsibilities, we again applied this information to BI projects. In Chapter 4, “Physical Architecture in Business Intelligence Solutions,” we turn our attention to the modeling processes associated with BI projects. We look in detail at architectural considerations regarding physical modeling. The discussion includes describing the physical servers and logical servers needed to begin planning, prototyping, and building your BI project. In Chapter 4, we’ll also detail best practices for setting up your development environment, including a brief discussion of the security of data.
Chapter 4
Physical Architecture in Business Intelligence Solutions
In this chapter, we turn to the nitty-gritty details of preparing the physical environment for developing your business intelligence (BI) project. We cover both physical server and service installation recommendations. Our goal is to help your team get ready to develop your project. We also look at setup considerations for test and production environments. Then we include a discussion of the critical topic of appropriate security for these environments. So often in our real-world experience, we’ve seen inadequate attention given to many of these topics (especially security), particularly in the early phases of BI project implementations. We conclude the chapter with a discussion of best practices regarding source control in a team development environment.
Planning for Physical Infrastructure Change As you and your team move forward with your BI project, you’ll need to consider the allocation and placement of key physical servers and the installation of services for your development, test, and production environments. Although you can install all the components of Microsoft SQL Server 2008 BI onto a single physical server, other than for evaluation or development purposes, this is rarely done in production environments. The first step in the installation process is for you to conduct a comprehensive survey of your existing network environment. You need to do this so that you can implement change in your environment in a way that is planned and predictable and that will be successful. Also, having this complete survey in hand facilitates rollback if that need arises during your BI project cycle.
Creating Accurate Baseline Surveys To get started, you need to conduct a comprehensive survey of the existing environment. If you’re lucky, this information is already available. In our experience, however, most existing information is incomplete, inaccurate, or missing. You can use any convenient method of documenting your findings. We typically use Microsoft Office Excel to do this. This activity should include gathering detailed information about the following topics: ■■
Physical servers name, actual locations, IP addresses of all network interface cards (NICs), domain membership Servers documented should include authentication servers, such as domain controllers; Web servers; and data source servers, such as file servers (for file-based data sources, such as Excel and Microsoft Access) and RDBMS
servers. If your BI solution must be accessible outside of your local network, include perimeter devices, such as proxy servers or firewalls in your documentation. You should also include documentation of open ports. ■■
Operating system configuration of each physical server operating system version, service packs installed, administrator logon credentials, core operating system services installed (such as IIS), and effective Group Policy object (GPO) settings If your solution includes SQL Server Reporting Services (SSRS), take particular care in documenting Internet Information Services (IIS) configuration settings. Note IIS 6.0 (included in Windows Server 2003) and IIS 7.0 (included in Windows Vista and Windows Server 2008) have substantial feature differences. These differences as they relate to SSRS installations will be discussed in much greater detail in Chapter 20, “Creating Reports in SQL Server 2008 Reporting Services.” For a good general reference on IIS, go to http://www.iis.net.
■■
Logical servers and services installed on each physical server This includes the name of each service (such as SQL Server, Analysis Services, and so on) and the version and service packs installed for each service. It should also include logon credentials and configuration settings for each service (such as collation settings for SQL Server). You should also include the management tools installed as part of the service installation. Examples of these are Business Intelligence Development Studio (BIDS), SQL Server Management Studio (SSMS), SQL Profiler, and so on for SQL Server. Services documented should include SQL Server, SQL Server Analysis Services (SSAS), SQL Server Integration Services (SSIS), and SSRS.
■■
Development tools This includes the names, versions, and configuration information for all the installed development tools, such as Microsoft Visual Studio. Visual Studio is not strictly required for BI development; however, in many cases you’ll find its features, such as Microsoft IntelliSense for .NET code, useful for advanced BI development.
■■
Samples and optional downloads As mentioned previously in this book, all samples for SQL Server 2008 are now available from http://www.codeplex.com. Samples are not part of the installation DVD. We do not recommend installing samples in a production environment; however, we do generally install the samples in development environments. We also find a large number of useful tools and utilities on CodePlex. Documenting the use of such tools facilitates continuity in a team development environment. An example of these types of tools is the MDX Script Performance Analyzer, found at http://www.codeplex.com/mdxscriptperf.
After you’ve gathered this information, you should plan to store the information in both electronic and paper versions. Storing the information in this way not only aids you in planning for change to your environment but is a best practice for disaster-recovery preparedness. The
information in the preceding list is the minimum amount you’ll need. In some environments, you’ll also need to include information from various log files, such as the Windows Event Viewer logs, IIS logs, and other custom logs (which often include security logs).
Assessing Current Service Level Agreements Service level agreements (SLAs) are being increasingly used to provide higher levels of predictability in IT infrastructures. If your organization already uses SLAs, reviewing the standards written into them should be part of your baseline survey. Your goal, of course, is to improve report (query) performance by introducing BI solutions into your environment. If your company does not use SLAs, consider attempting to include the creation of a BI-specific SLA in your BI project. An important reason for doing this is to create a basis for project acceptance early in your project. This also creates a high-level set of test criteria. A simple example is to assess appropriate query response time and to include that metric. For example, you can state the criteria in phrases like the following: “Under normal load conditions (no more than 1000 concurrent connections), query response time will be no more than 3 seconds.” Another common component of SLAs is availability, which is usually expressed in terms of the nines (or “9’s”). For example, an uptime measurement of “five nines” means 99.999 percent availability, which allows only about 5 minutes of unplanned downtime per year. By comparison, “four nines” (or 99.99 percent) allows roughly 53 minutes of unplanned downtime per year. We’ll talk a bit more about availability strategies later in this chapter.
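The arithmetic behind these numbers is simple and worth running whenever an availability target is proposed. The following Transact-SQL sketch converts an availability percentage into the minutes of unplanned downtime it allows per year:

-- Allowed downtime per year = (1 - availability) x minutes in a year (525,600).
SELECT (1 - 0.99999) * 525600 AS FiveNinesMinutesPerYear,  -- about 5.3 minutes
       (1 - 0.9999)  * 525600 AS FourNinesMinutesPerYear;  -- about 52.6 minutes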
What if you do not have or plan to use SLAs? You can and should still assess your current operating environments to create meaningful baseline information. This assessment should, at a minimum, consist of a simple list of pain points regarding data and reporting. You’ll use this information at the beginning of your project for inclusion in the problem statement portion of your specification documents. Alleviating or at least significantly reducing this pain is, of course, the ultimate goal of your BI project. We recommend using the Windows Reliability and Performance Monitor tool to capture current conditions. After you’ve installed SSAS, if you want to collect information about SSAS itself, several groups of counters are specific to it. These counter names all have the format of MSAS 2008:
. After you install SSIS or SSRS, additional performance counters specific to those services also become available. The default view of this tool in Windows Server 2008 is shown in Figure 4-1.
88
Part I
Business Intelligence for Business Decision Makers and Architects
FIgure 4-1 Windows Server 2008 Reliability and Performance Monitor tool
Considerations here can include the following possible problems: ■■
Slow report rendering as a result of current, unoptimized OLTP sources.
■■
Slow query execution as a result of unoptimized or underoptimized OLTP queries.
■■
Excessive overhead on core OLTP services as a result of contention between OLTP and reporting activities running concurrently. This overhead can be related to CPU, memory, disk access contention, or any combination of these.
■■
Short OLTP maintenance windows because of long-running backup jobs and other maintenance tasks (such as index defragmentation).
After you’ve completed your baseline environment survey, your next preparatory step is to consider the number of physical servers you’ll need to set up your initial development environment. There are two ways to approach this step. The first is to simply plan to set up a working development environment with the intention of creating or upgrading the production environment later. The second is to plan and then build out all working environments, which could include development, test, and production. The approach that is taken usually depends on the size and maturity of the BI project. For newer, smaller projects, we most often see companies choose to purchase one or two new
servers for the development environment with the intention of upgrading or building out the production environment as the project progresses.
Determining the Optimal Number and Placement of Servers When you’re planning for new servers, you have many things to consider. These can include the following: ■■
Source systems If your source systems are RDBMSs, such as an earlier version of SQL Server or some other vendor’s RDBMS (Informix, Sybase, and so on), you might want to upgrade source data systems to SQL Server 2008 to take advantage of BI-related and non-BI-related enhancements. For more information about enhancements to the relational engine of SQL Server 2008, go to http://download.microsoft.com/download/C/8/4/C8470F54-D6D2-423D-8E5B-95CA4A90149A/SQLServer2008_OLTP_Datasheet.pdf.
■■
Analysis Services Where will you install this core service for BI? We find that for smaller BI projects, some customers elect to run both SQL Server 2008 and Analysis Services on the same physical machine. There, they run both OLAP and extract, transform, and load (ETL) processes. This configuration is recommended only for smaller implementations—fewer than 100 end users and less than 5 GB of data in an OLAP cube. However, when creating a development environment, it is common to install both SQL Server and Analysis Services on the same machine. Another common way to set up a development environment is to use multiple virtual machines. The latter is done so that the security complexities of multiple physical server machine installs can be mirrored in the development or test environments.
■■
Integration Services Where will you install this service? As mentioned, for development, the most common configuration we see is a SQL Server instance, primarily for hosting SSIS packages, installed on the same machine as SSAS. In production, the most common configuration is to install the SSIS component for the SQL Server instance on a separate physical machine. This is done to reduce contention (and load) on the production SSAS service machine.
■■
Reporting Services Where will you install this service? As with SSIS, in development, we usually see an SSRS instance installed on the same machine as SSAS and SSIS. Some customers elect to separate SSAS or SSIS from SSRS in the development phase of their BI project because, as mentioned earlier in this section, this type of configuration can more closely mirror their typical production environment. Production installations vary greatly depending on the number of end users, cube size, and other factors. In production, at a minimum, we see SSRS installed on at least one dedicated server.
Tip Like SQL Server itself, SSAS installations support multiple instances on the same server. This is usually not done in development environments, with one important exception. If you’re using an earlier version of SSAS, it’s common to install multiple instances in development so that upgradeability can be tested.
More often, we see a typical starting BI physical server installation consisting of at least two new physical servers. SSAS and SQL Server are installed on the first server. (SSIS is run on this server.) SSRS is installed on the second server. We consider this type of physical configuration to be a starting setup for most customers. Many of our customers elect to purchase additional servers for the reasons listed in the component list shown earlier—that is, to scale out SSRS, and so on. We also choose to add physical servers in situations where hard security boundaries, differing server configuration settings, or both are project requirements. Figure 4-2 illustrates this installation concept. It also shows specifically which logical objects are part of each SSAS instance installation. We’ll talk more about these logical objects in Chapter 5, “Logical OLAP Design Concepts for Architects.”
Figure 4-2 Multiple SSAS instances on a single physical server (illustration from SQL Server Books Online); the diagram shows an AMO application connecting to multiple Analysis Services instances on one server, each containing database objects—data source views, OLAP objects, data mining objects, and helper objects—that draw on underlying data sources
Note An important consideration when you’re working with existing (that is, previous versions, such as SQL Server 2000 or earlier) SQL Server BI is how you’ll perform your upgrade of production servers for SSAS, SSIS, and SSRS. For more information about upgrading, see the main Microsoft SQL Server Web site at http://www.microsoft.com/sql, and look in the “Business Intelligence” section.
For large BI projects, all components—that is, SSAS, SSIS, and SSRS—can each be scaled to multiple machines. It’s more common to scale SSAS and SSRS than ETL to multiple machines. We’ll cover this in more detail in Chapter 9, “Processing Cubes and Dimensions,“ for SSAS and in Chapter 22, “Advanced SQL Server 2008 Reporting Services,” for SSRS. For SSIS scaling, rather than installing SSIS on multiple physical machines, you can use specific SSIS package design techniques to scale out execution. For more information, see Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration Services.” Tip Because of the large number of considerations you have when planning a BI project, some hardware vendors (for example, Dell) have begun to provide guidance regarding proper server sizing for your BI projects. For more information, go to the Dell Web site at http://www.dell.com.
Considerations for Physical Servers In most environments, you’ll want to use, at a minimum, one uniquely dedicated physical server for your BI project. As mentioned, most of our clients start by using this server as a development server. Your goal in creating your particular development environment depends on whether or not you’ll also be implementing a test environment. If you are not, you’ll want to mirror the intended production environment (as much of it as you know at this point in your project) so that the development server can function as both the development server and test server. The number and configuration of physical servers for your production environment will vary greatly depending on factors like these: the amount of data to be processed and stored, the size and frequency of updates, and the complexity of queries. A more in-depth discussion of scalability can be found in Chapter 9. Note If you read in SQL Server Books Online that “BIDS is designed to run on 32-bit servers only,” you might wonder whether you can run BIDS on an x64 system. The answer is yes, and the reason is that it will run as a WOW (or Windows-on-Windows 32-bit) application on x64 hardware.
Other considerations include the possibility of using servers that have multiple 64-bit processors and highly optimized storage—that is, storage area networks (SANs). Also, you can choose to use virtualization technologies, such as Virtual Server or Hyper-V, to replicate the eventual intended production domain on one or more physical servers. Figure 4-3 shows a conceptual view of virtualization. For more information about Windows Server 2008 virtualization options, see http://www.microsoft.com/windowsserver2008/en/us/virtualization-ent.aspx.
Figure 4-3 Using virtualization technologies can simplify BI development environments. (The diagram shows several physical servers consolidated through virtualization onto a single virtual host running multiple virtual guests.)
Server Consolidation When you are setting up the physical development environment for your BI project, you don’t have to be too concerned with server consolidation and virtualization. However, this will be quite different as you start to build your production environment. Effective virtualization can reduce total cost of ownership (TCO) significantly. There are a few core considerations for you to think about at this point in your project so that you can make the best decisions about consolidation when the time comes: ■■
Know baseline utilization (particularly of CPUs).
■■
Consider that tempdb will be managed in a shared environment, which involves important limits such as the use of a single collation and sort order.
■■
Remember that MSDB server settings, such as logon collisions between servers, must be resolved prior to implementing consolidation. (A quick way to inventory logins and the server collation on each candidate instance is sketched after this list.)
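The collation and login issues in the preceding list are easy to inventory up front. The following Transact-SQL sketch shows the kind of checks you might run on each instance that is a candidate for consolidation; it only reports the current state and makes no changes.

-- Server-level collation (tempdb inherits this) for the current instance.
SELECT SERVERPROPERTY('Collation') AS ServerCollation;

-- Logins on this instance; compare the lists across instances to spot name collisions.
SELECT name, type_desc, create_date
FROM sys.server_principals
WHERE type IN ('S', 'U', 'G')  -- SQL logins, Windows logins, Windows groups
ORDER BY name;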
Considerations for Logical Servers and Services BI projects contain many parts. At a minimum, you’ll probably work with SSAS, SSIS, and SSRS. Given that, you need to consider on which physical, virtual, or logical server you’ll need to install each instance of the aforementioned services. You’ll also need to install all prerequisites on the operating system as well as development and management tools, such as SSMS and BIDS. As shown in Figure 2-1 in Chapter 2, “Visualizing Business Intelligence Results,” you can have quite a large number of items to install for a BI solution. In addition to the core services—SSAS, SSIS, and SSRS—you’ll install tools such as BIDS, SSMS, and SQL Profiler, as well as other services, such as Microsoft Office SharePoint Server 2007, Dynamics, or PerformancePoint Server. Each of these services contains many configuration options. It is important to understand that the BI stack, like all of SQL Server 2008, installs with a bare-minimum configuration. This is done to facilitate the use of security best practices. In particular, it’s done to support the idea of presenting a reduced attack surface. If your solution requires more than minimal installation—and it probably will—it’s important that you document the specific installation options when you set up the development environment. You can access these setup options in a couple of ways, the most common of which is to connect to the service via SSMS and then to right-click on the instance name to see its properties. Figure 4-4 shows just some of the service configuration options available for SSAS. Unless you have a specific business reason to do so, you most often will not make any changes to these default settings. You’ll note that the Show Advanced (All) Properties check box is not selected in Figure 4-4. Selecting this check box exposes many more configurable properties as well.
Figure 4-4 SSAS server configuration settings in SSMS
One example of this configuration is to document the service logon accounts for each of the involved services—that is, for SSAS, SSIS, and SSRS. Of course, all these services connect to one or more data sources, and those data sources each require specific logon credentials. We’ve found over and over in BI projects that failure to adequately document information such as this is a big time waster over the lifetime of the project. We’ve spent many (expensive, billable!) hours chasing down connection account credentials. You’ll definitely want to use the SQL Server Configuration Manager to verify all service settings. It’s particularly useful as a quick way to verify all service logon accounts. Figure 4-5 shows this tool.
Figure 4-5 SQL Server Configuration Manager
Other considerations include what your security baselines should be when installing new services. Determining this might require coordination with your security team to plan for changes to group policies to restrict usage of various tools available with BI services. An example of this is the use of SQL Profiler, which allows traces to be created for SSAS activity. Also, you need to decide how to implement security based on the type of client environment your development environment should contain. As mentioned, most BI projects will include some sort of SSRS implementation. However, some projects will include other products, such as PerformancePoint Server and Microsoft Office SharePoint Server 2007. Office SharePoint Server 2007 requires both server and client access licenses, unless we specifically say otherwise throughout this book. Office SharePoint Server 2007 includes many rich features commonly used in BI projects, such as the Report Center site template. Your project might also include Windows SharePoint Services, a free download for Windows Server 2003 or later. Of course, the more complex the development (and eventual production) environments, the more important detailed documentation of service configuration becomes. Other possible client installation pieces are Office 2007 and the Data Mining Add-ins for SQL Server 2008. There are also many third-party (commercial) custom client interface tools available for purchase.
Understanding Security Requirements As with scalability, we’ll get to a more comprehensive discussion of security later in this book. We include an introduction to this topic here because we’ve noted that some customers elect to use a copy (or a subset) of production data in development environments. If this is your situation, it’s particularly critical that you establish appropriate security for this data from the start. Even if you use masked or fake data, implementing a development environment with least-privilege security facilitates an appropriate transition to a secure production environment. Note You can use a free tool to help you plan security requirements. It’s called the Microsoft Security Assessment Tool, and it’s located at https://www.microsoft.com/technet/security/tools/msat/default.mspx.
Several new features and tools are available with SQL Server 2008 (OLTP) security, such as the new Declarative Management Framework (DMF) policies and the Microsoft Baseline Security Analyzer (which can be downloaded from https://www.microsoft.com/technet/security/tools/mbsahome.mspx). Most of these tools are not built to work with SSAS, SSIS, and SSRS security. For this reason, you should be extra diligent in planning, implementing, and monitoring a secure BI environment. We stress this because we’ve seen, more often than not, incomplete or wholly missing security in this environment. As problems such as identity theft grow, this type of security laxness will not be acceptable.
Security Requirements for BI Solutions

When you’re considering security requirements for your BI solution, following the basic best practice of (appropriate) security across all tiers of the solution is particularly important. BI projects are, after all, about enterprise data. They also involve multiple tiers, consisting of servers, services, and more. Figure 4-6 gives you a high-level architectural view of the BI solution landscape.
Figure 4-6 In planning BI solution security, it’s important to consider security at each tier. (The diagram shows Win32, COM-based, and .NET client applications for OLAP and/or data mining connecting through ADO MD, OLE DB for OLAP, and ADO MD.NET, and any application connecting either via XMLA over HTTP (WAN) through IIS and the data pump or via XMLA over TCP/IP, to an instance of SQL Server 2008 Analysis Services, which in turn draws on data sources, Integration Services packages, and relational databases.)
Source Data: Access Using Least-Privileged Accounts

As discussed, you should have collected connection information for all source data as part of your baseline survey activities. Connect to source data using the security best practice of least privilege—that is, use nonadministrator accounts to connect to all types of source data. This includes both file-based and relational source data. In some cases, we’ve had to have reduced-privilege accounts created in the source systems specifically for BI component connectivity.
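If your source data happens to live in SQL Server, one straightforward way to apply this principle is to create a dedicated read-only account for BI connectivity. The following Transact-SQL is only a minimal sketch; the login, database, and schema names are placeholders invented for illustration, not objects from any sample used in this book.

-- Minimal sketch: a dedicated, least-privileged account for BI source connectivity.
-- The login, database, and schema names below are placeholders.
CREATE LOGIN [CONTOSO\svc_bi_extract] FROM WINDOWS;
GO

USE SourceSalesDB;   -- hypothetical source database
GO

CREATE USER [CONTOSO\svc_bi_extract] FOR LOGIN [CONTOSO\svc_bi_extract];

-- Grant only what the extract process needs: read access to the source schema.
GRANT SELECT ON SCHEMA::Sales TO [CONTOSO\svc_bi_extract];
GO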
Note  If you’re using SQL Server 2008 as one of your source data repositories, you can take advantage of many new security enhancements available in the OLTP source layer. A few examples include the following: declarative policy management, auditing, and transparent data encryption. For more information, see http://download.microsoft.com/download/d/1/b/d1b8d9aa-19e3-4dab-af9c-05c9ccedd85c/SQL%20Server%202008%20Security%20-%20Datasheet.pdf.
Data in Transit: Security on the Wire

You might be tempted to skip the key security consideration of protecting data on the wire, or in transit, between systems. This would be an unfortunate oversight. Security breaches happen, on average, at a rate of three-to-one inside the firewall versus outside of it. It’s naïve to believe that such issues will not arise in your environment. Sensitive or personal data—such as credit card information, Social Security numbers, and so on—must be protected in transit as well as while stored. We most commonly see either network-level (such as IPSec) or application-level (HTTPS) security performing this function.
Processing Layer: ETL

The next phase in your security planning process is to consider security in what we call the processing layer—that is, during SSIS package implementation and execution. Keep in mind that SSIS packages are executable files designed with a specific purpose. That purpose (when building out a BI solution) is to extract, transform, and load data in preparation for building SSAS cubes and mining models. We do see customers using SSIS for non-BI purposes, such as data migration, but we won’t cover those cases in this discussion. Security in SSIS is layered, starting with the packages themselves. Packages include several security options related to the ability to read, write, and execute the packages. Using SSMS, you can view (or change) the roles associated with read/write ability for each package. Figure 4-7 shows the dialog box that enables you to do this.
Figure 4-7 SSIS packages have associated read and write permissions.
When you create SSIS packages in BIDS, each package has default security settings assigned to sensitive package properties, such as connection strings. You can change these default encryption settings as your security requirements necessitate. For example, you can change the ProtectionLevel property value and (optionally) assign a password to the package. The default setting for the ProtectionLevel option is EncryptSensitiveWithUserKey. All possible settings for this option are shown in Figure 4-8.
Figure 4-8 ProtectionLevel SSIS package option settings
It’s important that you set and follow a regular security standard for SSIS packages at the time development begins for your project. We’ve really just scratched the surface of SSIS security. Another security option available for SSIS packages is verification of integrity via association with digital signatures. Yet another security consideration is how you will choose to store the connection string information used by your packages. Your options include storing connection string information within the package or storing it externally (such as in a text file, XML file, and so on). We’ll cover this topic and other advanced SSIS security topics more thoroughly in Chapter 18, “Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services,” which covers SSIS in depth. One final thought before we end this introduction to SSIS security is related to where you choose to store your SSIS packages. Packages can be stored in SQL Server itself or in a folder on the file system. We’ll cover the exact locations in more detail in just a bit. For security purposes—particularly if you choose to store your packages in folders—you should remember to apply an appropriate access control list (ACL) to those folders.
SSAS Data

After the data is associated with SSAS, you have a new set of security considerations in this key service for BI. The first of these is the permissions associated with the logon account for the SSAS service itself. As with any service, the SSAS logon account should be configured using the security principle of least privilege. The next consideration for security is at the level of a BIDS solution. As in Visual Studio, top-level containers are called solutions and are represented as *.sln files. Solutions contain one or more projects. Inside each BIDS solution, you’ll set up one or more data source objects. Each of these objects represents a connection string to some type of data that will be used as a source for your SSAS objects. Each data source requires specific connection credentials. As with any type of connection string, you supply these credentials via the dialog box provided. In addition to supplying these credentials, you must also supply the impersonation
information for each data source object. Using the SSAS service account is the default option. You can also choose to use a specific Microsoft Windows account, to use the credentials of the current user, or to inherit credentials. This step is shown in Figure 4-9.
Figure 4-9 Impersonation Information options when connecting to an SSAS 2008 cube
So which one of these options should you use? At this point, we simply want to present the options available. As we drill deeper into working with BIDS, we’ll arm you with enough information to answer this question in the way that best supports your particular project requirements. A bit more information about impersonation might also help you answer the related question, “When does impersonation get used rather than the credentials supplied in the connection string?” SQL Server Books Online includes this guidance:

The specified credentials will be used for (OLAP cubes) processing, ROLAP queries, out-of-line bindings, local cubes, mining models, remote partitions, linked objects, and synchronization from target to source. For DMX OPENQUERY statements, however, this option is ignored and the credentials of the current user will be used rather than the specified user account.

You’ll have to plan for and implement many other security considerations for SSAS as your project progresses. As with the SSIS processes discussed earlier, our aim is to get you started with baseline security. We’ll go into detail about other considerations, including SSAS roles and object-specific permissions, in Part II, “Microsoft SQL Server 2008 Analysis Services for Developers,” which focuses on SSAS implementation.
On the User Client: Excel or SSRS

The last consideration in your core security architecture for your BI development environment is how thoroughly you want to build out your client environment. It has been our experience that the complete client environment is often not fully known at this phase of BI projects. Usually, the environment includes, at a minimum, SSRS and Excel. Other clients that we’ve seen included at this point are Office SharePoint Server 2007 and PerformancePoint Server. In this section, we’ll focus first on security in Excel 2007 for SSAS objects—OLAP cubes first and then data mining structures. After we briefly discuss best practices on that topic, we’ll take a look at core security for SSRS. A more complete discussion of SSRS security is included in Part IV, “Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence,” which is devoted entirely to implementation details of SSRS.

Excel as an OLAP Cube Client

As we saw in previous chapters, connecting Excel 2007 as a client of SSAS 2008 OLAP cubes via Excel’s PivotTable view is a straightforward operation. This is, of course, by design. What is the best practice for baseline security in this situation? Note that in Figure 4-10 you can see the default configuration, which is to use the currently logged-on Windows user account to authenticate to the OLAP cube data source. Note also that the connection string is stored as a file on the client. As a baseline measure, you should use the principle of least privilege, which means that, at the very least, a non-administrator account should be used to connect. Also, if you’re passing a specific user name and password into the connection string, be aware that the credential information is stored in plain text in the local connection string file.
Figure 4-10 Connection properties in Excel 2007 when connecting to an SSAS 2008 cube
You might also want to enable auditing using one of the available auditing tools, such as SQL Profiler, if you’re granting access via Excel to OLAP cubes that are built using any production data.

Excel as a Data Mining Client

If you plan to use Excel as a client for SSAS data mining structures, you must first install the SQL Server 2008 Data Mining Add-ins for Office 2007. After you install the add-ins, you must run the included Server Configuration Utility to set up the baseline configuration between Excel 2007 and SSAS 2008 data mining. This consists of the following steps:

1. Specify the SSAS server name. (The connecting user must have permission on SSAS to create session mining models.)

2. Specify whether you want to allow the creation of temporary mining models.

3. Specify whether you’d like to create a new database to hold information about users of your mining models.

4. Specify whether to give users other than administrators permission to create permanent mining models in SSAS by using the Excel interface.

You can rerun this wizard to change these initial settings if you need to. After you complete the initial configuration, use the Connection group on the Data Mining tab of the Excel 2007 Ribbon to configure session-specific connection information. Clicking the connection button starts a series of dialog boxes identical to the ones you use to configure a connection to an SSAS cube for an Excel PivotTable view. Figure 4-11 shows the Data Mining tab, with the connection button showing that the session has an active connection to an SSAS server (the AdventureWorksDW2008 database).
Figure 4-11 Connection button in Excel 2007 showing an active connection
The new trace utility called Tracer, which is included with the Data Mining Add-ins, allows you to see the connection string information. It also allows you to see the generated DMX code when you work with SSAS data mining structures using Excel. The connection information in Tracer is shown in Figure 4-12.
Figure 4-12 Connection information shown in the Tracer utility in Excel 2007
SSRS as an OLAP Client

Installing SSRS requires you to make several security decisions. These include which components to install, installation locations for services, and service account names. As with the other BI components, we’ll cover SSRS installation in greater detail in Part IV. Another installation complexity involves whether or not you plan to integrate an Office SharePoint Server 2007 installation with SSRS. Finally, you must choose the environment that will eventually run the rendered reports, whether that is a browser, Windows Forms, and so on. We’ll also present different options for security settings. By default, SSRS is configured to use Windows authentication. You can change this to a custom authentication module—such as forms or single sign-on (SSO)—if your security requirements necessitate such a change. Choosing an authentication mode other than Windows authentication requires you to go through additional configuration steps during installation and setup. A good way to understand the decisions you’ll need to make during SSRS setup is to review the redesigned Reporting Services Configuration Manager. Here you can set or change the service account associated with SSRS, the Web service URL, the metadata databases, the Report Manager URL, the e-mail settings, an (optional) execution account (used when connecting to data sources that don’t require credentials or to file servers that host images used by reports), and the encryption keys. You can also configure scale-out deployment locations. Another important consideration is whether you’ll use Secure Sockets Layer (SSL) to encrypt all traffic. For Internet-facing SSRS servers, this option is used frequently. You associate the SSL certificate with the SSRS site by using the Reporting Services Configuration Manager tool as well. All SSRS information generated by this tool is stored in a number of *.config files. We’ll look at the underlying files in detail in Part IV, which is devoted to SSRS. Figure 4-13 shows this tool.
Figure 4-13 Reporting Services Configuration Manager
After you’ve installed SSRS, as you did with SSAS, you must then determine the connection context (or authorization strategy) for reports. Figure 4-14 shows a high-level description of the authentication credential flow: a user account makes the first connection to the report server, the service accounts connect the report server to the report server database, and user or other accounts make the connection to external data. In SSRS, roles are used to associate report-specific permissions with authorized users. You can choose to use built-in roles, or you can create custom roles for this purpose.

Figure 4-14 Authentication flow for SSRS
SSRS permissions are object-specific. That is, different types of permissions are available for different objects associated with an SSRS server. For example, permissions associated with data sources, reports, report definitions, and so on differ by object. As a general best practice, we recommend creating several parking folders on the SSRS service. These folders should have restricted permissions. Locate all newly deployed objects (connections,
reports, and report models) there. Figure 4-15 shows a conceptual rendering of the SSRS authorization scheme, in which groups and users (such as an administrators group or an individual user) are assigned to roles (such as Publisher), each role carries tasks and permissions (such as Manage Folders or Reports), and those permissions apply to items in the report server database (folders, reports, and resources).

Figure 4-15 Conceptual rendering of SSRS security (illustration from SQL Server Books Online)
Security Considerations for Custom Clients

In addition to implementing client systems that “just work” with SSAS objects as data sources—that is, Excel, SSRS, PerformancePoint Server, and so on—you might also elect to do custom client development. By “just work” we mean client systems that can connect to and display data from SSAS OLAP cubes and data mining models after you configure the connection string—that is, no custom development is required for their use.
The most common type of custom client we’ve seen implemented is one that is Web based. Figure 4-16 illustrates the architecture available for custom programming of thin-client BI interfaces. As with other types of Web sites, if you’re taking this approach, you need to consider the type of authentication and authorization system your application will use and, subsequently, how you will flow the credentials across the various tiers of the system.
Figure 4-16 Thin clients for BI require additional security. (The diagram shows browsers and other thin clients connecting over the Web to Internet Information Services (IIS), where ASP and ASP.NET applications host Win32, COM-based, and .NET client code for OLAP and/or data mining; these clients use ADO MD, OLE DB for OLAP, and ADO MD.NET, and any application can connect via XMLA over TCP/IP to an instance of SQL Server 2008 Analysis Services.)
It’s important that you include security planning for authentication and authorization, even at the early stages of your project. This is because you’ll often have to involve IT team members outside of the core BI team if you’re implementing custom client solutions. We’ve seen situations where solutions were literally not deployable into existing production environments because this early planning and collaboration failed to occur. As you would expect, if you’re using both OLAP cubes and data mining structures, you have two disparate APIs to contend with. We have generally found that custom client development occurs later in the BI project life cycle. For this reason, we’ll cover more implementation details regarding this topic in Part IV, which presents the details about client interfaces for SSAS.
Backup and Restore

You might be surprised to see this topic addressed so early in the book. We’ve chosen to place it here to help you avoid the unpleasant surprises that we saw occur with a couple of our customers. These scenarios revolve around backups that aren’t running regularly, aren’t run at all, are incomplete, or, worst of all, are unrestorable. The time to begin regular backup routines for your BI project is immediately after your development environment is set up. As with the security solution setup for BI, backup scenarios require that you consider backup across all the tiers you’ve chosen to install in your development environment. At a minimum, that will usually consist of SSAS, SSIS, and SSRS, so those are the components we’ll cover here. For the purpose of this discussion, we’ll assume you already have a tested backup process for any and all source data that is intended to be fed into your BI solution. It is, of course, important to properly secure both your backup and restore processes and data as well.
Backing Up SSAS

The simplest way to perform a backup of a deployed SSAS solution is to use SSMS. Right-click the database name in the Object Explorer tree and then click Back Up. From the resulting dialog box, you can run the backup or generate the XMLA script that performs it. The backup includes the metadata for all types of SSAS objects in your solution—that is, data sources, data source views, cubes, dimensions, data mining structures, and so on. Figure 4-17 shows the user interface in SSMS for configuring the backup. Note that in this dialog box you have the option to generate the XMLA script to perform this backup. XMLA scripting is an efficiency that many BI administrators use to perform routine maintenance tasks such as regular backups.
Figure 4-17 The SSMS interface for SSAS backups
When you perform backups (and restores), you must have appropriate permissions. These include membership in a server role for SSAS, full control on each database to be backed up, and write permission on the backup location.

Note  You can also use the Synchronize command to back up and restore. This command requires that you are working with instances located in two different locations. One scenario is when you want to move development data to production, for example. A difference between the backup and synchronize commands is that the latter provides additional functionality. For instance, you can have different security settings on the source (backup) and destination (synchronization target). This type of functionality is not possible with regular SSAS backup. Also, synchronization does an incremental update of objects that differ between the source server and the destination server.
Backing Up SSIS

The backup strategy you use for development packages depends on where you choose to store those packages. SSIS packages can be stored in three different locations: in SQL Server, on the file system, or in the SSIS package store. If you store the packages in SQL Server, backing up the msdb database will include a backup of your SSIS packages. If you store the packages in a particular folder on the file system, backing up that folder will back up your packages. If you store the packages in the SSIS package store, backing up the folders associated with that store will back up the packages
contained within it. Data sources, data source views, and SSIS packages are all candidates for backup; however, because we usually don’t create global data sources and data source views when working with SSIS, we typically only have to consider backing up the SSIS packages themselves.
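For packages stored in SQL Server, a scheduled backup of the msdb database is usually enough to cover them. The following Transact-SQL is a minimal sketch; the backup path is a placeholder for your own backup location.

-- Minimal sketch: backing up msdb also backs up SSIS packages stored in SQL Server.
-- The backup path is a placeholder.
BACKUP DATABASE msdb
TO DISK = N'E:\SQLBackups\msdb.bak'
WITH INIT, CHECKSUM;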
Backing Up SSRS

There are several parts to consider when setting your baseline SSRS backup strategy:

■ Back up the RS databases  Back up the reportserver and reportservertempdb databases that run on a SQL Server Database Engine instance. You cannot back up the reportserver and reportservertempdb databases at the same time, so schedule your backup jobs on SQL Server to run at different times for these two databases. (A simple example script follows this list.)

■ Back up the RS encryption keys  You can use the Reporting Services Configuration Manager to perform this task.

■ Back up the RS configuration files  As mentioned, SSRS configuration settings are stored in XML configuration files. These should be backed up. They are Rsreportserver.config, Rssvrpolicy.config, Rsmgrpolicy.config, Reportingservicesservice.exe.config, Web.config for both the Report Server and Report Manager ASP.NET applications, and Machine.config for ASP.NET. For default locations, see the SQL Server Books Online topic “Configuration Files (Reporting Services)” at http://msdn.microsoft.com/en-us/library/ms155866.aspx.

■ Back up the data files  You must also back up files created by Report Designer and Model Designer. These include report definition (.rdl) files, report model (.smdl) files, shared data source (.rds) files, data source view (.dsv) files, data source (.ds) files, report server project (.rptproj) files, and report solution (.sln) files.

■ Back up administrative script files and others  You should back up any script files (.rss) that you created for administration or deployment tasks. You should also back up any custom extensions and custom assemblies you are using with a particular SSRS instance.
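To illustrate the first item in the list, the following Transact-SQL sketch backs up the two SSRS databases one after the other so that they are never backed up at the same time. The file paths are placeholders, and in practice you would typically schedule each statement in its own SQL Server Agent job.

-- Minimal sketch: back up the SSRS metadata databases sequentially, not simultaneously.
-- File paths are placeholders.
BACKUP DATABASE reportserver
TO DISK = N'E:\SQLBackups\reportserver.bak'
WITH INIT, CHECKSUM;
GO

BACKUP DATABASE reportservertempdb
TO DISK = N'E:\SQLBackups\reportservertempdb.bak'
WITH INIT, CHECKSUM;
GO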
Auditing and Compliance

When you begin your BI development efforts, you can choose to work with a copy of production data. If you do that, you might have to maintain certain pre-established standards for data access. Usually, auditing records are used to prove whether these standards have been met. Some examples are HIPAA compliance (for health care records) and SOX (Sarbanes-Oxley, for businesses of a certain size and type). Of course, not all industries have such rigorous requirements. However, as problems such as identity theft grow in scale, protecting data in all situations becomes increasingly important. We’ve encountered several unfortunate situations where key company data was compromised—that is, read, copied,
altered, or stolen—during our many years as professional database consultants. In more than one situation, failure to properly protect company data has led to disciplinary action or even termination for staff who failed to be diligent about data security (which should include access auditing). Because BI projects usually include very large sets of data, we consider proper attention to auditing and compliance a key planning consideration from the earliest stages of every project. A great tool to use for access auditing is SQL Server Profiler. This tool has two main uses: compliance (via access auditing) and performance monitoring. In this section, we’re talking only about the first use. Later in your project cycle, you’ll use SQL Server Profiler for performance tuning. We find that SQL Server Profiler is underused or not used at all in BI projects, which is unfortunate when you consider the business value the tool can bring to your project. Figure 4-18 shows SQL Server Profiler in action, with a live trace capturing an MDX query.
Figure 4-18 SQL Server Profiler traces capture SSAS activity such as MDX queries.
To use SQL Server Profiler to capture activity on SSAS cubes and mining models, you need to perform these steps:

1. Connect to SSAS using an account that is a member of the Analysis Services Administrators server role. You do not have to be a local machine administrator to run SQL Server Profiler.

2. Select File, New Trace to create a new trace. Select the default trace definition in the New Trace dialog box. You might want to add more events to the trace definition to capture activity of interest. For security compliance, this at least includes logon activity. You can also choose to monitor the success or failure of permissions in accessing particular statements or objects.
3. After selecting the objects to include in your trace definition in the New Trace dialog box, you can limit the trace to data structures of interest by clicking the Filter button in the New Trace Definition dialog box and then typing the name of those structures in the text box below the LIKE filter. Keep in mind that you can use SQL Server Profiler trace definitions to capture activity on source SQL Server OLTP systems as well as on SSAS cubes and mining models.

4. Continue to set filters on the trace as needed. We like to filter out extraneous information by entering values in the text box associated with the NOT LIKE trace definition filter.

Our goal is always to capture only the absolute minimum amount of information. SQL Server Profiler traces are inclusive by nature and can quickly become huge if not set up in a restrictive way. Huge traces are unproductive for two reasons: you probably don’t have the time to read through them looking for problems, and capturing large amounts of information places a heavy load on the servers where tracing is occurring. We’ve seen, even in development environments, that this load can slow down performance significantly. Figure 4-19 shows the events you can select to monitor related to security for SSAS.
Figure 4-19 SQL Server Profiler traces can capture security audit events.
Note A side benefit of using SQL Server Profiler during the early phases of your BI project (in addition to helping you maintain compliance with security standards) is that your team will become adept at using the tool itself. We’ve seen many customers underuse SQL Server Profiler (or not use it at all) during the later performance-tuning phases of a project. The more you and your developers work with SQL Server Profiler, the more business value you’ll derive from it. Another aspect of SQL Server Profiler traces more closely related to performance monitoring is that trace activity can be played back on another server. This helps with debugging.
Auditing Features in SQL Server 2008

SQL Server 2008 has several new features related to auditing and compliance. If you’re using SQL Server 2008 as one of the source databases in your BI solution, you can implement additional security auditing on this source data, in addition to any auditing that might already be implemented on the data as it is consumed by SSIS, SSAS, or SSRS. This additional layer of auditing might be quite desirable, depending on the security requirements you’re working with. One example of the auditing capabilities included for SQL Server 2008 OLTP data is the new server-level audit feature. Server audits can be configured by using SSMS or by using newly added Transact-SQL commands for SQL Server 2008. Figure 4-20 shows a key dialog box for doing this in SSMS.
Figure 4-20 The new server audit dialog box in SQL Server 2008
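To give you a feel for the Transact-SQL side of this feature, here is a minimal sketch that creates a server audit and a server audit specification that captures login activity. The audit name and file path are placeholders; your compliance requirements determine which audit action groups you actually capture.

-- Minimal sketch: a server audit plus a server-level audit specification (SQL Server 2008).
-- The audit name and file path are placeholders.
CREATE SERVER AUDIT [BI_SourceData_Audit]
TO FILE ( FILEPATH = N'E:\AuditLogs\' );
GO

CREATE SERVER AUDIT SPECIFICATION [BI_Login_Audit_Spec]
FOR SERVER AUDIT [BI_SourceData_Audit]
ADD (SUCCESSFUL_LOGIN_GROUP),
ADD (FAILED_LOGIN_GROUP)
WITH (STATE = ON);
GO

-- Enable the audit after the specification is in place.
ALTER SERVER AUDIT [BI_SourceData_Audit] WITH (STATE = ON);
GO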
Source Control

A final consideration for your development environment is how you will handle source control for all the code your team generates. Keep in mind that the types of code you must maintain vary depending on which component of the SQL Server 2008 BI stack you’re working with. As with any professional development project, if your developer team consists of multiple people, you should select and use a production-quality source control system. We
use Visual Studio Team System Team Foundation Server most often for this purpose. There are, of course, many source control solutions available on the market. The obvious approach is to pick one and get everyone on the team to use it. The type of code your team will be checking in to the selected source control system will vary depending on which component or components of the BI solution they are working on. For SSAS, you’ll be concerned with XMLA for OLAP cube and data mining model metadata, as well as MDX for OLAP cubes (queries or expressions) and DMX for data mining structures (queries or expressions). For SSIS, you’ll be managing SSIS packages. These packages are written in an XML dialect and can also contain code, such as VBScript, C#, and so on. You might also be concerned with backing up dependent external files that contain package configuration information. These files are also in an XML format. Another source control consideration for SSIS is storage location. Your choices (built into the BIDS interface) are to store packages in SQL Server (which stores them in the sysssispackages table in the msdb database), on the file system (in any folder you specify), or in the SSIS package store (which is a set of folders named File System and msdb). If you choose to store packages in msdb, you can use SQL Server backup to back up that database, which will include the SSIS packages stored there. These options are shown in Figure 4-21.
Figure 4-21 You have three options for where to store SSIS packages.
Finally, when you are planning your SSIS backup, remember that the SSIS service configuration file, MsDtsSrvr.ini.xml, lists the folders on the server that SSIS monitors; you should make sure these folders are backed up as well. For SSRS, you’ll have RDL code for report definitions and report models. These are .rdl and .smdl files. Also, SSRS stores metadata in SQL Server databases called reportserver and
reportservertempdb. You’ll also want to back up the configuration files and the encryption key. See the topic “Back Up and Restore Operations for a Reporting Services Installation” in SQL Server Books Online. When you’re conducting SSAS development in BIDS, your team has two possible modes of working with SSAS objects: online (live, or connected to deployed cubes and mining models) and project mode (disconnected, or offline). In online mode, all changes made in BIDS are applied to the deployed solution when you choose to process those changes. If you have created invalid XMLA using the BIDS designer, processing it could break your live OLAP cubes or data mining structures. BIDS will generate a detailed error listing showing information about any breaking errors. These are also indicated in the designer by red squiggly lines. Blue squiggly lines indicate AMO design warnings; these will not break an SSAS object, although they do indicate nonstandard design, which could result in suboptimal performance. In project mode, XMLA is generated by your actions in the BIDS designer, and you must deploy that XMLA to a live server to apply the changes. If you deploy breaking changes (that is, code with errors), the deploy step might fail. In that case, the process dialog box in BIDS will display information about the failed deployment step (or steps). It is important to understand that when you deploy changes that you’ve created offline, the last change overwrites all intermediate changes. So if you’re working with a team of SSAS developers, you should definitely use a formal third-party source control system to manage offline work (to avoid one developer inadvertently overwriting another’s work on check-in). Figure 4-22 shows the online mode indicator in the BIDS environment title bar. Figure 4-23 shows the project (offline) mode indicator. Note that online mode includes both the SSAS project and server name (to which you are connected), and that project mode shows only the solution name.
Figure 4-22 BIDS online mode includes the SSAS server name.
Figure 4-23 BIDS project mode shows only the SSAS solution name.
Summary

In this chapter, we got physical. We discussed best practices for implementing an effective development environment for BI projects. Our discussion included best practices for surveying, preparing, planning, and implementing both physical servers and logical services. We included a review of security requirements and security implementation practices. We also discussed preparatory steps, such as conducting detailed surveys of your existing environment, for setting up your new production environment. In the next chapter, we turn our attention to the modeling processes associated with BI projects. We look in detail at architectural considerations related to logical modeling. These include implementing OLAP concepts such as cubes, measures, and dimensions by using SQL Server Analysis Services as well as other tools. We’ll also talk about modeling for data mining using Analysis Services. Appropriate logical modeling is one of the cornerstones to implementing a successful BI project, so we’ve devoted an entire chapter to it.
Chapter 5
Logical OLAP Design Concepts for Architects

Correctly modeled and implemented OLAP cubes are the core of any Microsoft SQL Server 2008 business intelligence (BI) project. The most successful projects have a solid foundation of appropriate logical modeling prior to the start of developing cubes using SQL Server Analysis Services (SSAS). Effective OLAP modeling can be quite tricky to master, particularly if you’ve spent years working with relational databases. The reason for this is that proper OLAP modeling is often the exact opposite of what you had been using in OLTP modeling. We’ve seen over and over that unlearning old habits can be quite challenging. One example of how OLAP modeling is the opposite of OLTP modeling is the deliberate denormalization (duplication of data) in OLAP models, a practice that stands in contrast to the typical normalization in OLTP models. We’ll explore more examples throughout this chapter. In this chapter, we’ll explore classic OLAP cube modeling in depth. This type of modeling is based on a formalized source structure called a star schema. As previously mentioned, this type of modeling is also called dimensional modeling in OLAP literature. We’ll take a close look at all the parts of this type of database schema, and then we’ll talk about different approaches to get there, including model first or data first. We’ll also discuss the real-world cost to your project of deviating from this standard design. Because data mining is another key aspect of BI solutions, we’ll devote two chapters (Chapter 12, “Understanding Data Mining Structures,” and Chapter 13, “Implementing Data Mining Structures”) to it. There we’ll include information about best practices regarding the use of logical modeling as a basis for building data mining structures. Modeling for data mining is quite different than the modeling techniques you use in OLAP cube modeling. We believe that most BI projects include OLAP and data mining implementations and the appropriate time to model for both is at the beginning of the BI project. We’ll start, however, with OLAP modeling because that is where most readers of this book will start the modeling process for their BI project.
Designing Basic OLAP Cubes

Before we begin our discussion of OLAP modeling specifically, let’s take a minute to discuss an even more fundamental idea for your BI project. Is a formalized method (that is, star schema) strictly required as a source schema for an OLAP cube? The technical answer to this question is “No.” Microsoft purposely does not require you to base OLAP cubes only on source data that is in a star schema format. In other words, you can create cubes based on
OLTP (or relational and normalized data) or almost any data source in any format that you can connect to via the included SSAS data source providers. SSAS 2008 is designed with a tremendous amount of flexibility with regard to source data. This flexibility is both a blessing and a curse. If logical OLAP modeling is done well, implementing cubes using SSAS can be very straightforward. If such modeling is done poorly, cube implementation can be challenging at best and counterproductive at worst. We’ve seen many BI projects go astray at this point in development. It’s physically possible to use SSAS to build an OLAP cube from nearly any type of relational data source. But this is not necessarily a good thing! Particularly if you are new to BI (using OLAP cubes), you’ll probably want to build your first few projects by taking advantage of the included wizards and tools in Business Intelligence Development Studio (BIDS). These timesavers are designed to work using traditional star schema source data. If you intend to go this route, and we expect that most of our readers will, it’s important to understand dimensional modeling and to attempt to provide SSAS with data that is as close to this format as possible. You might be wondering why Microsoft designed BIDS wizards and tools this way but doesn’t require star schemas as source data. The answer is that Microsoft wants to provide OLAP developers with a large amount of flexibility. By flexibility, we mean the ability for developers to create cubes manually. What this means in practice is that for the first few attempts, you’ll fare much better if you “stick to the star.” After you become more experienced at cube building, you’ll probably find yourself enjoying the included flexibility to go “beyond the star” and build some parts of your BI solution manually. Another way to understand this important concept is to think of a star schema as a starting point for all cube projects, after which you have lots of flexibility to go outside of the normal rigid star requirements to suit your particular business needs. What this does not mean, however, is that you can or should completely ignore dimensional modeling—and, yes, we’ve seen quite a few clients make this mistake! Assuming we’ve captured your interest, let’s take a closer look at dimensional modeling for OLAP cubes.
Star Schemas

Key and core to OLAP logical modeling is the idea of using at least one (and usually more than one) star schema source data model. These models can be physical structures (relational tables) or logical structures (views). What we mean is that the model itself can be materialized (stored on disk), or it can be created via a query (normally called a view) against the source data. We’ve seen clients have success with both approaches. Sometimes we see a combination of these approaches as well—that is, some physical storage and some logical definition. Despite the instances of success using other implementations, the most common implementation is to store the star schema source data on disk. Because on-disk storage entails
making a copy of all the source data you intend to use to populate your OLAP cube (which can, of course, be a huge amount of data), the decision of whether to use physical or logical source data is not a trivial one. For this reason, we’ll address the advantages and disadvantages of both methods in more detail later in this chapter. At this point, we’ll just say that the most common reasons to incur physical disk storage overhead are to improve cube load performance (because simple disk reads are usually faster than aggregated view queries) and reduce the query load on OLTP source systems. So then, what is this mysterious star schema source structure? It consists of two types of relational sources (usually tables), called fact and dimension tables. These tables can be stored in any location to which SSAS can connect. Figure 5-1 shows a list of the included providers in SSAS.
Figure 5-1 SSAS providers
For the purposes of our discussion, we’ll reference source data stored in a SQL Server 2008 RDBMS instance in one single database. It is in no way a requirement to use only SQL Server to store source data for a cube. Source data can be stored in any system for which SSAS includes a provider. Production situations usually are much more complex in terms of source data than our simplified example. At this point in our discussion, we want to focus on explaining the concept of dimensional modeling, and simplicity is best. A star schema consists of at least one fact table and at least one dimension table. Usually, there are many dimension tables, which are related to one or more fact tables. These two types of tables each have a particular schema. For a pure star design, the rows in the two types of tables are related via a direct primary key-foreign key relationship. (Other types of relationships are possible, and we’ll cover those later in this chapter.) Specifically, primary keys uniquely identify each row in a dimension table and are related to foreign keys that reside in the fact table. The term star schema originated from a visualization of this relationship, as
illustrated in Figure 5-2. One aspect of the visualization does not hold true—that is, there is no standard-size star. Rather than the five points that we think of as part of a typical drawing of a star, an OLAP star schema can have 5, 50, or 500 dimension tables or points of the star. The key is to think about the relationships between the rows in the two types of tables. We’ve taken a view (called a data source view) from SSAS in BIDS and marked it up so that you can visualize this structure for yourself.
Figure 5-2 Visualization of a star schema
Next we’ll take a closer look at the particular structures of these two types of tables—fact and dimension.
Fact Tables

A star schema fact table consists of at least two types of columns: key columns and fact (or measure) columns. As mentioned, the key columns are foreign-key values that relate rows in the fact table to rows in one or more dimension tables. The fact columns are most often numeric values and are usually additive. These fact columns express key business metrics. An example of this is sales amount or sales quantity.
Note  Facts or measures, what’s the difference? Technically, facts are individual values stored in rows in the fact table, and measures are those values as stored and displayed in an OLAP cube. It’s common in the OLAP literature to see the terms facts and measures used interchangeably.

An example fact table, called FactResellerSales (from the AdventureWorksDW2008 sample), is shown in Figure 5-3. As mentioned previously, some of the example table names might vary slightly because we used community technology preview (CTP) samples, rather than the release to manufacturing (RTM) version, in the early writing of this book. It’s standard design practice to use the word fact in fact table names, either at the beginning of the name as shown here (FactResellerSales) or at the end (ResellerSalesFact). In Figure 5-3, note the set of columns all named xxxKey. These columns are the foreign key values. They provide the relationship from rows in the fact table to rows in one or more dimension tables and are said to “give meaning” to the facts. These columns are usually of data type int for SQL Server and, more generally, integers for most RDBMSs. In Figure 5-3, the fact columns that will be translated into measures in the cube start with the SalesOrderNumber column and include typical fact columns, such as OrderQuantity, UnitPrice, and SalesAmount.
Figure 5-3 Typical star schema fact table, showing foreign key columns, fact source columns, and other columns
Fact tables can also contain columns that are neither keys nor facts. These columns are the basis for a special type of dimension called a degenerate dimension. For example, the RevisionNumber column provides information about each row (or fact), but it’s neither a key nor a fact.
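To make these column types concrete, here is a simplified Transact-SQL sketch of a fact table. It is loosely patterned on FactResellerSales, but the names and data types are illustrative only and are not the actual AdventureWorksDW2008 definition.

-- Simplified sketch of a star schema fact table (illustrative names and types).
CREATE TABLE dbo.FactResellerSalesSample (
    ProductKey       int          NOT NULL,   -- foreign keys to dimension tables
    OrderDateKey     int          NOT NULL,
    ResellerKey      int          NOT NULL,
    CurrencyKey      int          NOT NULL,
    SalesOrderNumber nvarchar(20) NOT NULL,   -- neither key nor fact (degenerate dimension)
    OrderQuantity    smallint     NOT NULL,   -- fact (measure) columns
    UnitPrice        money        NOT NULL,
    SalesAmount      money        NOT NULL
);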
Exploring the actual data in this sample fact table can yield further insight into fact table modeling. The SSAS interface in BIDS includes the ability to do this. You can simply right-click on any source table in the Data Source View section and then select the menu option Explore Data. Partial results are shown in Figure 5-4. Note the duplication of data in the results in the CurrencyKey, SalesTerritoryKey, and SalesOrderNumber columns—this is deliberate. It’s clear when reviewing this table that you only have part of the story with a fact table. This part can be summed up by saying this: Fact tables contain foreign keys and facts. So what gives meaning to these facts? That’s the purpose of a dimension table.
Figure 5-4 Fact table data consists of denormalized (deliberately duplicated) data.
BIDS contains several ways to explore the data in a source table in a data source view. In addition to the table view shown in Figure 5-4, you can also choose a pivot table, chart, or pivot chart view of the data. In the case of fact table data, choosing a different view really only illustrates the fact that tables are only a part of a star schema. Figure 5-5 shows a pivot table view of the same fact table we’ve been looking at in this section. The view is really not meaningful because facts (or measures) are associated only with keys, rather than with data. We show you this view so that you can visualize how the data from the fact and dimension tables works together to present a complete picture.
Figure 5-5 Fact table information in a pivot table view in BIDS
Another important consideration when modeling your fact tables is to keep the tables as narrow as possible. Another way to say this is that you should have a business justification for each column you include. The reason for this is that a star schema fact table generally contains a much larger number of rows than any one dimension table. So fact tables represent your most significant storage space concern in an OLAP solution. We’ll add a bit more information about modeling fact tables after we take an initial look at the other key component to a star schema—dimension tables.
Dimension Tables

As we mentioned previously, the rows in the dimension table provide meaning to the rows in the fact table. Each dimension table describes a particular business entity or aspect of the facts (rows) in the fact table. Dimension tables are typically based on factors such as time, customers, and products. Dimension tables consist of three types of columns:

■ A newly generated primary key (sometimes called a surrogate key) for each row in the dimension table
■ The original, or alternate, primary key

■ Any additional columns that further describe the business entity, such as a Customers table with columns such as FirstName, LastName, and so on
We’ll take a closer look at the sample table representing this information from AdventureWorksDW2008, which is called DimCustomer and is shown in Figure 5-6. As with fact tables, the naming convention for dimension tables usually includes the dim prefix as part of the table name.
Figure 5-6 Customer dimension tables contain a large number of columns.
Dimension tables are not required to contain two types of keys. You can create dimension tables using only the original primary key. We do not recommend this practice, however. One reason to generate a new unique dimension key is that it’s common to load data into dimensions from disparate data sources (for example, from a SQL Server table and a Microsoft Office Excel spreadsheet). Without generating new keys, you have no guarantee of having a unique identifier for each row. Even if you are retrieving source data from a single source database initially, it’s an important best practice to add this new surrogate key when you load the dimension table. The reason is that business conditions can quickly change—you might find yourself having to modify a production cube to add data from another source for many reasons (a business merger, an acquisition, a need for competitor data, and so on). We recommend that you always use surrogate keys when building dimension tables.
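Here is a simplified Transact-SQL sketch of a dimension table that follows this recommendation. It is loosely patterned on DimCustomer, but the names and data types are illustrative only and are not the actual AdventureWorksDW2008 definition.

-- Simplified sketch of a dimension table with a new surrogate key plus the original key.
CREATE TABLE dbo.DimCustomerSample (
    CustomerKey          int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- new surrogate key
    CustomerAlternateKey nvarchar(15)      NOT NULL,              -- original (source) key
    FirstName            nvarchar(50)      NULL,                  -- descriptive attributes
    LastName             nvarchar(50)      NULL,
    EmailAddress         nvarchar(100)     NULL,
    CommuteDistance      nvarchar(15)      NULL
);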
Note that this table is particularly wide, meaning it contains a large number of columns. This is typical for an OLAP design. In your original source data, the customer data was probably stored in many separate tables, yet here it’s all lumped into one big table. You might be wondering why. We’ll get to that answer in just a minute. Before we do, however, let’s take a look at the data in the sample dimension table using the standard table viewer and the pivot table viewer for SSAS data source views in BIDS. Figure 5-7 shows the result of exploring the table view. To see the exact view that’s shown in Figure 5-7, after the Explore Data window opens, scroll a bit to the right.
Figure 5-7 Dimension table data is heavily denormalized in a dimensional design.
The pivot table view of the dimension data is a bit more interesting than the fact table information. Here we show the pivot chart view. This allows you to perform a quick visual validation of the data—that is, determine whether it looks correct. Figure 5-8 shows this view. The ability to sample and validate data using the table, pivot table, chart, and pivot chart views in the data source view in BIDS is quite a handy feature, particularly during the early phases of your OLAP design. Figure 5-8 shows how you can select the columns of interest to be shown in any of the pivot views.
Figure 5-8 The pivot viewer included in BIDS for the SSAS data source view
A rule of thumb for dimension table design is that, when in doubt, you should include the column—meaning that adding columns to dimension tables is generally good design. Remember that this is not the case with fact tables. The reason is that most dimension tables contain a relatively small number of rows compared to a fact table, so adding information does not significantly slow performance. There are, of course, limits to this. However, most of our customers have been quite pleased by the large number of attributes they can associate with key business entities in dimension tables. Being able to include lots of attributes is, in fact, a core reason to use OLAP cubes overall—by this, we mean the ability to combine a large amount of information into a single, simple, and fast queryable structure. Although most limits to the quantity of dimensional attributes you can associate with a particular business entity have been removed in SSAS 2008, you do want to base the inclusion of columns in your dimension tables on business needs. In our real-world experience, this value is usually between 10 and 50 attributes (or columns) per dimension.
Dimensional information is organized a bit differently than you see it here after it has been loaded into an OLAP cube. Rather than simply using a flat view of all information from all rows in the source dimension table, OLAP cubes are generally built with dimensional hierarchies, or nested groupings of information. We’ll look more closely at how this is done later in this chapter.
Denormalization

Earlier, we mentioned the difficulty that we commonly see regarding understanding and implementing effective OLAP modeling, particularly by those with a lot of traditional database (OLTP) experience. The concept of denormalization is key to understanding this. Simply put, denormalization is the deliberate duplication of data to reduce the number of entities (usually tables) that are needed to hold information. Relational data is generally modeled in a highly normalized fashion. In other words, data is deliberately not duplicated. This is primarily done to facilitate quick data manipulation, via INSERT, UPDATE, and DELETE operations. The fewer times a data item appears in a set of tables, the faster these operations run. Why then is an OLAP source denormalized? The answer lies in the structure of cubes themselves. Although it’s difficult to visualize, an OLAP cube is an n-dimensional structure that is completely denormalized. That is, it’s one big storage unit, conceptually similar to a huge Excel PivotTable view. Another way to think about this is that an OLAP cube has significantly fewer joins than any RDBMS system. This is one of the reasons that OLAP cube query results are returned so much more quickly than results of queries to RDBMS systems that include many joins. This is the exact opposite of the classic RDBMS database, which consists of lots of disparate tables. And this is the crux of OLAP modeling—because, of course, you’ll be extracting data from normalized OLTP source systems. So just exactly how do you transform the data and then load the transformed data into OLAP cubes?

Note  If you’re interested in further discussions about normalization and denormalization, go to the following Web site for an interesting discussion of these two different database modeling styles: http://www.devx.com/ibm/Article/20859.
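To make the idea of denormalization concrete, the following Transact-SQL sketch flattens several hypothetical normalized OLTP tables into a single wide rowset that could feed a customer dimension. The table and column names are invented for this illustration.

-- Minimal sketch: a denormalizing view over hypothetical normalized source tables.
CREATE VIEW dbo.vDimCustomerSource
AS
SELECT  c.CustomerID   AS CustomerAlternateKey,
        c.FirstName,
        c.LastName,
        a.City,
        a.StateProvince,
        a.CountryRegion
FROM    dbo.Customer        AS c
JOIN    dbo.CustomerAddress AS ca ON ca.CustomerID = c.CustomerID
JOIN    dbo.Address         AS a  ON a.AddressID  = ca.AddressID;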
Back to the Star

There are additional options for OLAP modeling (that is, using table types other than fact tables and star dimension tables) that we’ll discuss later in this chapter, but the basic concept for OLAP modeling is simply a fact table and some related dimension tables. Figure 5-9 shows a conceptual star schema. Note that we’ve modeled the dimension keys in the preferred way in this diagram—that is, using original and new (or surrogate) keys.
Figure 5-9 Conceptual view of a star schema model
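As a concrete, purely hypothetical illustration of the original-plus-surrogate key approach shown in Figure 5-9, the following Transact-SQL sketch creates one dimension table and one fact table. The table and column names are ours, not a prescribed standard.

-- Dimension table: surrogate key (CustomerKey) plus the original source key (CustomerAlternateKey)
CREATE TABLE dbo.DimCustomer (
    CustomerKey          INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- new (surrogate) key
    CustomerAlternateKey INT NOT NULL,                            -- original key from the source system
    CustomerName         NVARCHAR(100) NOT NULL,
    City                 NVARCHAR(50) NULL,
    Region               NVARCHAR(50) NULL
);

-- Fact table: foreign keys to the dimensions plus numeric measures
CREATE TABLE dbo.FactSales (
    CustomerKey   INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    DateKey       INT NOT NULL,   -- would reference a date dimension table
    SalesAmount   MONEY NOT NULL,
    SalesQuantity INT NOT NULL
);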
So why create star schemas? As discussed earlier, the simplest answer is that star schemas work best in the BIDS development environment for SSAS. Although it's possible to create a cube from OLTP (or normalized) source data, the results will not be optimal without a great deal of manual work on your part. In general, we do not recommend this practice. Also, it's common to discover flawed data in source systems during the life cycle of a BI project. We find that at least some source data usually needs to be part of a data-cleansing and validation process. This validation is best performed during the extract, transform, and load (ETL) phase of your project, which occurs prior to loading OLAP cubes with data.

With the 2008 release of SSAS, Microsoft has improved the flexibility of source structures for cubes. What this means is that you start with a series of star schemas and then make adjustments to your model to allow for business situations that fall outside of a strict star schema model. One example of this improved flexibility is the ability to base a single cube on multiple fact tables. Figure 5-10 shows an example of using two fact tables in an OLAP star schema. This type of modeling is usually used because some, but not all, dimensions relate to some, but not all, facts. Another example of this is the need to allow null values to be loaded into a dimension or fact table and to define a translation of those nulls into a non-null value that is understandable by end users—for example, "unknown."

In many OLAP products (including versions of SSAS older than 2005), you were limited to creating one cube per fact table. So, if you wanted a unified view of your data, you'd be forced to manually "union" those cubes together. This caused unnecessary complexity and additional administrative overhead. In SSAS 2008, the Dimension Usage tab in the cube designer in BIDS allows you to define the grain of each relationship between the rows in the various fact tables and the rows in the dimension tables. This improved flexibility now results in most BI solutions being based on a single, large (or even huge) cube (or view of enterprise data). This type of modeling reflects the business need to have a single, unified version of relevant business data. This cube presents the data at whatever level of detail is meaningful for the particular end user—that is, it can be summarized, detailed, or any combination
of both. This ability to create one view of the (business) “truth” is one of the most compelling features of SSAS 2008.
Figure 5-10 Conceptual view of a star schema model using two fact tables
To drill into the Dimension Usage tab, we'll look at Figure 5-11. You'll find this tab in BIDS after you double-click the sample OLAP cube; it's the second tab in the cube designer. Here the employee dimension has no relationship with the rows (or facts) in the Internet Sales table (because no employees are involved in Internet sales), but it does have a relationship with the Reseller Sales table (because employees are involved in reseller sales). Also, the customer dimension has no relationship with the rows in the Reseller Sales table because customers are not resellers, but the customer dimension does have a relationship with the data in the Internet Sales group (because customers do make Internet purchases). Dimensions common to both fact tables are products and time (due date, order date, and ship date). Note that the time dimension has three aliases. Aliasing the same dimension multiple times is called a role-playing dimension. We'll discuss this type of dimension in more detail in Chapter 8, "Refining Cubes and Dimensions."
Figure 5-11 The Dimension Usage tab in BIDS shows the relationships between the fact and dimension tables.
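A hedged sketch of how a role-playing time dimension is typically modeled in the source schema (the table and column names here are ours, for illustration only): a single date dimension table is referenced three times by the fact table, once per date role.

-- One physical date dimension...
CREATE TABLE dbo.DimDate (
    DateKey      INT NOT NULL PRIMARY KEY,   -- for example, 20090315
    CalendarDate DATE NOT NULL,
    CalendarYear INT NOT NULL
);

-- ...referenced three times by the fact table, producing the Order Date,
-- Due Date, and Ship Date role-playing dimensions in the cube
CREATE TABLE dbo.FactResellerSales (
    OrderDateKey INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    DueDateKey   INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    ShipDateKey  INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    EmployeeKey  INT NOT NULL,   -- relates to reseller sales only, per the discussion above
    SalesAmount  MONEY NOT NULL
);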
When creating a cube, the SSAS Cube Wizard detects the relationships between dimension and fact table rows and populates the Dimension Usage tab with its best guess, based on the source column names. You should review these relationships and update them as needed if the results do not exactly match your particular business scenarios. When we review building your first cube in Chapter 7, "Designing OLAP Cubes Using BIDS," we'll show more specific examples of updates you might need to make.
Using Grain Statements

By this point, you should understand the importance and basic structure of a star schema. Given this knowledge, you might be asking yourself, "So, if the star schema is so important, what's the best way for me to quickly and accurately create this model?" In our experience, if you begin with your end business goals in mind, you'll arrive at the best result in the quickest fashion, and that result will take the form of a series of tightly defined grain statements. A grain statement is a verbal expression of the facts (or measures) and dimensions, expressed at an appropriate level of granularity. Granularity means level of detail. An example is a time dimension. Time granularity can be a month, day, or hour. Specifically, well-written grain statements capture the most detailed level of granularity that is needed for the business requirements. Here are some examples of simple grain statements (a fact-table sketch for the first one follows the list):

■■ Show the sales amount and sales quantity (facts or measures) by day, product, employee, and store location (with included dimensions being time, products, employees, and geography). The grain of each dimension is also expressed—that is, time dimension at the grain of each day, products dimension at the grain of each product, and so on.
■■ Show the average score and quantity of courses taken by course, day, student, manager, curriculum, and curriculum type.

■■ Show the count of offenders by location, offense type, month, and arresting officer.
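To show how a grain statement translates into a fact table design, here is a hypothetical Transact-SQL sketch for the first statement above (sales amount and quantity by day, product, employee, and store location). The names are illustrative, not prescriptive.

-- Grain: one row per day, per product, per employee, per store location
CREATE TABLE dbo.FactStoreSales (
    DateKey       INT NOT NULL,     -- time dimension at the day grain
    ProductKey    INT NOT NULL,     -- products dimension at the product grain
    EmployeeKey   INT NOT NULL,     -- employees dimension
    StoreKey      INT NOT NULL,     -- geography dimension at the store-location grain
    SalesAmount   MONEY NOT NULL,   -- fact (measure)
    SalesQuantity INT NOT NULL      -- fact (measure)
);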
Well-written grain statements are extremely useful in OLAP modeling. It's critical that each of the grain statements be validated by both subject matter experts and business decision makers prior to beginning the modeling phase. We use a sign-off procedure to ensure that appropriate validation of grain statements has taken place before we begin any type of prototype (or production) development work. We consider the sign-off on grain statements as a key driver in moving from the envisioning to the design and development stages of work. Here is a sampling of validation questions that we ask to determine (or validate) grain statements:

■■ What are the key metrics for your business? Some examples for a company that sells products are as follows: sales amount, sales quantity, gross profit, net profit, expenses, and so on.
■■ What factors do you use to evaluate those key metrics? For example, do you evaluate sales amount by customer, employee, store, date, or "what"?

■■ By what level of granularity do you evaluate each factor? For example, do you evaluate sales amount by day or by hour? Do you evaluate customer by store or by region?
As you can see by the example grain statements listed, BI solutions can be used by a broad range of organizations. In our experience, the "Show me the sales amount by day" model, although it's the most typical, is not the only situation in which BI can prove useful. Some other interesting projects we've worked on included using OLAP cubes to improve decision support for the following business scenarios:

■■ Improve detection of foster care families not meeting all state requirements. (SSAS data mining was also used in this scenario.)
■■ Provide a flexible, fast query system to look up university course credits that are transferable to other universities.

■■ Improve food and labor costs for a restaurant by viewing and acting on both trends and exception conditions.

■■ Track the use and effectiveness of a set of online training programs by improving the timeliness and flexibility of available reports.
As we've mentioned, when considering why and where you might implement SSAS OLAP cubes in your enterprise, it's important to think broadly across the organization—that is, to ask which groups would benefit from an aggregated view of their (and possibly other groups') data. BI for everyone via OLAP cubes is one of the core design features of the entire BI suite of products from Microsoft.
Design Approaches

There are two ways to approach the design phase of BI: schema first or data first. Schema first means that the OLAP architect designs the BI structures based on business requirements. This means starting with the careful definition of one or more grain statements. Then empty OLAP cube structures (star schema tables) are created and mappings to source data are drawn up. ETL processes connect the source data to the destination locations in this model. This method is used for large and complex projects. It's also used for projects that will load data that must be cleansed and validated prior to loading. And this method is used when the BI architect does not have administrative rights on source data systems.

In data-first modeling, source data is shaped for loading into OLAP cubes. This is often done via views against that data. If, for example, source data is stored in an RDBMS, the views are written in SQL. This approach is used for smaller, simpler implementations, often in shops where the DBA is the BI architect, so she already has administrative control over the source RDBMS data. Also, the source data must be in a very clean state—that is, it should have already been validated, with the error data having been removed. The approach you take depends on a number of factors (a view sketch for the data-first approach follows this list):

■■ Your skill and experience with OLAP modeling
■■ The number of data sources involved in the project and their complexity

■■ Your access level to the source data
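In the data-first approach, the "shaping" is frequently nothing more than a relational view per dimension or fact table. The following is a minimal sketch, assuming hypothetical OLTP table names (Orders and OrderDetails), of a view that presents order data in a fact-like shape for cube loading.

-- A view that shapes OLTP order data into a fact-like structure for the cube
-- (data-first approach; assumes the source data is already clean and validated)
CREATE VIEW dbo.vwFactOrders
AS
SELECT
    o.CustomerID                                       AS CustomerKey,
    CONVERT(INT, CONVERT(CHAR(8), o.OrderDate, 112))   AS DateKey,   -- yyyymmdd-style date key
    od.ProductID                                       AS ProductKey,
    od.Quantity                                        AS SalesQuantity,
    od.Quantity * od.UnitPrice                         AS SalesAmount
FROM dbo.Orders AS o
JOIN dbo.OrderDetails AS od ON od.OrderID = o.OrderID;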
We favor the schema-first approach because we find that it lends itself to cleaner designs of a more standard star schema type. The resulting OLAP cubes are easier to create and maintain, and they are generally more usable. Part of this cleanliness is driven by the method itself—designing an empty set of structures first, then mapping source data, and then cleaning and validating that source data prior to loading it into the destination structures. As stated, the BIDS tools for SSAS, although flexible, are primarily designed to work with source data that is as close to a classic OLAP (or star schema) format as possible. Given our preference for schema-first design, you might wonder which tools we recommend that you use to perform this task.
Choosing Tools to Create Your OLAP Model

Our OLAP modeling tool of choice is Visio 2007. We like it for its ease of use and availability, and we especially like its ability to quickly generate the Transact-SQL data definition language (DDL) source statements. This is important so that the design for OLAP cube source data can be rapidly materialized on your development server. Although we prefer Visio, it's also possible to use SSAS in BIDS itself to create an empty star schema model. We'll detail that process later. If you are already using another modeling tool, such as ERwin, that will work as well. Just pick a tool that you already know how to use (even if you've used it only for OLTP design) if possible.
You’ll usually start your design by creating dimension tables because much of the dimension data will be common to multiple grain statements. Common examples of this are time, customers, products, and so on. In the example we’ve provided in Figure 5-12, you can see that there are relatively few tables and that they are highly denormalized (meaning they contain many columns with redundant data—for example, in StudentDim, the region, area, and bigArea columns).
Figure 5-12 A model showing single-table source dimensions (star type) and one multitable source (snowflake type)
Contrast this type of design with OLTP source systems and you'll begin to understand the importance of the modeling phase in an OLAP project. In Figure 5-12, each dimension source table except two (OfferingDim and SurveyDim) is the basis of a single cube dimension. That is, StudentDim is the basis of the Student dimension, InstructorDim is the basis of the Instructor dimension, and so on. These are all examples of star dimensions. OfferingDim and SurveyDim have a primary key-foreign key relationship between their rows. They are the basis for a snowflake (or multitable-sourced) dimension. We'll talk more about snowflake dimensions later in this chapter. You'll also notice in Figure 5-12 that each table has two identity (or key) fields: a newID and an oldID. This follows the preferred method that we discussed earlier in this chapter. We've also included a diagram (Figure 5-13) of the fact tables for the same project. You can see that there are nearly as many fact tables as dimension tables in this particular model example. This is not necessarily common in OLAP model design. More commonly, you'll use from one to five fact tables with five to 15 dimension tables, or more, of both types. The reason we show this model is that it illustrates reasons for using multiple fact tables—for example, some Session types have facts measured by day, while other Session types have facts measured by hour. The ability to base a single OLAP cube on multiple fact tables is a valuable
addition to SSAS. It was introduced in SQL Server 2005, but we still see this multiple-fact-table capability underutilized because of a lack of understanding of star schema structure.
Figure 5-13 A model showing five fact tables to be used as source data for a single SSAS OLAP cube
Other Design Tips

As we mentioned in Chapter 4, "Physical Architecture in Business Intelligence Solutions," because your project is now in the developing phase, all documents, including any model files (such as .vsd models, .sql scripts, and so on), must be under source control if multiple people will be working on the OLAP models. You can use any source control product or method your team is comfortable with. The key is to use some type of tool to manage this process. We prefer Visual Studio Team System Team Foundation Server for source control and overall project management. We particularly like the new database source control support, because we generally use SQL Server as a relational source system and as an intermediate storage location for copies of data after ETL processes have been completed.

As in relational modeling, OLAP modeling is an iterative process. When you first start, you simply create the skeleton tables for your star schema by providing table names, keys, and a couple of essential column names (such as first name and last name for customer). Ideally, these models will be directly traceable to the grain statements you created during earlier phases of your project. As you continue to work on your design, you'll refine these models by adding more detail, such as updating column names, adding data types, and so on.

At this point, we'll remind you of the importance of using the customer's exact business terminology when naming objects in your model. The more frequently you can name source schema tables and columns per the captured taxonomy, the more quickly and easily your model can be understood, validated, and translated into cubes by everyone working on your project. We generally use a simple tool, such as an Excel spreadsheet, to document customer taxonomies. We've found that taking the time to document, validate, and use customer taxonomies in BI projects results in a much higher rate of adoption and satisfaction because the resulting solution is intuitive to use and has lower end-user training costs.
Using BIDS as a Designer

To create an empty OLAP cube structure using BIDS, you can use one of two methods. The first is to create a new cube using the wizard and then select the option of creating a new cube without using a data source. With this method, you design all parts of the cube using the SSAS wizards. We'll review the SSAS interface, including the wizards, in detail in Chapter 7. The second method is to build a prototype cube (or a dimension) without using a data source and to base your object on one or more template files. You can use the included sample template files to explore this method. The default cube template files are located at C:\Program Files\Microsoft SQL Server\100\Tools\Templates\olap\1033\Cube Templates. The default dimension template files are located at C:\Program Files\Microsoft SQL Server\100\Tools\Templates\olap\1033\Dimension Templates. These files are located in the Program Files (x86) folder on x64 machines. The files consist of XMLA scripts that are saved with the appropriate extension—for example, *.dim for dimension files. Later in this book, you'll learn how to generate metadata files for cubes and dimensions you design. You can also use files generated from cubes or dimensions that you have already designed as templates for new cubes or dimensions.

For either of these methods, after you complete your design, you materialize it into an RDBMS by clicking the Database menu in BIDS and then clicking Generate Relational Schema. This opens a configurable wizard that allows you to generate the Transact-SQL DDL code to create the empty OLAP source schema in a defined instance of SQL Server 2008.

Although the preceding methods are interesting and practical for some (simple) design projects, we still prefer to use Visio for most projects. The reason is that we find Visio more flexible than BIDS; however, that flexibility comes with a tradeoff. The tradeoff is that you must design the entire model from scratch in Visio. Visio contains no guidelines, wizards, or hints to help you model an OLAP cube using a star-schema-like source. Using BIDS, you can choose to use wizards to generate an iteration of your model and then manually modify that model. We can understand how using BIDS for this process would facilitate rapid prototyping. The key factor in deciding which method to use is your depth of knowledge of OLAP concepts—the BIDS method assumes you understand OLAP modeling; Visio, of course, does not.
Modeling Snowflake Dimensions

As mentioned previously, SSAS has increased the flexibility of source schemas to more easily accommodate common business needs that aren't easily modeled using star schemas. This section discusses some of those new or improved options in SQL Server 2008 SSAS. We'll begin with the most common variation on a standard star schema. We discuss it first because you'll probably have occasion to use it in your model. Quite a few people use it, and it's also the variation we most often see implemented inappropriately. It's called a snowflake schema.
Snowflake Schemas

A snowflake is a type of source schema used for dimensional modeling. Simply put, it means basing an OLAP cube dimension on more than one source (relational) table. The most common case is to use two source tables. However, if two or more tables are used as the basis of a snowflake dimension, there must be a key relationship between the rows in each of the tables containing the dimension information. Note in the example in Figure 5-14 that the DimCustomer table has a GeographyKey in it. This allows for the snowflake relationship between the rows in the DimGeography and DimCustomer tables to be detected by the New Cube Wizard in BIDS.
Figure 5-14 A snowflake dimension is based on more than one source table.
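A minimal Transact-SQL sketch of the snowflake relationship described above might look like the following; it extends the hypothetical DimCustomer sketch shown earlier with a GeographyKey column. The key names echo the AdventureWorksDW2008 style, but the definitions here are our own simplified assumptions, not the shipped schema.

-- Outer snowflake table
CREATE TABLE dbo.DimGeography (
    GeographyKey  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    City          NVARCHAR(50) NOT NULL,
    StateProvince NVARCHAR(50) NOT NULL,
    CountryRegion NVARCHAR(50) NOT NULL
);

-- The customer table carries GeographyKey, which is what lets the
-- New Cube Wizard detect the snowflake relationship between the two tables
CREATE TABLE dbo.DimCustomer (
    CustomerKey          INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerAlternateKey INT NOT NULL,
    CustomerName         NVARCHAR(100) NOT NULL,
    GeographyKey         INT NOT NULL REFERENCES dbo.DimGeography (GeographyKey)
);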
The Dimension Usage section of SSAS usually reflects the snowflake relationship you model when you initially create the cube using the New Cube Wizard (as long as the key columns have the same names across all related tables). If you need to manually adjust any relationships after the cube has been created, you can do that using tools provided in BIDS. Figure 5-15 shows the Dimension Usage tab in BIDS, which was shown earlier in the chapter. This time, we are going to drill down a bit deeper into using it. To adjust or verify any relationship, click the build button (the small gray square with three dots on it) on the dimension name at the intersection of the dimension and fact tables. We'll start by looking at a Regular (or star) dimension—which, in the screen shot shown, corresponds to the ProductKey build button.
Figure 5-15 The Dimension Usage tab allows you to establish the granularity of relationships between source tables.
After you click the build button, you'll access the Define Relationship dialog box. There you can confirm that the relationship BIDS detected during the cube build is correct. If the relationship or the level of granularity has been incorrectly defined, you can adjust it as well. In Figure 5-16, you can see that the (star or Regular) relationship has been correctly detected—you validate this by verifying that the correct identifying key columns were detected by BIDS when the cube was initially created. In this example, using the ProductKey from the Dim Product (as the primary key) and Fact Internet Sales (as the foreign key) tables reflects the intent of the OLAP design.

For a snowflake dimension, you review or refine the relationship between the related dimension tables in the Define Relationship dialog box that you access from the Dimension Usage tab in BIDS. An example is shown in Figure 5-17. Note that the dialog box itself changes to reflect the modeling needs—that is, you must select the intermediate dimension table and define the relationship between the two dimension tables by selecting the appropriate key columns. You'll generally leave the Materialize check box selected (the default setting) for snowflake dimensions. This causes the value of the link between the fact table and the reference dimension for each row to be stored in SSAS. This improves dimension query performance because the intermediate relationships are stored on disk rather than calculated at query time.
Figure 5-16 The most typical relationship is one-to-many between fact and dimension table rows.
Figure 5-17 The referenced relationship involves at least three source tables.
When Should You Use Snowflakes?

Because snowflakes add overhead to both cube processing time and query processing time, you should use them only when business needs justify their use. The reason they add overhead is that the data from all involved tables must be joined at processing time. This means the data must be sorted and matched prior to being fetched for loading into the OLAP dimension. Another source of overhead can occur on each query to members of that dimension, depending on whether the relationships are materialized (stored on disk as part of dimension processing) or must be resolved at dimension processing time and again at query time.
The most typical business situation that warrants a snowflake dimension design is one where you can reduce the size of the dimension table by removing one or more attributes that are not commonly used and placing that attribute (or attributes) in a separate dimension table. An example of this type of modeling is a customer dimension with an attribute (or some attributes) that is used for less than 20 percent of the customer records. Another way to think about this is by considering the predicted percentage of null values for an attribute. The greater the predicted percentage of nulls, the stronger the case for creating one or more separate attribute tables related via key columns. Taking the example further, you can imagine a business requirement to track any existing URL for a customer's Web site in a business scenario where very few of your customers actually have their own Web sites. By creating a separate but related table, you significantly reduce the size of the customer dimension table.

Another situation that might warrant a snowflake design is one in which the update behavior of particular dimensional attributes varies—that is, a certain set of dimensional attributes should have their values overwritten if updated values become part of the source data, whereas a different set should have new records written for each update (maintaining change history). Although it's possible to combine different types of update behavior, depending on the complexity of the dimension, it might be preferable to separate these attributes into different source tables so that the update mechanisms can be simpler.

Tip In the real world, clients who have extensive experience modeling normalized databases often overuse snowflakes in OLAP scenarios. Remember that the primary goal of the star schema is to denormalize the source data for efficiency. Any normalization, such as a snowflake dimension, should relate directly to business needs. Our experience is that usually less than 15 percent of dimensions need to be presented as snowflakes.
What Other Cube Design Variations Are Possible?

With SSAS 2008, you can use many other advanced design techniques when building your OLAP cubes. These include many-to-many dimensions, data mining dimensions, and much more. Because these advanced techniques are not tied to source schema modeling in the way that dimension modeling is, we cover these (and other) more advanced modeling techniques in future chapters.
Why Not Just Use Views Against the Relational Data Sources?

At this point, you might be thinking, "This OLAP modeling seems like a great deal of work. Why couldn't I just create views against my OLTP source (or sources) to get the same result?" Although nothing in BIDS or the SSAS OLAP cube designer prevents you from doing this, our experience has been that the relational source data is seldom clean enough to model against directly. Also, these designs seldom perform optimally.
Microsoft has tried to map out the middle ground regarding design with SQL Server 2008. In SQL Server 2005 SSAS, Microsoft removed most restrictions that required a strict star source schema. The results, unfortunately, especially for customers new to BI, were often less than optimal. Many poorly designed OLAP cubes were built. Although these cubes seemed OK in development, performance was often completely unacceptable under production loads. New in SQL Server 2008 is a set of built-in AMO design warnings, also called real-time best practice design warnings. When you attempt to build an OLAP cube in BIDS, these warnings appear if your cube does not comply with a set of known best design practices. These warnings are guidelines only. You can still build and deploy OLAP cubes of any design. Also, you can configure (or completely turn off) all these warnings.

Typically, the OLAP model is created and then validated against the grain statements. In the subsequent steps, source data is mapped to destination locations, and then it's cleaned. The validated data is then loaded into the newly created OLAP model via ETL processes. Although SSAS does not prevent you from building directly against your source data (without modeling, cleaning, and loading it into a new model), the reality that we've seen is that most organizations' data simply isn't prepared to allow a direct OLAP query against OLTP source data.

One area where relational views (rather than copies of data) are sometimes used in OLAP projects is as data sources for the ETL. That is, in environments where OLAP modelers and ETL engineers are not allowed direct access to data sources, it's common for them to access the various data sources via views created by DBAs. Also, keep in mind that the time involved to write the queries used in (and possibly to index) the relational views can be quite substantial.
More About Dimensional Modeling

The Unified Dimensional Model (UDM) is one of the key recent enhancements to SSAS. It was introduced in SSAS 2005; however, in our experience, it was not correctly understood or implemented by most customers. A big reason for this was improper modeling. For that reason, we'll spend some time drilling down into details regarding this topic. To start, you need to understand how dimensional data is best presented in OLAP cubes. As we've seen, dimension source data is best modeled in a very denormalized (or deliberately duplicated) way—most often, in a single table per entity. An example is one table with all product attribute information, such as product name, product size, product package color, product subcategory, product category, and so on. So the question is, how is that duplicated information (in this case, for product subcategories and categories) best displayed in a cube? The answer is in a hierarchy, or rollup. The easiest way to understand this is to visualize it. Looking at the following example, you can see the rollup levels in the AdventureWorksDW2008 cube product dimension sample. Note in Figure 5-18 that individual products roll up to higher levels—in this case, subcategories and categories. Modeling for hierarchy
building inside of OLAP cubes is a core concept in dimensional source data. Fortunately, SSAS includes lots of flexibility in the actual building of one or more hierarchies of this source data. So, once again, the key to easy implementation is appropriate source data modeling.
Figure 5-18 Dimension members are often grouped into hierarchies.
It will also be helpful for you to understand some terminology used in BIDS. Dimension structure refers to the level (or rollup group) names—in this case, Product Names, Product Subcategories, and Product Categories. Level names roughly correspond to source dimension table column names. Attribute relationships are defined on their own tab, the Attribute Relationships tab, which is new in SQL Server 2008 BIDS. This tab was added because so many customers defined these relationships incorrectly in previous versions of the product. Attribute relationships establish relationships between data in source columns from one or more dimension source tables. We'll take a closer look at the hows and whys of defining attribute relationships later in this chapter. In Figure 5-19, you can see the Dimension Structure tab in the dimension designer in BIDS. There you can see all columns from the source tables in the Data Source View section. In the Attributes section, you can see all attributes that have been mapped to the dimension. In the Hierarchies section in the center of the screen, you can see all defined hierarchies.
Figure 5-19 The Dimension Structure tab allows you to edit dimension information.
The most important initial OLAP dimension modeling consideration is to make every attempt to denormalize all source data related to a particular entity, which means that each dimension’s source data is put into a single table. Typically, these tables are very wide—that is, they have many columns—and are not especially deep—that is, they don’t have many rows. An example of a denormalized source structure is a product dimension. Your company might sell only a couple types of products; however, you might retain many attributes about each product—for example, package size, package color, introduction date, and so on. There can be exceptions to the general “wide but not deep” modeling rule. A common exception is the structure of the table for the customers dimension. If it’s a business requirement to capture all customers for all time, and if your organization services a huge customer base, your customer dimension source table might have millions of rows. One significant architectural enhancement in Analysis Services 2008 is noteworthy here. That is, in this version, only dimension members requested by a client tool are loaded into memory. This behavior is different from SSAS 2000, where upon startup all members of all dimensions were loaded into memory. This enhancement allows you to be inclusive in the design of dimensions—that is, you can assume that more (attributes) is usually better.
After you’ve created the appropriate source dimension table or tables and populated them with data, SSAS retrieves information from these tables during cube and dimension processing. Then SSAS uses a SELECT DISTINCT command to retrieve members from each column. When using the Cube Wizard in SSAS 2005, SSAS attempted to locate natural hierarchies in the data. If natural hierarchies were found, BIDS would add them to the dimensions. This feature has been removed in SSAS 2008. The reason for this is that there are more steps required to build optimized hierarchies. We’ll discuss this in detail in Chapter 6, “Understanding SSAS in SSMS and SQL Server Profiler.” Also new to SSAS 2008 is a tab in the dimension designer, the Attribute Relationships tab, which helps you to better visualize the attribute relationships in dimensional hierarchies, as shown in Figure 5-20.
Figure 5-20 The Attribute Relationships tab allows you to visualize dimensional hierarchies.
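As noted above, SSAS populates each attribute by issuing SELECT DISTINCT-style queries against the dimension source columns during processing. The following rough illustration (our own table and column names) shows why the key columns you model for an attribute matter:

-- Roughly what SSAS issues for a City attribute during dimension processing:
SELECT DISTINCT City FROM dbo.DimGeography;

-- Problem: two different cities named 'Springfield' (in different states) collapse
-- into a single member. Keying the attribute on a composite of columns avoids this:
SELECT DISTINCT City, StateProvince FROM dbo.DimGeography;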
There are a couple of other features in SSAS dimensions you should consider when you are in the modeling phase of your BI project. These features include the ability to set the default displayed member for each attribute in a dimension, the ability to convert nulls to a value—usually Unknown or 0 (zero)—and the ability to allow duplicate names to be displayed. You
should model for these features based on business requirements, so you should capture the business requirements for defaults, unknowns, and duplicates for each dimension and all of its attributes during the modeling phase of your project. When we cover dimension and cube creation later in this book, we’ll take a more detailed look at all of these considerations—at this point, we are just covering topics that relate to modeling data. In that vein, there is one remaining important topic—the rate of change of dimensional information. It’s expected that fact table information will change over time. By this we mean more rows are normally added to the fact tables. In the case of dimension information, you must understand the business requirements that relate to new dimension data and model correctly for them. Understanding slowly changing dimensions is key, so we’ll take on this topic next.
Slowly Changing Dimensions

To understand slowly changing dimensions (SCD), you must understand what constitutes a change to dimensional data. What you are asking is what the desired outcome is when dimension member information is updated or deleted. In OLAP modeling, adding (inserting) new rows into dimension tables is not considered a change. The only two cases you must be concerned with for modeling are situations where you must account for updates and deletes to rows in dimension tables. The first question to ask of your subject matter experts is this one: "What needs to happen when rows in the dimension source tables no longer have fact data associated with them?" In our experience, most clients prefer to have dimension members marked as Not Active at this point, rather than as Deleted. In some cases, it has been a business requirement to add the date of deactivation as well. A business example of this is a customer who returned a purchase and has effectively made no purchases.

The next case to consider is the business requirements for updates to dimension table row values. A common business scenario element is names (of customers, employees, and so on). The question to ask here is this one: "What needs to happen when an employee changes her name (for example, when she gets married)?" There are four different possibilities in modeling, depending on the answer to the preceding question. You should verify the business requirements for how requested changes should be processed for each level of dimension data. For example, when using a Geography dimension, verify the countries, states, counties, and cities data. Here are the possibilities:

■■ No changes. Any change is an error. Always retain original information.
■■ Last change wins. Overwrite. Do not retain original information.

■■ Retain some history. Retain a fixed number of previous values.

■■ Retain all history. Retain all previous values.
Using the Geography dimension information example, typical requirements are that no changes will be allowed for country, state, and county. However, cities might change names, so “Retain all history” could be the requirement. You should plan to document the change requirements for each level in each dimension during the design stage of your project. Because dimension change behavior is so important to appropriate modeling and OLAP cube building, dimension change behavior types have actually been assigned numbers for the different possible behaviors. You’ll see this in the SSIS interface when you are building packages to maintain the dimension source tables.
Types of Slowly Changing Dimensions

The requirements for dimension row data change processing (for example, overwrite, no changes allowed, and so on) are the basis for you to model the source tables using standard slowly changing dimension type behavior modeling. SCD modeling consists of a couple of things. First, to accommodate some of these behaviors, you'll need to include additional columns in your source dimension tables. Second, you'll see the terms and their associated numbers listed in the Slowly Changing Dimension data flow component in the SSIS package designer. This data flow component contains a Slowly Changing Dimension Wizard, which is quite easy to use if you've modeled the source data with the correct number and type of columns. Although initially you'll probably populate your OLAP dimensions manually from source tables, as you move from development into production, you'll want to automate this process. Using SSIS packages to do this is a natural fit. After you have reviewed the requirements and noted that some dimension rows will allow changes, you should translate those requirements to one of the solution types in the following list. We'll use the example of a customer name attribute when describing each of these ways of modeling dimensional attributes so that you can better understand the contrasts between them.

■■ Changing Attribute Type 1 means overwriting previous dimension member values, which is sometimes also called "last change wins." This type is called a Changing Attribute in the SSIS Slowly Changing Dimension Wizard. You do not need to make any change to source dimension tables for this requirement. You might want to include a column to record date of entry, however. An example of this would be a customer name where you needed only to see the most current name, such as a customer's married name rather than her maiden name.
■■ Historical Attribute Type 2 means adding a single new record (or row value) when the dimension member value changes. This type is called a Historical Attribute in the SSIS Slowly Changing Dimension Wizard. You add at least one new column in your source dimension table to hold this information—that is, the previous name. You might also be required to track the date of the name change and add another column to store the start date of a new name. (A table sketch for this type follows the list.)
■■ Add Multiple Attributes Type 3 means adding multiple attributes (or column values) when the dimension member value changes. Sometimes there is a fixed requirement. For example, when we worked with a government welfare agency, it was required by law to keep 14 surnames. At other times, the requirement was to keep all possible values of a surname. In general, Type 3 is not supported in the SSIS Slowly Changing Dimension Wizard. This doesn't mean, of course, that you shouldn't use it in your dimensional modeling. It just means that you'll have to do a bit more work to build SSIS dimension update packages. If maintaining all history on an attribute such as a surname is your requirement, you need to establish how many values you need to retain and build either columns or a separate (related) table to hold this information. In this example, you'd add at least three columns: a CurrentFlag, an EffectiveStart date, and an EffectiveEnd date. This type of modeling can become quite complex, and it should be avoided if possible.
■■ Fixed This type means that no changes are allowed. Any requested change is treated as an error and throws a run-time exception when the SSIS package runs. No particular change to source data is required.
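The following is a hypothetical sketch of the extra columns a Type 2 (Historical Attribute) design adds to a dimension source table; the column names are ours, so match them to your own conventions.

-- An employee dimension prepared for Type 2 (Historical Attribute) changes:
-- each name change inserts a new row rather than overwriting the old one
CREATE TABLE dbo.DimEmployee (
    EmployeeKey          INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- new surrogate key per version
    EmployeeAlternateKey INT NOT NULL,        -- original source key, repeated across versions
    EmployeeName         NVARCHAR(100) NOT NULL,
    EffectiveStartDate   DATETIME NOT NULL,
    EffectiveEndDate     DATETIME NULL,       -- NULL for the current row
    CurrentFlag          BIT NOT NULL DEFAULT (1)
);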
Figure 5-21 shows the Slowly Changing Dimension Wizard that runs when you configure the Slowly Changing Dimension data flow component while building an SSIS package in BIDS. Note that you set the change behavior for each column when you configure this wizard. Also note that, although the default on the next page of this wizard is Fail The Transformation If Changes Are Detected In A Fixed Attribute, you can turn this default off. An interesting option in the SSIS SCD task is the ability to cascade changes to multiple related attributes (which is turned off by default) by selecting the Change All The Matching Records When Changes Are Detected In A Changing Attribute option. This last option is available to support highly complex modeling scenarios.
Rapidly Changing Dimensions

In rapidly changing dimensions, the member values change constantly. An example of this is employee information for a fast-food restaurant chain, where staff turnover is very high. This type of data should be a very small subset of your dimensional data. To work with this type of dimension, you'll probably vary the storage location, rather than implementing any particular design in the OLAP model itself. Rapidly changing dimensional data storage models are covered in more detail in Chapter 9, "Processing Cubes and Dimensions."
Figure 5-21 The Slowly Changing Dimension Wizard in SSIS speeds maintenance package creation for SCDs.
Writeback

Another advanced capability of dimensions is writeback. Writeback is the ability for authorized users to directly update the data members in a dimension using their connected client tools. The client tools must also support writeback, and not all of them do. So, if dimension member writeback is a business requirement, you must also verify that any client tool you intend to make available for this purpose actually supports it. Dimension writeback changes can include any combination of inserts, updates, or deletes. You can configure which types of changes are permitted via writeback as well. In our experience, only a very small number of business scenarios warrant enabling writeback for particular cube dimensions. If you are considering enabling writeback, do verify that it's acceptable given any regulatory requirements (SOX, HIPAA, and so on) in your particular business environment.

There are some restrictions to consider if you want to enable writeback. The first restriction applies to the modeling of the dimension data, and that is why we are including this topic here. The restriction is that the dimension must be based on a single source table (meaning it must use a star schema modeling format). The second restriction is that writeback dimensions are supported only in the Enterprise edition of Analysis Services (which is available only as part of the Enterprise edition of SQL Server 2008). Finally, writeback security must be specifically enabled at the user level. We'll cover this in more detail in Chapter 9.
Tip To review features available by edition for SSAS 2008, go to http://download.microsoft.com/download/2/d/f/2df66c0c-fff2-4f2e-b739-bf4581cee533/SQLServer%202008CompareEnterpriseStandard.pdf.
Understanding Fact (Measure) Modeling

A core part of your BI solution is the set of business facts you choose to include in your OLAP cubes. As mentioned previously, source facts become measures after they are loaded into OLAP cubes. Measures are the key metrics by which you measure the success of your business. Some examples include daily sales amount, product sales quantity, net profit, and so on. It should be clear by now that the selection of the appropriate facts is a critical consideration in your model. As discussed previously, the creation of validated grain statements is the foundation of appropriate fact table modeling.

The tricky part of modeling fact source tables is twofold. First, this type of "key plus fact" structure doesn't usually exist in OLTP source data, and second, source data often contains invalid data. For this reason, most of our customers choose to use ETL processes to validate, clean, combine, and then load source data into materialized star schema fact tables as a basis for loading OLAP cubes. We've worked with more than one customer who wanted to skip this step, saying, "The source data is quite clean and ready to go." Upon investigation, the customer found that this wasn't the case. It's theoretically possible to use relational views directly against OLTP source data to create virtual fact tables; however, we rarely implement this approach for the reasons listed previously.

Another consideration is timing. We purposely discussed dimension table modeling before fact table modeling because of the key generation sequence. If you follow classic dimensional modeling, you'll generate new and unique keys for each dimension table row when you populate it from source data. You'll then use these (primary) keys from the dimension source tables as (foreign) keys in the fact table. Obviously, the dimension tables must load successfully prior to loading the fact table in this scenario. (A fact-table load sketch follows below.)

In addition to (foreign) keys, the fact tables will also contain fact data. In most cases, these fact data columns use numeric data types. Also, this data will most often be aggregated by summing the facts across all levels of all dimensions. There are, however, exceptions to this rule. A typical case is a cube that captures sales activity. If you take a look at the measure aggregates in the AdventureWorksDW2008 sample, you'll see that most measures simply use the SUM aggregate. However, there are exceptions. For example, the Internet Transaction Count measure uses the COUNT aggregate.

BIDS also includes the ability to group similar measures together. Measure groups are shown in BIDS using folders that have the same name as the source fact tables. In fact, it is common for the contents of one measure group to be identical to one source fact table. These fact tables originate from source star schemas. You can also see the data type and aggregation type for each measure in BIDS.
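Returning to the key-generation sequence just described: because the dimension tables are loaded first, the fact-table load can look up each surrogate key by joining on the original (business) keys. A hedged ETL sketch follows, with hypothetical staging and dimension table names.

-- Load the fact table after the dimensions, translating business keys from the
-- staged source rows into the dimensions' surrogate keys
INSERT INTO dbo.FactSales (CustomerKey, ProductKey, DateKey, SalesAmount, SalesQuantity)
SELECT
    dc.CustomerKey,                                      -- surrogate key looked up from the dimension
    dp.ProductKey,
    CONVERT(INT, CONVERT(CHAR(8), s.OrderDate, 112)),    -- yyyymmdd-style date key
    s.SalesAmount,
    s.SalesQuantity
FROM staging.Sales AS s
JOIN dbo.DimCustomer AS dc ON dc.CustomerAlternateKey = s.CustomerID
JOIN dbo.DimProduct  AS dp ON dp.ProductAlternateKey  = s.ProductID;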
The ability to combine multiple fact tables into a single OLAP cube was introduced in SSAS 2005 and is a powerful feature for simplifying views of large amounts of business data. We'll talk about the implementation of this feature in more detail in Chapter 7.

The built-in aggregation types available for use in SSAS are listed next. When you are modeling fact tables, you should determine which type of aggregation behavior meets your business requirements. Be aware that you are not limited to the built-in aggregation types. You can also create any number of custom aggregations via MDX statements. We'll explain how to do this in Chapter 11, "Advanced MDX." Note in Table 5-1 that the list of built-in aggregate functions contains type information for each aggregate function. This is a descriptor of the aggregation behavior. Additive means the measure rolls up to one ultimate total. Semi-additive means the measure rolls up to a total for one or more particular, designated levels, but not to a cumulative total for all levels. A common use of semi-additive behavior is in the time dimension. It is a common requirement to roll up to the Year level, but not to the All Time level—that is, to be able to see measures summarized by a particular year but not to show a measure's aggregate value rolled up across all years to a grand total. Non-additive means the value does not roll up—that is, the value displayed in the OLAP cube is only that particular (original) value. Also, semi-additive measures require the Enterprise edition of SSAS.

TABLE 5-1 List of Built-in Aggregations and Type Information
Aggregation              Type
Sum                      Additive
Count                    Additive
Min, Max                 Semi-additive
FirstChild, LastChild    Semi-additive
AverageOfChildren        Semi-additive
First(Last)NonEmpty      Semi-additive
ByAccount                Semi-additive
Distinct Count           Non-additive
Note ByAccount aggregation is a type of aggregation that calculates according to the aggregation function assigned to the account type for a member in an account dimension. An account dimension is simply a dimension that is derived from a single relational table with an account column. The data value in this column is used by SSAS to map the types of accounts to well-known account types (for example, Assets, Balances, and so on) so that you can replicate the functionality of a balance sheet in your cube. SSAS uses these mappings to apply the appropriate aggregation functions to the accounts. If no account type dimension exists in the measure group, ByAccount is treated as the None aggregation function. This aggregation is typically used if a portion of your cube is being used as a balance sheet.
Calculated vs. Derived Measures

A final consideration for measures is that you can elect to derive measure values when loading data into the cube from source fact tables. This type of measure is called a derived measure because it's "derived," or created, when the cube is loaded rather than simply retrieved using a SELECT statement from the source fact tables. Derived measures are created via a statement (Transact-SQL for SQL Server) that is understood by the source database. We do not advocate using derived measures because the overhead of creating them slows cube processing times. Rather than incurring that overhead during cube loads, an alternative is to calculate and store the measure value during the ETL process that loads the (relational) fact table rather than the SSAS cube. This approach assumes you've chosen to materialize the star schema source data. By materialize, we mean that you have made a physical copy of your source data in some intermediate storage location, usually SQL Server, as opposed to simply creating a logical representation (or view) of the source data. That way, the value can simply be retrieved (rather than calculated) during the cube load process.

In addition to derived measures, SSAS supports calculated measures. Calculated measure values are calculated at query time by SSAS, based on queries that you write against the OLAP cube data. These queries are written in MDX, the language required for querying SSAS cubes. If you opt for this approach, no particular modeling changes are needed. We'll review the process for creating calculated measures in Chapter 8.
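For example, rather than defining a derived measure that the source database must compute every time the cube is processed, the same value can be computed once during the relational ETL load. A minimal sketch follows, assuming hypothetical staging and fact table names and columns.

-- Compute and store GrossProfit while loading the relational fact table,
-- so the cube load simply reads the stored column (no derived measure needed)
INSERT INTO dbo.FactInternetSales (ProductKey, DateKey, SalesAmount, TotalProductCost, GrossProfit)
SELECT
    dp.ProductKey,
    CONVERT(INT, CONVERT(CHAR(8), s.OrderDate, 112)),
    s.SalesAmount,
    s.TotalProductCost,
    s.SalesAmount - s.TotalProductCost   -- calculated once here, not at cube processing or query time
FROM staging.InternetSales AS s
JOIN dbo.DimProduct AS dp ON dp.ProductAlternateKey = s.ProductID;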
Other Considerations in BI Modeling

SSAS 2008 supports additional capabilities that might affect the final modeling of your cube source schemas. In our experience, the majority of you will likely begin your modeling and design process by using the concepts presented in this chapter. You'll then load some sample data into prototype OLAP cubes to validate both the data and the modeling concepts. You'll then iterate, refining the cube design by implementing some of the more advanced capabilities of SSAS and by continuing to add data that has been validated and cleansed.
Data Mining

Data mining capabilities are greatly enhanced in SSAS 2008 compared to what was available in earlier editions of SSAS. There are now many more sophisticated algorithms. These algorithms have been optimized, and client tools, such as Excel, have been enhanced to support these improvements. A quick definition of data mining is the ability to use the included algorithms to detect patterns in the data. For this reason, data mining technologies are sometimes also called predictive analytics. Interestingly, SSAS's data mining capabilities can be used
with either OLTP or OLAP source data. We’ll cover data mining modeling and implementation in greater detail in Chapters 12 and 13.
Key Performance Indicators (KPIs)

The ability to create key performance indicators from inside SSAS 2008 cubes is a much-requested feature. A simple definition of a KPI is a method of showing one or more key business metrics, usually displayed visually in an end-user tool. For each metric (such as daily sales), the current state or value, a comparison to an overall goal, a trend over time (positive, neutral, or negative), and other information can be shown. KPIs are usually displayed via graphics—that is, red, yellow, or green traffic lights; up arrows or down arrows; and so on. KPIs are often part of dashboards or scorecards in client interfaces. SSAS OLAP cubes include built-in tools to facilitate the quick and easy creation of KPIs. We'll discuss the planning and implementation of SSAS KPIs in Chapter 8.
Actions, Perspectives, and Translations

SSAS actions give end users the ability to right-click a cell of the cube (using client tools that support SSAS actions) and to perform some type of defined action, such as passing the value of the selected cell into an external application as a parameter value and then launching that application. They are not new to this release; however, there are new types of actions available. Perspectives are similar to relational views. They allow you to create named subsets of your cube data for the convenience of your end users. They require the Enterprise edition of SSAS 2008. Translations give you a quick and easy way to present localized cube metadata to end users. All of these capabilities will be covered in more detail in Chapter 8.
Source Control and Other Documentation Standards

By the time it's in the OLAP modeling phase, your BI project will contain many files of many different types. While you are in the modeling phase, the files will probably consist mostly of Visio diagrams, Excel spreadsheets, and Word documents. It's important to establish a methodology for versioning and source control early in your project. When you move to the prototyping and developing phase, the number and types of files will increase exponentially. You can use any tool that works for you and your team. Some possible choices include Visual SourceSafe, Visual Studio Team System, SharePoint document libraries, and versioning via Office. The important point is that you must establish a system that all of your BI team members are committed to using early in your BI project life cycle. Also, it's important to use the right tool for the right job—for example, SharePoint document libraries are designed to support versioning of requirements documents (which are typically written using Word, Excel,
and so on), while Visual SourceSafe is designed to support source control for OLAP code files, which you'll create later in your project's life cycle. Another important consideration is naming conventions. Unlike OLTP (or relational) database design, there are few common naming standards in the world of OLAP design. We suggest that you author, publish, and distribute written naming guidelines to all members of your BI team during the requirements-gathering phase of your project. These naming guidelines should include suggested formats for the following items at a minimum: cubes, dimensions, levels, attributes, star schema fact and dimension tables, SSIS packages, SSRS reports, and Office SharePoint Server 2007 (SPS) pages and dashboards.
Summary In this chapter, we covered the basic modeling concepts and techniques for OLAP cubes in a BI project. We discussed the idea of using grain statements for a high-level validation of your modeling work. You learned how best to determine what types of dimensions (fixed, slowly changing, or rapidly changing) and facts (stored, calculated, or derived) will be the basis for your cubes. We also discussed the concept of hierarchies of dimensional information. If you are new to BI, and are thinking that you’ve got some unlearning to do, you are not alone. We hear this comment quite frequently from our clients. OLAP modeling is not at all like OLTP modeling, mostly because of the all-prevalent concept in OLAP of deliberate denormalization. In the next chapter, we’ll show you tools and techniques to move your design from idea to reality. There we’ll dive into the SSAS interface in BIDS and get you started building your first OLAP cube.
Part II
Microsoft SQL Server 2008 Analysis Services for Developers
Chapter 6
Understanding SSAS in SSMS and SQL Server Profiler We covered quite a bit of ground in the preceding five chapters—everything from the key concepts of business intelligence to concepts, languages, processes, and modeling methodologies. By now, we’re sure you’re quite ready to roll up your sleeves and get to work in the Business Intelligence Development Studio (BIDS) interface. Before you start, though, we have one more chapter’s worth of information. In this chapter, we’ll explore the ins and outs of some tools you’ll use when working with Microsoft SQL Server Analysis Services (SSAS) objects. After reading this chapter, you’ll be an expert in not only SQL Server Profiler but also SQL Server Management Studio (SSMS) and other tools that will make your work in SSAS very productive. We aim to provide valuable information in this chapter for all SSAS developers— from those of you who are new to the tools to those who have some experience. If you’re wondering when we’re going to get around to discussing BIDS, we’ll start that in Chapter 7, “Designing OLAP Cubes Using BIDS.”
Core Tools in SQL Server Analysis Services We'll begin this discussion of the tools you'll use to design, create, populate, secure, and manage OLAP cubes by taking an inventory. That is, we'll first list all the tools that are part of SQL Server 2008. Later in this chapter, we'll look at useful utilities and other tools you can get for free or at low cost that are not included with SQL Server 2008. We mention these because we've found them to be useful for production work. Before we list our inventory, we'll talk a bit about the target audience—that is, who Microsoft thinks will use these tools. We share this information so that you can choose the tools that best fit your style, background, and expectations. SQL Server 2008 does not install SSAS by default. When you install SSAS, several tools are installed with the SSAS engine and data storage mechanisms. Also, an SSAS installation does not require that you install SQL Server Database Engine Services. You'll probably want to install SQL Server Database Engine Services, however, because some of the tools that install with it are useful with SSAS cubes. SQL Server 2008 installation follows the minimum-installation paradigm, so you'll probably want to verify which components you've installed before exploring the tools for SSAS. To come up with this inventory list, follow these steps:
1. Run Setup.exe from the installation media.
2. On the left side of the Installation Center screen, click Installation and then select New SQL Server Stand-Alone Installation Or Add Features To An Existing Installation on the resulting screen.
3. Click OK after the system checks complete, and click the Install button on the next screen.
4. Click Next on the resulting screen. Then select the Add Features To An Existing Instance Of SQL Server 2008 option, and select the appropriate instance of SQL Server from the list. After you select the particular instance you want to verify, click Next to review the features that have been installed for this instance.
Who Is an SSAS Developer? Understanding the answer to this question will help you to understand the broad variety of tools available with SSAS OLAP cubes in SQL Server 2008. Unlike classic .NET developers, SSAS developers are people with a broad and varied skill set. Microsoft has provided some tools for developers who are comfortable writing code and other tools for developers who are not. In fact, the bias in SSAS development favors those who prefer to create objects by using a graphical user interface rather than by writing code. We mention this specifically because we’ve seen traditional application developers, who are looking for a code-first approach, become frustrated with the developer interface Microsoft has provided. In addition to providing a rich graphical user interface in most of the tools, Microsoft has also included wizards to further expedite the most commonly performed tasks. In our experience, application developers who are familiar with a code-first environment often fail to take the time to understand and explore the development environments available for SSAS. This results in frustration on the part of the developers and lost productivity on business intelligence (BI) projects. We’ll start by presenting information at a level that assumes you’ve never worked with any version of Microsoft Visual Studio or Enterprise Manager (or Management Studio) before. Even if you have experience in one or both of these environments, you might still want to read this section. Our goal is to maximize your productivity by sharing our tips, best practices, and lessons learned.
Note in Figure 6-1 that some components are shared to the SQL Server 2008 instance, but others install only when a particular component is installed. As mentioned in previous chapters, SQL Server 2008 no longer ships with sample databases. If you want to install the AdventureWorks OLTP and OLAP samples, you must download them from CodePlex. For instructions on where to locate these samples and how to install them, see Chapter 1, “Business Intelligence Basics.”
Figure 6-1 Installed features are shown on the Select Features page.
After you've verified that everything you expected to be installed is actually installed in the particular instance of SSAS, you're ready to start working with the tools. A variety of tools are included; however, you'll do the majority of your design and development work in just one tool—BIDS. For completeness, this is the list of tools installed with the various SQL Server components:
■■ Import/Export Wizard  Used to import/export data and to perform simple transformations
■■ Business Intelligence Development Studio  Primary SSAS, SSIS, and SSRS development environment
■■ SQL Server Management Studio  Primary SQL Server (all components) administrative environment
■■ SSAS Deployment Wizard  Used to deploy SSAS metadata (*.asdatabase) from one server to another
■■ SSRS Configuration Manager  Used to configure SSRS
■■ SQL Server Configuration Manager  Used to configure SQL Server components, including SSAS
■■ SQL Server Error and Usage Reporting  Used to configure error/usage reporting—that is, to specify whether or not to send a report to Microsoft
■■ SQL Server Installation Center  New information center, as shown in Figure 6-2, which includes hardware and software requirements, baseline security, and installed features and samples
■■ SQL Server Books Online  Product documentation
■■ Database Engine Tuning Advisor  Used to provide performance tuning recommendations for SQL Server databases
■■ SQL Server Profiler  Used to capture activity running on SQL Server components, including SSAS
■■ Report Builder 2.0  Used by nondevelopers to design SSRS reports. This is available as a separate download and is not on the SQL Server installation media.
Figure 6-2 The SQL Server Installation Center
A number of GUI tools are available with a complete SQL Server 2008 installation. By complete, we mean that all components of SQL Server 2008 are installed on the machine. A full installation is not required, or even recommended, for production environments. The best practice for production environments is to create a reduced attack surface by installing only the components and tools needed to satisfy the business requirements. In addition, you should secure access to powerful tools with appropriate security measures. We’ll talk more about security in general later in this chapter, and we’ll describe best practices for locking
down tools in production environments. For now, we'll stay in an exploratory mode by installing everything and accessing it with administrator privileges to understand the capabilities of the various tools. Because BIDS looks like the Visual Studio interface, people often ask us if an SSAS instance installation requires a full Visual Studio install. The answer is no. If Visual Studio is not installed on a machine with SSAS, BIDS, which is a subset of Visual Studio, installs. If Visual Studio is installed, the BIDS project templates install inside of the Visual Studio instance on that machine. The core tools you'll use for development of OLAP cubes and data mining structures in SSAS are BIDS, SSMS, and SQL Server Profiler. Before we take a closer look at these GUI tools, we'll mention a couple of command-line tools that are available to you as well. In addition to the GUI tools, several command-line tools are installed when you install SQL Server 2008 SSAS. You can also download additional free tools from CodePlex. One tool available on the CodePlex site is called BIDS Helper, which you can find at http://www.codeplex.com/bidshelper. It includes many useful features for SSAS development. You can find other useful tools on CodePlex as well. We'll list only a couple of the tools that we've used in our projects:
■■ ascmd.exe  Allows you to run XMLA, MDX, or DMX scripts from the command prompt (available at http://www.codeplex.com/MSFTASProdSamples)
■■ SQLPS.exe  Allows you to execute Transact-SQL via the Windows PowerShell command line—mostly used when managing SQL Server source data for BI projects.
As we mentioned, you'll also want to continue to monitor CodePlex for new community-driven tools and samples. Contributors to CodePlex include both Microsoft employees and non-Microsoft contributors.
Baseline Service Configuration Now that we’ve seen the list of tools, we’ll take a look at the configuration of SSAS. The simplest way to do this is to use SQL Server Configuration Manager. In Figure 6-3, you can see that on our demo machine, the installed instance of SSAS is named MSSQLSERVER and its current state is set to Running. You can also see that the Start Mode is set to Automatic. The service log on account is set to NT AUTHORITY\Network Service. Of course, your settings may vary from our defaults. Although you can also see this information using the Control Panel Services item, it’s recommended that you view and change any of this information using the SQL Server Configuration Manager. The reason for this is that the latter tool properly changes associated registry settings when changes to the service configuration are made. This association is not necessarily observed if configuration changes are made using the Control Panel Services item.
Figure 6-3 SQL Server Configuration Manager
The most important setting for the SSAS service itself is the Log On (or service) account. You have two choices for this setting: you can select one of three built-in accounts (Local System, Local Service, or Network Service), or you can use an account that has been created specifically for this purpose, either locally or on your domain. Figure 6-4 shows the dialog box in SQL Server Configuration Manager where you set this. Which one of these choices is best and why? Our answer depends on which environment you're working in. If you're exploring or setting up a development machine in an isolated domain, or as a stand-alone server, you can use any account. As we show in Figures 6-3 and 6-4, we usually just use a local account that has been added to the local administrator's group for this purpose. We do remind you that this circumvention of security is appropriate only for nonproduction environments, however.
Figure 6-4 The SSAS service account uses a Log On account.
SQL Server Books Online contains lots of information about log-on accounts. You’ll want to review the topics “Setting Up Windows Service Accounts” and “Choosing the Service
Account” for details on exactly which permissions and rights are needed for your particular service (user) account. We’ll distill the SQL Server Books Online information down a bit because, in practice, we’ve seen only two configurations. Most typically, we see our clients use either a local or domain lower-privileged (similar to a regular user) account. Be aware that for SSAS-only installations, the ability to use a domain user account as the SSAS logon account is disabled. One important consideration specific to SSAS is that the service logon account information is used to encrypt SSAS connection strings and passwords. This is a further reason to use an isolated, monitored, unique, low-privileged account.
Service Principal Names What is a Service Principal Name (SPN)? An SPN is a type of Active Directory record that associates a service instance with its logon account. When you associate a service account with SSAS at the time of installation, an SPN record is created. If your SSAS server is part of a domain, this record is stored in your domain's Active Directory database. It's required for some authentication scenarios (with particular client tools). If you change the service account for SSAS, you must delete the original SPN and register a new one. You can do this with the setSPN.exe tool available from the Windows Server Resource Kit. Here's further guidance from SQL Server Books Online:

"Service SIDs are available in SQL Server 2008 on Windows Server 2008 and Windows Vista operating systems to allow service isolation. Service isolation provides services a way to access specific objects without having to either run in a high-privilege account or weaken the object's security protection. A SQL Server service can use this identity to restrict access to its resources by other services or applications. Use of service SIDs also removes the requirement of managing different domain accounts for various SQL Server services. A service isolates an object for its exclusive use by securing the resource with an access control entry that contains a service security ID (SID). This ID, referred to as a per-service SID, is derived from the service name and is unique to that service. After a SID has been assigned to a service, the service owner can modify the access control list for an object to allow access to the SID. For example, a registry key in HKEY_LOCAL_MACHINE\SOFTWARE would normally be accessible only to services with administrative privileges. By adding the per-service SID to the key's ACL, the service can run in a lower-privilege account, but still have access to the key."

Now that you've verified your SSAS installation and checked to make sure the service was configured correctly and is currently running, it's time to look at some of the tools you'll use to work with OLAP objects. For illustration, we've installed the AdventureWorks DW2008 sample OLAP project found on CodePlex, because we believe it's more meaningful to explore
the various developer surfaces with information already in them. In the next chapter, we’ll build a cube from start to finish. So if you’re already familiar with SSMS and SQL Server Profiler, you might want to skip directly to that chapter.
SSAS in SSMS Although we don’t believe that the primary audience of this book is administrators, we do choose to begin our deep dive into the GUI tools with SSMS. SQL Server Management Studio is an administrative tool for SQL Server relational databases, SSAS OLAP cubes and data mining models, SSIS packages, SSRS reports, and SQL Server Compact edition data. The reason we begin here is that we’ve found the line between SSAS developer and administrator to be quite blurry. Because of a general lack of knowledge about SSAS, we’ve seen many an SSAS developer being asked to perform administrative tasks for the OLAP cubes or data mining structures that have been developed. Figure 6-5 shows the connection dialog box for SSMS.
Figure 6-5 SSMS is the unified administrative tool for all SQL Server 2008 components.
After you connect to SSAS in SSMS, you are presented with a tree-like view of all SSAS objects. The top-level object is the server, and databases are next. Figure 6-6 shows this tree view in Object Explorer. An OLAP database object is quite different than a relational database object, which is kept in SQL Server’s RDBMS storage. Rather than having relational tables, views, and stored procedures, an OLAP database consists of data sources, data source views, cubes, dimensions, mining structures, roles, and assemblies. All of these core object types are represented by folders in the Object Explorer tree view. These folders can contain child objects as well, as shown in Figure 6-6 in the Measure Groups folder that appears under a cube in the Cubes folder. So what are all of these objects? Some should be familiar to you based on our previous discussions of OLAP concepts, including cubes, dimensions, and mining structures. These are the basic storage units for SSAS data. You can think of them as somewhat analogous to relational tables and views in that respect, although structurally, OLAP objects are not relational but multidimensional.
Figure 6-6 Object Explorer lists all SSAS objects in a tree-like structure.
Data sources represent connections to source data. We’ll be exploring them in more detail in this chapter and the next one. Data source views are conceptually similar to relational views in that they represent a view of the data from one or more defined data sources in the project. Roles are security groups for SSAS objects. Assemblies are .NET types to be used in your SSAS project—that is, they have been written in a .NET language and compiled as .dlls. The next area to explore in SSMS is the menus. Figure 6-7 shows both the menu and standard toolbar. Note that the standard toolbar displays query types for all possible components—that is, relational (Transact-SQL) components, multidimensional OLAP cubes (MDX), data mining structures (DMX), administrative metadata for OLAP objects (XMLA), and SQL Server Compact edition.
Figure 6-7 The SSMS standard toolbar displays query options for all possible SQL Server 2008 components.
It’s important that you remember the purpose of SSMS—administration. When you think about this, the fact that it’s straightforward to view, query, and configure SSAS objects—but more complex to create them—is understandable. You primarily use BIDS to create OLAP objects. Because this is a GUI environment, you’re also provided with guidance should you
want to examine or query objects. Another consideration is that SSMS is not an end-user tool. Even though the viewers are sophisticated, SSMS is designed for SSAS administrators.
How Do I View OLAP Objects? SSMS includes many object viewers. You'll see these same viewers built into other tools designed to work with SSAS, such as BIDS. You'll also find versions of these viewers built into client tools, such as Microsoft Office Excel 2007. The simplest and fastest way to explore cubes and mining models in SSMS is to locate the object in the tree view and then to right-click on it. For cubes, dimensions, and mining structures, the first item on the shortcut menu is Browse. We'll begin our exploration with the Product dimension. Figure 6-8 shows the results of browsing the Product dimension. For each dimension, we have the ability to drill down to see the member names at the defined levels—in this case, at the category, subcategory, and individual item levels. In addition to being able to view the details of the particular dimensional (rollup) hierarchy, we can also select a localization (language) and member properties that might be associated with one or more levels of a dimension. In our example, we have elected to include color and list price in our view for the AWC Logo Cap clothing item. These member properties have been associated with the item (bottom) level of the product dimension.
Figure 6-8 The dimension browser enables you to view the data in a dimension.
The viewing options available for dimensions in SSMS include the ability to filter and implement dimension writeback. Writeback has to be enabled on the particular dimension, and the connected user needs to have dimension writeback permission to be permitted to use this action in SSMS. In addition to being able to view the dimension information, you can also see some of the metadata properties by clicking Properties on the shortcut menu. Be aware that you’re viewing a small subset of structural properties in SSMS. As you would expect, these properties are related to administrative tasks associated with a dimension. Figure 6-9 shows the general dialog box of the Product dimension property page. Note that the only setting you can change in this view is Processing Mode. We’ll examine the various processing modes for dimensions and the implications of using particular selections in Chapter 9, “Processing Cubes and Dimensions.”
Figure 6-9 The Dimension Properties dialog box in SSMS shows administrative properties associated with a dimension.
In fact, you can process OLAP dimensions, cubes, and mining structures in SSMS. You do this by right-clicking on the object and then choosing Process on the shortcut menu. Because this topic requires more explanation, we’ll cover it in Chapter 9. Suffice it to say at this point that, from a high level, SSAS object processing is the process of copying data from source locations into destination containers and performing various associated processing actions on this data as part of the loading process. As you might expect, these processes can be complex and require that you have an advanced understanding of the SSAS objects before you try to implement the objects and tune them. For this reason, we’ll explore processing in Part II of this book. If you’re getting curious about the rest of the metadata associated with a dimension, you can view this information in SSMS as well. This task is accomplished by clicking on the shortcut menu option Script Dimension As, choosing Create To, and selecting New Query Editor Window. The results are produced as pure XMLA script. You’ll recall from earlier in the book that XMLA is a dialect of XML.
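As a rough illustration, here is the general shape of such a script, abbreviated and simplified from what SSMS actually emits (the database and object IDs shown assume the Adventure Works sample and are illustrative only):

<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>Adventure Works DW 2008</DatabaseID>
  </ParentObject>
  <ObjectDefinition>
    <Dimension>
      <ID>Dim Product</ID>
      <Name>Product</Name>
      <Attributes>
        <Attribute>
          <ID>Product Name</ID>
          <Name>Product Name</Name>
          <!-- key column bindings omitted -->
        </Attribute>
        <!-- remaining attributes, hierarchies, and translations omitted -->
      </Attributes>
    </Dimension>
  </ObjectDefinition>
</Create>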
What you’re looking at is a portion of the XMLA script that is used to define the structure of the dimension. Although you can use Notepad to create SSAS objects, because they are entirely based on an XMLA script, you’ll be much more productive using the graphical user interface in BIDS to generate this metadata script. The reason you can generate XMLA in SSMS is that when you need to re-create OLAP objects, you need the XMLA to do so. So XMLA is used to copy, move, and back up SSAS objects. In fact, you can execute the XMLA query you’ve generated using SSMS. We take a closer look at querying later in this chapter. Now that you’ve seen how to work with objects, we’ll simply repeat the pattern for OLAP cubes and data mining structures. That is, we’ll first view the cube or structure using the Browse option, review configurable administrative properties, and then take a look at the XMLA that is generated. We won’t neglect querying either. After we examine browsing, properties, and scripting for cubes and models, we’ll look at querying the objects using the appropriate language—MDX, DMX, or XMLA.
How Do I View OLAP Cubes? The OLAP cube browser built into SSMS is identical to the one you'll be working with in BIDS when you're developing your cubes. It's a sophisticated pivot table–style interface. The more familiar you become with it, the more productive you'll be. Just click Browse on the shortcut menu after you've selected any cube in the Object Explorer in SSMS to get started. Doing this presents you with the starter view. This view includes the use of hint text (such as Drop Totals Or Detail Fields Here) in the center work area that helps you understand how best to use this browser. On the left side of the browser, you're presented with another object browser. This is where you select the items (or aspects) of the cube you want to view. You can select measures, dimension attributes, levels, or hierarchies. Note that you can select a particular measure as a filter from the drop-down list box at the top of this object browser. Not only will this filter the measures selected, it will also filter the associated dimensions so that you're selecting from an appropriate subset as you build your view. Measures can be viewed in the Totals work area. Dimension attributes, levels, or hierarchies can be viewed on the Rows, Columns, or Filters (also referred to as slicers) axis. These axes are labeled with the hint text Drop xxx Fields Here. We'll look at Filters or Slicers axes in more detail later in this chapter. At the top of the browser, you can select a perspective. A perspective is a defined view of an OLAP cube. You can also select a language. Directly below that is the Filter area, where you can create a filter expression (which is actually an MDX expression) by dragging and dropping a dimension level or hierarchy into that area and then completing the rest of the information—that is, configuring the Hierarchy, Operator, and Filter Expression options. We'll be demonstrating this shortly. To get started, drag one or more measures and a couple of dimensions to the Rows and Columns axes. We'll do this and show you our results in Figure 6-10.
To set up our first view, we filtered our list by the Internet Sales measure group in the object browser. Next we selected Internet Sales Amount and Internet Order Quantity as our measures and dragged them to that area of the workspace. We then selected the Product Categories hierarchy of the Product dimension and dragged it to the Rows axis. We also selected the Sales Territory hierarchy from the Sales Territory dimension and dragged it to the Columns axis. We drilled down to show detail for the Accessories product category and Gloves subcategory under the Clothing product category on the Rows axis. And finally, we filtered the Sales Territory Group information to hide the Pacific region. The small blue triangle next to the Group label indicates that a filter has been applied to this data. If you want to remove any item from the work area, just click it and drag it back to the left side (list view). Your cursor will change to an X, and the item will be removed from the view. It’s much more difficult to write the steps as we just did than to actually do them! And that is the point. OLAP cubes, when correctly designed, are quick, easy, and intuitive to query. What you’re actually doing when you’re visually manipulating the pivot table surface is generating MDX queries. The beauty of this interface is that end users can do this as well. Gone are the days that new query requests of report systems require developers to rewrite (and tune) database queries.
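To make that concrete, the browser builds a statement roughly like the following behind the scenes; this is a hand-simplified sketch rather than the verbatim generated text, and the measure and hierarchy names assume the Adventure Works sample cube:

SELECT
  NON EMPTY { [Sales Territory].[Sales Territory].[Group].Members } ON COLUMNS,
  NON EMPTY { [Product].[Product Categories].[Category].Members } ON ROWS
FROM [Adventure Works]
WHERE ( [Measures].[Internet Sales Amount] )

The real generated MDX is typically more verbose because it also carries cell property and formatting requests for the client.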
Figure 6-10 Building an OLAP cube view in SSMS
Let’s add more sophistication to our view. To do this, we’ll use the filter and slicer capabilities of the cube browser. We’ll also look at the pivot capability and use the built-in common queries. To access the latter, you can simply right-click on a measure in the measures area of the designer surface and select from a shortcut menu, which presents you with common queries, such as Show Top 10 Values and other options as well. Figure 6-11 shows our results.
Figure 6-11 Results of building an OLAP cube view in SSMS
Here are the steps we took to get there. First we dragged the Promotions hierarchy from the Promotion dimension to the slicer (Filter Fields) area. We then set a filter by clearing the check boxes next to the Reseller promotion dimension members. This resulted in showing data associated only with the remaining members. Note that the label indicates this as well by displaying the text “Excluding: Reseller.” We then dragged the Ship Date.Calendar Year hierarchy from the Ship Date dimension; we set the Operator area to Equal, and in the Filter Expression area we chose the years 2003 and 2004 from the available options. Another area to explore is the nested toolbar inside of the Browser subtab. Using buttons on this tab toolbar, you can connect as a different user and sort, filter, and further manipulate the data shown in the working pivot table view. Note that there is an option to show only the top or bottom values (1, 2, 5, 10, or 25 members or a percentage). Finally, if drillthrough is enabled for this cube, you can drill through using this browser by right-clicking on a data cell and selecting that option. Drillthrough allows you
to see additional columns of information that are associated with the particular fact item (or measure) that you’ve selected. You should spend some time experimenting with all the toolbar buttons so that you’re thoroughly familiar with the different built-in query options. Be aware that each time you select an option, you’re generating an MDX query to the underlying OLAP cube. Note also that when you select cells in the grid, additional information is shown in a tooltip. You can continue to manipulate this view for any sort of testing purposes. Possible actions also include pivoting information from the rows to the column’s axis, from the slicer to the filter, and so on. Conceptually, you can think of this manipulation as somewhat similar to working with a Rubik’s cube. Of course, OLAP cubes generally contain more than three dimensions, so this analogy is just a starting point.
Viewing OLAP Cube Properties and Metadata If you next want to view the administrative properties associated with the particular OLAP cube that you’re working with (as you did for dimensions), you simply right-click that cube in the SSMS Object Browser and then click Properties. Similar to what you saw when you performed this type of action on an OLAP dimension, you’ll then see a dialog box similar to the one shown in Figure 6-12 that allows you to view some properties. The only properties you can change in this view are those specifically associated with cube processing. As mentioned previously, we’ll look at cube processing options in more detail in Chapter 9.
Figure 6-12 OLAP cube properties in SSMS
By now, you can probably guess how you’d generate an XMLA metadata script for an OLAP cube in SSMS. Just right-click the cube in the object browser and click Script Cube As on the shortcut menu, choose Create To, and select New Query Editor Window. Note also that you can generate XMLA scripts from inside any object property window. You do this by clicking the Script button shown at the top of Figure 6-12. Now that we’ve looked at both OLAP dimensions and cubes in SSMS, it’s time to look at a different type of object—SSAS data mining structures. Although conceptually different, data mining (DM) objects are accessed using methods identical to those we’ve already seen—that is, browse, properties, and script.
How Do I View DM Structures? As we begin our tour of SSAS data mining structures, we need to remember a couple of concepts that were introduced earlier in this book. Data mining structures are containers for one or more data mining models. Each data mining model uses a particular data mining algorithm. Each data mining algorithm has one or more data mining algorithm viewers associated with it. Also, each data mining model can be viewed using a viewer as well via a lift chart. New to SQL Server 2008 is the ability to perform cross validation. Because many of these viewing options require more explanation about data mining structures, at this point we’re going to stick to the rhythm we’ve established in this chapter—that is, we’ll look at a simple view, followed by the object properties, and then the XMLA. Because the viewers are more complex for data mining objects than for OLAP objects, we’ll spend a bit more time exploring. We’ll start by browsing the Customer Mining data mining structure. Figure 6-13 shows the result. What you’re looking at is a rendering of the Customer Clusters data mining model, which is part of the listed structure. You need to select the Cluster Profiles tab to see the same view. Note that you can make many adjustments to this browser, such as legend, number of histogram bars, and so on. At this point, some of the viewers won’t make much sense to you unless you have a background using data mining. Some viewers are more intuitive than others. We’ll focus on showing those in this section. It’s also important for you to remember that although these viewers are quite sophisticated, SSMS is not an end-user client tool. We find ourselves using the viewers in SSMS to demonstrate proof-of-concept ideas in data mining to business decision makers (BDMs), however. If these viewers look familiar to you, you’ve retained some important information that we presented in Chapter 2, “Visualizing Business Intelligence Results.” These viewers are nearly identical to the ones that are intended for end users as part of the SQL Server 2008 Data Mining Add-ins for Office 2007. When you install the free add-ins, these data mining viewers become available as part of the Data Mining tab on the Excel 2007 Ribbon. Another consideration for you is this—similar to the OLAP cube pivot table viewer control in SSMS that we just finished looking at, these data mining controls are also part of BIDS.
Figure 6-13 Data mining structure viewer in SSMS
In our next view, shown in Figure 6-14, we’ve selected the second mining model, Subcategory Associations, associated with the selected mining structure. Because this second model has been built using a different mining algorithm, after we make this selection the Viewer dropdown list automatically updates to list the associated viewers available for that particular algorithm. We then chose the Dependency Network tab from the three available views and did a bit of tuning of the view, using the embedded toolbar to produce the view shown (for example, sized it to fit, zoomed it, and so on). An interesting tool that is part of this viewer is the slider control on the left side. This control allows you to dynamically adjust the strength of association shown in the view. We’ve found that this particular viewer is quite intuitive, and it has helped us to explain the power of data mining algorithms to many nontechnical users. As you did with the OLAP pivot table viewer, you should experiment with the included data mining structure viewers. If you feel a bit frustrated because some visualizations are not yet meaningful to you, we ask that you have patience. We devote Chapter 12, “Understanding Data Mining Structures,” to a detailed explanation of the included data mining algorithms. In that chapter, we’ll provide a more detailed explanation of most included DM views.
Figure 6-14 Data mining structure viewer in SSMS showing the Dependency Network view for the Microsoft Association algorithm
Tip You can change any of the default color schemes for the data mining viewers in SSMS by adjusting the colors via Tools, Options, Designers, Analysis Services Designers, Data Mining Viewers. Because the processes for viewing the data mining object administrative properties and for generating an XMLA script of the object’s metadata are identical to those used for OLAP objects, we won’t spend any more time reviewing them here.
How Do You Query SSAS Objects? As with relational data, you have the ability to write and execute queries against multidimensional data in SSMS. This is, however, where the similarity ends. The reason is that when you work in an RDBMS, you need to write any query to the database using SQL. Even if you generate queries using tools, you'll usually choose to perform manual tuning of those queries. Tuning steps can include rewriting the SQL, altering the indexing on the involved tables, or both. SSAS objects can be, and sometimes are, queried manually. However, the extent to which you'll choose to write manual queries will be considerably less than the extent to which you'll query relational sources. What are the reasons for this? There are several:
■■ MDX and DMX language expertise is rare among the developer community. With less experienced developers, the time to write and optimize queries manually can be prohibitive.
■■ OLAP cube data is often delivered to end users via pivot table–type interfaces (that is, Excel, or some custom client that uses a pivot table control). These interfaces include the ability to generate MDX queries by dragging and dropping members of the cube on the designer surface—in other words, by visual query generation.
■■ SSMS and BIDS have many interfaces that also support the idea of visual query generation for both MDX and DMX. This feature is quite important to developer productivity.
What we’re saying here is that although you can create manual queries, and SSMS is the place to do this, you’ll need to do this significantly less frequently while working with SSAS objects (compared to what you have been used to with RDBMS systems). It’s very important for you to understand and embrace this difference. Visual development does not mean lack of sophistication or power in the world of SSAS. As you move toward understanding MDX and DMX, we suggest that you first monitor the queries that SSMS generates via the graphical user interface. SQL Server Profiler is an excellent tool to use when doing this.
What Is SQL Server Profiler? SQL Server Profiler is an activity capture tool for the database engine and SSAS that ships with SQL Server 2008. SQL Server Profiler broadly serves two purposes. The first is to monitor activity for auditing or security purposes. To that end, SQL Server Profiler can be easily configured to capture login attempts, access specific objects, and so on. The other main use of the tool is to monitor activity for performance analysis. SQL Server Profiler is a powerful tool—when used properly, it’s one of the keys to understanding SSAS activity. We caution you, however, that SQL Server Profiler can cause significant overhead on production servers. When you’re using it, you should run it on a development server or capture only essential information. SQL Server Profiler captures are called traces. Appropriately capturing only events (and associated data) that you’re interested in takes a bit of practice. There are many items you can capture! The great news is that after you’ve determined the important events for your particular business scenario, you can save your defined capture for reuse as a trace template. If you’re familiar with SQL Server Profiler from using it to monitor RDBMS data, you’ll note that when you set the connection to SSAS for a new trace, SQL Server Profiler presents you with a set of events that is specific to SSAS to select from. See the SQL Server Books Online topics “Introduction to Monitoring Analysis Services with SQL Server Profiler” and “Analysis Services Event Classes” for more detailed information. Figure 6-15 shows some of the events that you can choose to capture for SSAS objects. Note that in this view, we’ve selected Show All Events in the dialog box. This switch is off by default. After you’ve selected which events (and what associated data) you want to capture, you can run your trace live, or you can save the results either to a file or to a relational table for you
to rerun and analyze later. The latter option is helpful if you want to capture the event on a production server and then replay the trace on a development server for analysis and testing of queries. At this point, we're really just going to use SQL Server Profiler to view MDX queries that are generated when you manipulate the dimension and cube browsers in SSMS. The reason we're doing this is to introduce you to the MDX query language. You can also use SQL Server Profiler to capture generated DMX queries for data mining structures that you manipulate using the included browsers in SSMS.
Figure 6-15 SQL Server Profiler allows you to capture SSAS-specific events for OLAP cubes and data mining structures.
To see how query capture works, just start a trace in SQL Server Profiler, using all of the default capture settings, by clicking Run on the bottom right of the Trace Properties dialog box. With the trace running, switch to SSMS, right-click on the Adventure Works sample cube in the object browsers, click Browse, and then drag a measure to the pivot table design area. We dragged the Internet Sales Amount measure for our demo. After you’ve done that, switch back to SQL Server Profiler and then click on the pause trace button on the toolbar. Scroll through the trace to the end, where you should see a line with the EventClass showing Query End and EventSubclass showing 0 - MDXQuery. Then click that line in the trace. Your results should look similar to Figure 6-16. Note that you can see the MDX query that was generated by your drag action on the pivot table design interface in SSMS. This query probably doesn’t seem very daunting to you, particularly if you’ve worked with Transact-SQL before. Don’t be fooled, however; this is just the tip of the iceberg.
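For reference, what appears in the TextData column at this point is usually little more than a single-measure SELECT, something close to the following (an approximation; the exact text and the list of cell properties vary by client and cube):

SELECT NON EMPTY { [Measures].[Internet Sales Amount] } ON COLUMNS
FROM [Adventure Works]
CELL PROPERTIES VALUE, FORMATTED_VALUE, FORMAT_STRING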
Figure 6-16 SQL Server Profiler allows you to view MDX query text details.
Now let’s get a bit more complicated. Click the Play button in SQL Server Profiler to start the trace again. After that, return to the SSMS OLAP cube pivot table browse area and then drag and drop some dimension information (hierarchies or members) to the rows, columns, slicer, and filter areas. After you have completed this, return to SQL Server Profiler and again pause your trace and then examine the MDX query that has been generated. Your results might look similar to what we show in Figure 6-17. You can see if you scroll through the trace that each action you performed by dragging and dropping generated at least one MDX query.
Figure 6-17 Detail of a complex MDX query
We find SQL Server Profiler to be an invaluable tool in helping us to understand exactly what type of MDX query is being generated by the various tools (whether developer, administrator, or end user) that we use. Also, SQL Server Profiler does support tracing data mining activity. To test this, you can use the SSMS Object Browser to browse any data mining model while a SQL Server Profiler trace is active. In the case of data mining, however, you're not presented with the DMX query syntax. Rather, what you see in SQL Server Profiler is the text of the call to a data mining stored procedure. So the results in SQL Server Profiler look something like this:

CALL System.Microsoft.AnalysisServices.System.DataMining.AssociationRules.GetStatistics('Subcategory Associations')

These results are also strangely categorized as 0 - MDXQuery type queries in the EventSubclass column of the trace. You can also capture data mining queries using SQL Server Profiler. These queries are represented by the EventSubclass type 1 - DMXQuery in SQL Server Profiler.
We’ll return to SQL Server Profiler later in this book, when we discuss auditing and compliance. Also, we’ll take another look at this tool in Chapters 10 and 11, which we devote to sharing more information about manual query and expression writing using the MDX language. Speaking of queries, before we leave our tour of SSMS, we’ll review the methods you can use to generate and execute manual queries in this environment.
Using SSAS Query Templates Another powerful capability included in SSMS is that of being able to write and execute queries to SSAS objects. These queries can be written in three languages: MDX, DMX, and XMLA. At this point, we’re not yet ready to do a deep dive into the syntax of any of these three languages; that will come later in this book. Rather, here we’d like to understand the query execution process. To that end, we’ll work with the included query templates for these three languages. To do this, we need to choose Template Explorer from the View menu, and then click the Analysis Services (cube) icon to show the three folders with templated MDX, DMX, and XMLA queries. The Template Explorer is shown in Figure 6-18.
Figure 6-18 SSMS includes MDX, DMX, and XMLA query templates
You can see that the queries are further categorized into functionality type in child folders under the various languages—such as Model Content and Model Management under DMX. You can also create your own folders and templates in the Template Explorer by right-clicking and then clicking New. After you do this, you're actually saving the information to this location on disk: C:\Users\Administrator\AppData\Roaming\Microsoft\Microsoft SQL Server\100\Tools\Shell\Templates\AnalysisServices.
Using MDX Templates Now that you’ve opened the templates, you’ll see that for MDX there are two types of queries: expressions and queries. Expressions use the syntax With Member and create a calculated member as part of a sample query. You can think of a calculated member as somewhat analogous to a calculated cell or set of cells in an Excel workbook, with the difference being that calculated members are created in n-dimensional OLAP space. We’ll talk in greater depth about when, why, and how you choose to use calculated members in Chapter 9. Queries retrieve some subset of an OLAP cube as an ADO.MD CellSet result, and they do not contain calculated members. To execute a basic MDX query, simply double-click the Basic Query template in the Template Explorer and then connect to SSAS. You can optionally write queries in a disconnected state and then, when ready, connect and execute the query. This option is available to reduce resource consumption on production servers. You need to fill the query parameters with actual cube values before you execute the query. Notice that the query window opens yet another metadata explorer in addition to the default Object Explorer. You’ll probably want to close Object Explorer when executing SSAS queries in SSMS. Figure 6-19 shows the initial cluttered, cramped screen that results if you leave all the windows open. It also shows the MDX parser error that results if you execute a query with errors. (See the bottom window, in the center of the screen, with text underlined with a squiggly line.) Now we’ll make this a bit more usable by hiding the Object Explorer and Template Explorer views. A subtle point to note is that the SSAS query metadata browser includes two filters: a Cube filter and, below it, a Measure Group filter. The reason for this is that SSAS OLAP cubes can contain hundreds or even thousands of measure groups. Figure 6-20 shows a cleaned-up interface. We’ve left the Cube filter set at the default, Adventure Works, but we’ve set the Measure Group filter to Internet Sales. This reduces the number of items in the viewer, as it shows only items that have a relationship to measures associated with the selected measure group. Also note that in addition to a list of metadata, this browser includes a second nested tab called Functions. As you’d expect, this tab contains an MDX function language reference list.
Figure 6-19 The SSMS SSAS query screen can be quite cluttered by default.
You might be wondering why you’re being presented with yet another metadata interface, particularly because you’re inside of a query-writing tool. Aren’t you supposed to be writing the code manually here? Nope, not yet. Here’s the reason why—MDX object naming is not as straightforward as it looks. For example, depending on uniqueness of member names in a dimension, you sometimes need to list the ordinal position of a member name; at other times, you need to actually list the name. Sound complex? It is. Dragging and dropping metadata onto the query surface can make you more productive if you’re working with manual queries. To run the basic query, you need to replace the items shown in the sample query between angle brackets—that is, <some value>—with actual cube metadata. Another way to understand this is to select Specify Values For Template Parameters on the Query menu. You can either type the information into the Template Parameters dialog box that appears, or you can click on any of the metadata from the tree view in the left pane and then drag it and drop it onto the designer surface template areas.
Figure 6-20 The SSMS SSAS query screen with fewer items in the viewer
We’ll use the latter approach to build our first query. We’ll start by dragging the cube name to the From clause. Next we’ll drag the Customers.Customer Geography hierarchy from the Customer dimension to the On Columns clause. We’ll finish by dragging the Date.Calendar Year member from the Date hierarchy and Calendar hierarchy to the On Rows clause. We’ll ignore the Where clause for now. As with Transact-SQL queries, if you want to execute only a portion of a query, just select the portion of interest and press F5. The results are shown in Figure 6-21.
Figure 6-21 SSMS SSAS query using simple query syntax
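In text form, the statement assembled by those drag operations looks roughly like this (our reconstruction rather than a verbatim copy of Figure 6-21; the names assume the Adventure Works sample cube):

SELECT
  [Customer].[Customer Geography] ON COLUMNS,
  [Date].[Calendar Year] ON ROWS
FROM [Adventure Works]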
Do the results seem curious to you? Are you wondering which measure is being shown? Are you wondering why only the top-level member of each of the selected hierarchies is shown on columns and rows? As we’ve said, MDX is a deceptively simple language. If you’ve worked with Transact-SQL, which bears some structural relationship but is not very closely related at all, you’ll find yourself confounded by MDX. We do plan to provide you with a thorough grounding in MDX. However, we won’t be doing so until much later in this book—we’ll use Chapters 10 and 11 to unravel the mysteries of this multidimensional query language. At this point in our journey, it’s our goal to give you an understanding of how to view and run prewritten MDX queries. Remember that you can also re-execute any queries that you’ve captured via SQL Server Profiler traces in the SSMS SSAS query environment as well. Because we know that you’re probably interested in just a bit more about MDX, we’ll add a couple of items to our basic query. Notably, we’ll include the MDX Members function so that we can display more than the default member of a particular hierarchy on an axis. We’ll also implement the Where clause so that you can see the result of filtering. The results are shown in Figure 6-22. We changed the dimension member information on Columns to a specific level (Country), and then we filtered in the Where clause to the United States only. The second part of the Where clause is an example of the cryptic nature of MDX. The segment [Product].[Product Categories].[Category].&[1] refers to the category named Bikes. We used the drag (metadata) and drop method to determine when to use names and when to use ordinals in the query. This is a time-saving technique you’ll want to use as well.
Figure 6-22 MDX query showing filtering via the Where clause
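A query in the same spirit as the one shown in Figure 6-22, reconstructed here for readability rather than copied verbatim from the figure, looks like the following. We reference the country member by name because we have not verified its key, while the Bikes key &[1] comes from the discussion above:

SELECT
  [Customer].[Customer Geography].[Country].Members ON COLUMNS,
  [Date].[Calendar Year].[Calendar Year].Members ON ROWS
FROM [Adventure Works]
WHERE ( [Customer].[Country].[United States],
        [Product].[Product Categories].[Category].&[1] )

Because the Country attribute and the Customer Geography hierarchy belong to the same dimension, the slicer also restricts the countries that appear on the columns axis to United States.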
Using DMX Templates Next we'll move to the world of DM query syntax. Again, we'll start by taking a look at the included templates in the Template Explorer. They fall into four categories: Model Content, Model Management, Prediction Queries, and Structure Content. When you double-click on a DMX query template, you'll see that the information in the Metadata browser reflects a particular mining model. You can select different mining model metadata in the pick list at the top left of the browser. Also, the functions shown now include those specific to data mining. The Function browser includes folders for each data mining algorithm, with associated functions in the appropriate folder. Because understanding how to query data mining models requires a more complete understanding of the included algorithms, we'll simply focus on the mechanics of DMX query execution in SSMS at this point. To do this, we'll double-click the Model Attributes sample DMX query in the Model Content folder that you access under DMX in the Template Explorer. Then we'll work with the templated query in the workspace. As with templated MDX queries, the DMX templates indicate parameters with the angle-bracket placeholder syntax. You can also click the Query menu and select Specify Values For Template Parameters as you can with MDX templates. We'll just drag the [Customer Clusters] mining model to the template replacement area. Note that you must include both the square brackets and the single quotes, as shown in Figure 6-23, for the query to execute successfully.
Figure 6-23 A DMX query showing mining model attributes
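If you prefer to type the query yourself, a simple model content query in the same family as the Model Attributes template (this is our own generic example, not the verbatim template text) looks like this:

SELECT ATTRIBUTE_NAME, NODE_CAPTION, NODE_TYPE
FROM [Customer Clusters].CONTENT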
If you click on the Messages tab in the results area (at the bottom of the screen), you'll see that some DMX queries return an object of type Microsoft.AnalysisServices.AdomdClient.AdomdDataReader. Other DMX query types, such as DMX prediction queries, return scalar values. For more information, see the SQL Server Books Online topic "Data Mining Extensions (DMX) Reference."
Using XMLA Templates As with the previous two types of templates, SSMS is designed to be an XMLA query viewing and execution environment. The SSMS Template Explorer also includes a couple of types of XMLA sample queries. These are Management, Schema Rowsets, and Server Status. The XMLA language is an XML dialect, so structurally it looks like XML rather than a data structure query language, such as MDX or DMX (which look rather Transact-SQL-like at first glance). One important difference between MDX and XMLA is that XMLA is case-sensitive and space-sensitive, following the rules of XML in general. Another important difference is that the Metadata and Function browsers are not available when you perform an XMLA query. Also, the results returned are in an XML format. In Figure 6-24, we show the results of executing the default Connections template. This shows detailed information about who is currently connected to your SSAS instance. Be reminded that metadata for all SSAS objects—that is, OLAP dimensions, cubes, data mining models, and so on—can easily be generated in SSMS by simply right-clicking the object in the Object Browser and then clicking Script As. This is a great way to begin to understand the capabilities of XMLA. In production environments, you’ll choose to automate many administrative tasks using XMLA scripting. The templates in SSMS represent a very small subset of the XMLA commands that are available in SSAS. For a more complete reference, see the SQL Server Books Online topic “Using XMLA for Analysis in Analysis Services (XMLA).” Another technical note: certain commands used in XMLA are associated with a superset of commands in the Analysis Services Scripting Language (ASSL). The MSDN documentation points out that ASSL commands include both data definition language (DDL) commands, which define and describe instances of SSAS and the particular SSAS database, and also XMLA action commands such as Create, which are then sent to the particular object named by the ASSL. ASSL information is also referred to as binding information in SQL Server Books Online.
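For reference, the Connections template reduces to an XMLA Discover call along the following lines (a sketch from memory; the template that ships with your build may include additional restriction and property elements):

<Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
  <RequestType>DISCOVER_CONNECTIONS</RequestType>
  <Restrictions>
    <RestrictionList />
  </Restrictions>
  <Properties>
    <PropertyList />
  </Properties>
</Discover>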
Figure 6-24 SSAS XMLA connections query in SSMS
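As a rough sketch of what such a template contains, the Connections query boils down to an XMLA Discover call against the DISCOVER_CONNECTIONS schema rowset, along the following lines. Treat the exact element contents as illustrative; the template you open in SSMS may include additional restrictions or properties.

    <Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
      <RequestType>DISCOVER_CONNECTIONS</RequestType>
      <Restrictions>
        <RestrictionList />
      </Restrictions>
      <Properties>
        <PropertyList />
      </Properties>
    </Discover>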
Closing Thoughts on SSMS

Although our primary audience is developers, as discussed, we've found that many SSAS developers are also tasked with performing SSAS administrative work. For this reason, we spent an entire chapter exploring the SSAS interface in SSMS. Also, we find that using SSMS to explore built objects is a gentle way to introduce OLAP and DM concepts to many interested people. We've used SSMS to demonstrate these concepts to audiences ranging from .NET developers to business analysts. Finally, we'd like to note that we're continually amazed at the richness of the interface. Even after having spent many years with SSAS, we still frequently find little time-savers in SSMS.
Summary

In this chapter, we took a close look at the installation of SSAS. We then discussed some tools you'll be using to work with SSAS objects. We took a particularly detailed look at SSMS because we've found ourselves using it time and time again on BI projects where we were tasked with being developers. Our real-world experience has been that SSAS developers must often also perform administrative tasks, so knowledge of SSMS can be a real time-saver. We included an introduction to SQL Server Profiler because we've found that many clients use this powerful tool incorrectly, or not at all, because of a lack of understanding of it. By now, we're sure you're more than ready to get started developing your first OLAP cube using BIDS. That's exactly what we'll be doing starting in the next chapter and then continuing on through several additional chapters—being sure to hit all the important nooks and crannies along the way.
Chapter 7
Designing OLAP Cubes Using BIDS

In Chapters 1 through 6, we set the stage for you to begin developing Microsoft SQL Server Analysis Services (SSAS) OLAP cubes using Business Intelligence Development Studio (BIDS). In those chapters, we defined business intelligence (BI) and introduced you to some of its terminology, concepts, languages, and process and modeling methodologies. We then looked at other tools you'll be using, such as SQL Server Management Studio and Microsoft SQL Server Profiler. Now, you're ready to roll up your sleeves and get to work in BIDS. In this chapter, we'll tour the BIDS interface for Analysis Services OLAP cubes by looking at the Adventure Works sample cube in the Adventure Works DW 2008 OLAP database. We'll also build a simple cube from scratch so that you can see the process to do that.
Using BIDS

BIDS is the primary tool in which you'll be working as you develop your BI project. The good news is that many of the tool concepts that you've learned from working in SQL Server Management Studio (SSMS) are duplicated in BIDS, so you'll have a head start. In this section, we'll first focus on how to use BIDS and then on which specific tasks to perform, both in this chapter and through several more chapters, as we dig deep into OLAP cube and data mining structure building using BIDS. As mentioned previously, SSAS installs BIDS in one of two ways. Either BIDS installs as a set of templates into an existing Visual Studio 2008 SP1 installation, or it installs as a stand-alone tool called the Visual Studio Shell if no installation of Visual Studio is present when you install SSAS. In either case, you start with solutions and projects, so we'll begin there. When you're starting with a blank slate, you'll open BIDS and then create a new project. For SSAS objects—which include both OLAP cubes and data mining (DM) structures—you'll use the SSAS template named Analysis Services Project, which is shown in the New Project dialog box in Figure 7-1. You can see in this figure that the dialog box also includes templates for SQL Server Integration Services (SSIS) and SQL Server Reporting Services (SSRS). Those will be covered in the sections pertaining to the creation of those types of objects—that is, packages and reports.
Figure 7-1 BIDS installs as a series of templates in an existing Visual Studio 2008 instance.
After you click OK, you'll see the full BIDS development environment. If you have experience using Visual Studio, you'll find that many views, windows, and tools in BIDS are familiar to you from having worked with similar structures in Visual Studio. If the Visual Studio development environment is new to you, you'll have to allow yourself time to become accustomed to the development interface. Let's look first at the Solution Explorer view, which appears by default on the right side of your work area. It contains a series of containers, or folders, to group the various SSAS objects. If you click the Show All Files icon on the top toolbar in Solution Explorer, you'll see that exactly one file has been generated. In our case, it's using the default name of Analysis Services Project1.database. If you right-click that file name and then click View Code, you'll see the XMLA metadata. XMLA metadata is created as a result of actions or selections that you make while developing SSAS objects in the BIDS environment. This pattern is one that you'll see repeated as you continue to work in this environment. Although you can hand-edit the XMLA produced in BIDS, you won't often do so. Rather, you'll tend to use the generated XMLA in native form when performing administrative tasks—for example, backing up, restoring, copying, and moving SSAS objects—after you've completed your design work. You might remember from previous chapters that you can easily generate XMLA from SSMS as well. We're now almost ready to start creating SSAS objects, but first let's look at one more consideration. SSAS development has two modes: offline and online. You need to understand the implications of using either of these modes when doing development work using BIDS.
Offline and Online Modes

When you create a new SSAS project, you're working in an offline, or disconnected, mode. As previously mentioned, when you begin to create SSAS objects using the GUI designers, templates, and wizards in BIDS, you'll be creating XMLA metadata, MDX queries or expressions, and DMX statements. These items must be built and then deployed to a live SSAS server instance for the objects to actually be created on the SSAS server (and for them to be available to be populated with source data). The steps for doing this are building and deploying. Both steps can be completed using BIDS. You can also use other methods, such as SSMS or script, to deploy, but not to perform a build.
After SSAS objects have been built and deployed, you then have a choice about the subsequent development method. You can work live (connected to the SSAS service), or you can work in a disconnected fashion. Both methods have advantages and disadvantages. In the case of live development, of course, there's no lag or latency when implementing your changes. However, you're working with a live server; if you choose to work with a production server, you could end up making changes that adversely affect performance for other users. In the worst case, you could make breaking changes to objects that others expect to access. For these reasons, we recommend using the live development approach only when using dedicated development servers in a single-developer environment. To connect to an existing, live SSAS solution using BIDS, choose Open from the File menu and then select Analysis Services Database. You'll then be presented with the dialog box shown in Figure 7-2. There you'll select the SSAS instance and SSAS database name, and indicate whether you'd like to add it to the current solution. We'll open the Adventure Works DW 2008 sample. As mentioned previously, this sample is downloadable from CodePlex. Again, we remind you that if you choose to work while connected, all changes you make are immediately applied to the live (deployed) instance.
Figure 7-2 Connecting to a live SSAS database in BIDS
Your other option is to work in an offline mode. If you choose this option and have more than one SSAS developer making changes, you must be sure to select and implement a source control solution, such as Visual Studio Team System or something similar. The reason
for this is that the change conflict resolution process in BIDS is primitive at best. What we mean is this: If multiple developers attempt to process changes, only the last change will win and all interim changes will be overwritten. This behavior is often not desirable. If more than one developer is on your team, you’ll generally work in offline mode. Teams working in offline mode must understand this limitation. BIDS gives you only minimal feedback to show whether you’re working live or not. If you’re working live, the name of the server to which you’re connected (in our case, WINVVP8K0GA45C) is shown in the title bar, as you can see in Figure 7-3.
Figure 7-3 The server name is shown in BIDS if you're working in connected mode.
As we show you how to create BIDS objects, we’ll use two instances of BIDS running on the same computer. This is a technique that is best suited only to learning. We’ll first look at the sample SSAS objects by type using the live, connected BIDS instance and the Adventure Works DW 2008 sample. At each step, we’ll contrast that with the process used to create these objects using the default, blank SSAS template and disconnected BIDS instance. We’ll follow this process over the next few chapters as we drill down into the mechanics of creating these objects in BIDS.
Working in Solution Explorer

The starting point for most of your development work will be to create or update SSAS objects listed in Solution Explorer. The simplest way to work with this interface is to select the object or container (folder) of interest and then click the relevant item on the shortcut menu. All the options are also available on the main menu of BIDS, but we find it faster to work directly in Solution Explorer. One other BIDS usability tip to keep in mind is that object properties appear in two locations. They're located in the Properties window at the bottom right, below Solution Explorer, as has been traditional in Visual Studio, and sometimes you'll find additional property sheets after you click the Properties item on an object's shortcut menu. Also, the properties available change depending on whether you're working in live or disconnected mode. For example, if you right-click the top node of the disconnected BIDS instance and select Properties on the resulting menu, you see the property sheet shown in Figure 7-4. It allows you to configure various options related to building, debugging, and deploying the SSAS solution. However, if you attempt that same action in the connected instance, no pop-up dialog box is shown. Rather, an empty Properties window is displayed in the bottom right of the development environment. If you're working live, and there's no need to configure build, debug, or
deploy settings, this behavior makes sense. However, the surface inconsistency in the interface can be distracting for new SSAS developers.
Figure 7-4 SSAS top-level properties for a disconnected instance
In addition, some shortcut-menu options appear in unexpected locations. An example of this is the Edit Database option, which is available after you right-click the top node in Solution Explorer. This brings up the General dialog box, which has configurable properties. On the Warnings tab, shown in Figure 7-5, you can enable or disable the design best practice warnings that have been added to SQL Server 2008 SSAS. These warnings were added because many customers failed to follow best practices in OLAP and data mining modeling and design when using SSAS 2005, which resulted in BI objects that performed poorly under production load. We'll be looking more closely at these warnings as we proceed with object building.
Figure 7-5 SSAS database properties for a disconnected instance
At this point, we’ll just point out where you can review the master list and enable or turn off warnings. These are only recommendations. Even if you leave all the warnings enabled, you’ll still be able to build SSAS objects with any kind of structure using BIDS. None of the warnings prevent an object from building. Object designs that violate any of the rules shown in the warnings result in a blue squiggly line under the offending code and a compiler warning at build time. Illegal design errors produce red squiggly line warnings, and you must first correct those errors before a successful build can occur. We’ll look at a subset of SSAS objects first. These objects are common to all BI solutions. That is, you’ll create and use data sources, data source views, roles, and (optionally) assemblies when you build either cubes (which contain dimensions) or mining structures. You can,
of course, build both cubes and mining structures in the same solution, although this is less common in production situations. In that case, we typically create separate solutions for cubes and for mining structures. Also, this first group of objects is far simpler to understand than cubes or mining structures.
Data Sources in Analysis Services

An SSAS data source is simply a connection to some underlying source of data. Keep in mind that SSAS can use any type of data source that it can connect to. Many of our customers have had the misperception that SSAS could use only SQL Server data as source data. This is not true! As mentioned earlier, SSAS is now the top OLAP server for RDBMS systems other than SQL Server, notably Oracle. The reason for this is that businesses find total cost of ownership (TCO) advantages in Microsoft's BI offering compared to BI products from other vendors. We'll start by examining the data source that is part of the Adventure Works DW 2008 sample. First, double-click the Adventure Works DW data source in Solution Explorer. You'll then see the Data Source Designer dialog box with editable configuration information. The General tab contains the Data Source Name, Provider, Connection String, Isolation Level, Query Timeout, Maximum Number Of Connections, and (optional) Description sections. The Impersonation Information tab, shown in Figure 7-6, contains the options for connection credentials.
Figure 7-6 Data source configuration dialog box for connection impersonation settings
In Chapter 6, “Understanding SSAS in SSMS and SQL Server Profiler,” we talked about the importance of using an appropriately permissioned service account for SSAS. Reviewing the Data Source Designer dialog box, you might be reminded of why this configuration is so important. Also, you might want to review the SQL Server Books Online topic, “Impersonation Information Dialog Box,” to understand exactly how credentials are passed by SSAS to other tiers.
Now we’ll switch to our empty (disconnected) BIDS instance and create a new data source object. The quickest way to do so is for you to right-click the data source folder in Solution Explorer and then click New Data Source. You’ll then be presented with a wizard that will guide you through the steps of setting up a data source. At this point, you might be surprised that you’re being provided with a wizard to complete a task that is relatively simple. There are two points to remember here. First, BIDS was designed for both developers and for nondevelopers (that is, administrators and business analysts) to be able to quickly and easily create SSAS objects. Second, you’ll reach a point where you’ll be happy to see a wizard because some SSAS objects are quite complex. We think it’s valuable to simply click through the wizard settings so that you see can see the process for creating any data source. If you want to examine or change any configuration settings, you can simply double-click the new data source you created. Also, you can view the XMLA metadata that the wizard has generated by right-clicking on the new data source in Solution Explorer and then clicking View Code. Before we leave data sources, we’ll cover just a couple more points. First, it’s common to have multiple data sources associated with a single SSAS project. These sources can originate from any type or location that SSAS is permitted to connect to. Figure 7-7 shows a list of included providers. Tip If you’re connecting to SQL Server data, be sure to use the default provider, Native OLE DB\ SQL Server Native Client 10.0. Using this provider will result in the best performance.
Figure 7-7 List of available providers for data source connections
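For reference, a connection string generated with this provider looks roughly like the following. The server and database names are placeholders, and the string BIDS generates for you may include additional settings.

    Provider=SQLNCLI10.1;Data Source=(local);Integrated Security=SSPI;Initial Catalog=AdventureWorksDW2008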
Second, it’s important that you configure data sources with least-privileged accounts. Too many times in the real world, we’ve seen poor practices, such as an administrator-level account being used for this. Failing to use least-privileged connection accounts for data source objects presents unacceptable security vulnerability. If you choose to use a specific Microsoft Windows user name and password, be aware that, by default, BIDS does not save passwords in the connection string. So you’ll be prompted for a password when you attempt to connect if you’re using this type of connection authentication. If you choose to use the credentials of the current user, be aware that BIDS does not support impersonation of the current user for object processing (for example, cubes). If you attempt to process using this authentication mode, you’ll receive an impersonation mode error during any processing attempts. Remember that once you move to production, you can choose to automate processing via XMLA scripting or by using SSMS. After you’ve created and configured your connections to data sources, you’ll next want to create data source views. We’ll define this object type, discuss the reasons for its existence, and then look at the mechanics of creating a data source view.
Data Source Views

A data source view (DSV) is an SSAS object that represents some type of view of the data that you've created a data source (or connection) to. If your environment is simple and small—for example, your source data originates from a single RDBMS, probably SQL Server, and you're the administrator of that server as well as the server where SSAS is running—the purpose of the DSV object will not be obvious to you. It's only when you consider larger BI projects, particularly those of enterprise size, that the reason for DSVs becomes clear. A DSV allows you to define some subset of your source data as data that is available for loading into OLAP cubes, data mining structures, or both. If you own (administer) the source data, it's likely that you've already created views (saved Transact-SQL queries) to prepare the data for loading. If that's the case, you can simply reference these saved queries rather than the underlying base tables. However, if you have limited permissions to the source systems (such as read-only access), as will generally be the case in larger BI implementations, DSVs allow you to pull in information in its original format and then perform shaping via the SSAS service. This shaping can include renaming source tables and columns, adding and removing relationships, and much more. Let's start by examining the DSVs that are part of the Adventure Works DW 2008 sample. To do this, navigate to the Data Source View folder in Solution Explorer and then double-click Adventure Works DW to open it in the designer. You might need to click the first entry in the Diagram Organizer in the top left of the workspace to have the tables
displayed on the designer surface. We adjusted the zoom to 25 percent to get all the tables to fit in the view. Figure 7-8 shows the result.
Figure 7-8 The data source view for the Adventure Works DW 2008 sample is complex.
This is the point where SSAS developers begin to see the complexity of an OLAP cube. You'll remember that a correctly designed cube is based on a series of source star schemas, and that those star schemas have been created based on validated grain statements (for example, "We want to view sales amount by each product, by each day, by each customer, and so on"). This flattened visualization of the entire cube is less than optimal for learning. We recommend that you examine a single source star schema. That is the purpose of the Diagram Organizer section at the top left of the workspace. To start, click the Internet Sales entry in the Diagram Organizer section. In addition to letting you visualize the star schemas—that is, tables and relationships—the DSV includes multiple viewers that are designed to help you visualize the source data. If you remember that the assumption is that you might have limited access either to the source data or to its included query tools, the inclusion of these viewers makes sense. To access these viewers—which include a table, pivot table, chart, and pivot chart view of the data—you simply right-click any table (or view) that is displayed on the DSV designer surface (or in the Tables list shown in the metadata browser to the left of the designer surface) and then click Explore Data.
After you’ve done that, if you’ve chosen a pivot table or pivot chart, you can manipulate the attribute values being shown. In the case of a pivot chart, you can also select the format of the chart. We find that this ability to quickly take a look at source data can be a real time saver when attempting to perform an informal validation of data, particularly for quick prototypes built with samples of actual data. Figure 7-9 shows a view of the data from the Products table shown in a chart view.
Figure 7-9 DSVs include pivot chart source data browsers.
In addition to viewing the source data, you can make changes to the metadata that will be loaded into your SSAS objects. You do this by changing existing metadata—that is, renaming tables, renaming columns, changing relationships, and so on—and by defining new columns and tables. New calculations are called named calculations; they are basically calculated columns. New tables are called named queries. Again, the thinking is that you'll use this feature if you can't or don't want to change source data. To add a named query in a particular DSV, right-click in an empty area on the designer surface and choose New Named Query. This will look quite familiar to SQL Server developers and administrators if SQL Server is the source RDBMS for your DSV because it is identical to the query designer dialog box that you see in SSMS when connected to SQL Server. Figure 7-10 shows this dialog box. This functionality allows you to create a view of one or more source tables or views to be added to your DSV. You can see in Figure 7-10 that we've selected multiple tables and, just
like in the RDBMS, when we did that, the query design tool automatically wrote the Transact-SQL join query for us.
Figure 7-10 Named queries can be added to DSVs.
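The join query the designer writes is ordinary Transact-SQL. A hand-written sketch of a named query against the AdventureWorksDW2008 tables might look something like the following; the column choices are ours, not the sample's.

    SELECT f.SalesOrderNumber,
           f.OrderQuantity,
           f.SalesAmount,
           p.EnglishProductName,
           c.LastName
    FROM dbo.FactInternetSales AS f
    INNER JOIN dbo.DimProduct AS p
        ON f.ProductKey = p.ProductKey
    INNER JOIN dbo.DimCustomer AS c
        ON f.CustomerKey = c.CustomerKey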
In addition to making changes to the DSV by adding named queries, which in essence produce new virtual tables, you can also make changes to the existing tables by adding virtual columns. This type of column is called a named calculation. To create one, right-click the table you want to affect in the DSV designer and then click New Named Calculation. A dialog box with the name Create Named Calculation will open on the DSV designer surface. You can then enter information into the dialog box. Unlike the Named Query designer, the Create Named Calculation dialog box gives you no prompting—you have to type the expression yourself, and it must be written in syntax that the source system can understand. Figure 7-11 shows the dialog box for a sample named calculation that is part of the Product table in the Adventure Works DW 2008 sample. Columns created via named calculations are shown in the source table with a small, blue calculator icon next to them. Now that we've explored the mechanics of how to create DSVs and what you can see and do (to make changes) in DSVs, let's talk a bit about best practices for DSV creation. It's with this object that we've seen a number of customers begin to go wrong when working in SSAS.
Figure 7-11 Named calculations can be added to tables in the DSV.
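A named calculation is simply a single column expression written in the source system's dialect. For a SQL Server source, an expression of the kind shown in Figure 7-11 might resemble the following CASE over the ProductLine codes in DimProduct. This is a sketch, not the sample's exact definition, and the code-to-name mappings are illustrative.

    CASE ProductLine
        WHEN 'M' THEN 'Mountain'
        WHEN 'R' THEN 'Road'
        WHEN 'S' THEN 'Accessory'
        WHEN 'T' THEN 'Touring'
        ELSE 'Components'   -- fallback label; adjust to your own data
    END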
As we mentioned, the DSVs are the basis for creating both OLAP cubes and DM structures. In particular, the OLAP cube designer in BIDS expects a star schema source structure from DSVs. The more closely you comply with this expectation in your creation of your DSV, the more easily and quickly you’ll be able to build cubes that perform well. We covered dimensional modeling extensively in Chapter 5, “Logical OLAP Design Concepts for Architects.” If you’re new to OLAP, you might want to review that information now. In a nutshell, each star schema should have at least one fact table and some related dimension tables. Dimension tables are denormalized structures, typically containing many columns (that is, wide tables), each describing one entity (for example, customer, product, or date). A dimension table should contain the original, source primary key and a newly generated unique primary key, as well as attribute information such as name, status, color, and so on. Individual dimensions are sourced (usually) from a single table. This is called a star design. Dimension tables can originate from multiple, related tables. This is called a snowflake design. An example of a snowflake dimension source is the group of Product, ProductSubcategory, and ProductCategory tables. There should be a business justification for snowflaking source dimension tables. An example of a business reason is that values are changeable in one source dimension table and not in another related one. Fact tables should be narrow (or contain few columns). They should contain foreign keys relating each row to one or more dimension table–type rows. Fact tables should also contain fact values. Fact values are sometimes called measures. Facts are usually numeric and most often additive. Some examples are OrderQuantity, UnitPrice, and DiscountAmount. A common mistake we’ve seen is for SSAS developers to simply make a DSV of an existing RDBMS without giving any consideration to OLAP modeling. Don’t do this! An OLAP cube is a huge, single structure intended to support aggregation and read-only queries. Although the SSAS engine is very fast and well optimized, it simply can’t perform magic on normalized
source data. The resultant cube will most often be difficult to use (and understand) for end users and will often be unacceptably slow to work with. If you're new to OLAP modeling, spend some time looking at the DSV associated with the Adventure Works DW 2008 sample. Then follow our earlier advice regarding a design-first approach—that is, write solid grain statements, design an empty OLAP destination structure, and then map source data to the destination structure. Finally, use the SSIS tools to perform data extract, transform, and load (as well as cleansing and validation!) and then materialize or populate the destination structure on disk. Do all of this prior to creating your DSV. We understand the effort involved. As mentioned, the extract, transform, and load (ETL) process can be more than 50 percent of the initial BI project's time; however, we haven't found any way around this upfront cost. In addition, cleaning and validating data is extremely important to the integrity of your BI solution. This is time well spent. The purpose of your BI project is to make timely, correct, validated data available in an easy-to-understand and quick-to-query fashion. You simply can't skip the OLAP modeling phase. We've seen this mistake made repeatedly and cannot overemphasize the importance of correct modeling prior to beginning development. Later in this chapter, we'll begin the cube and mining structure building process. First we've got a few more common items to review in BIDS. The first of these is the Role container.
Roles in Analysis Services It’s important that you understand that the only user with access to SSAS objects by default is the SSAS administrative user or users assigned during setup. In other words, the computer administrator or the SQL administrator do not get access automatically. This is by design. This architecture supports the “secure by default” paradigm that is simply best practice. To give other users access, you’ll need to create roles in SSAS. To do this, you simply right-click on the Roles container in BIDS and choose New Role. The interface is easy to use. Figure 7-12 shows the design window for roles.
Figure 7-12 SSAS roles allow you to enable access to objects for users other than the administrator.
The key to working with the role designer is to review all the tabs that are available in BIDS. You can see from Figure 7-12 that you have the following tabs to work with:
■ General Allows you to set SSAS database-level permissions. Remember that a database in BIDS includes the data source and DSV, so it includes all objects—that is, all cubes, all mining structures, and so on—associated with this particular project. You can (and will) set permissions more granularly if you create multiple SSAS objects in the same SSAS project by using the other tabs in the role designer.
■ Membership Allows you to associate Windows groups or users with this role. The Windows users or groups must already exist on the local machine or in Active Directory if your SSAS server is part of a domain.
■ Data Sources Allows you to assign permissions to specific data sources.
■ Cubes Allows you to assign permissions to specific cubes.
■ Cell Data Allows you to assign permissions to specific cells in particular cubes.
■ Dimensions Allows you to assign permissions to specific dimensions.
■ Dimension Data Allows you to assign permissions to specific dimension members.
■ Mining Structures Allows you to assign permissions to specific mining structures.
You’ll also want to take note of the tabs for the Roles object. When you select a particular tab, you’re presented with an associated configuration page where you can set permissions and other security options (such as setting the default viewable member) for the particular object that you’re securing via the role. Permission types vary by object type. For example, OLAP cubes with drillthrough (to source data) enabled require that you assign drillthrough permission to users who will execute drillthrough queries. You can also change default members that are displayed by user for cube measures and for particular dimensions and dimensional hierarchies. After we’ve reviewed and created the core SSAS objects—that is, cubes, dimensions, and mining structures—we’ll return to the topic of specific object permission types.
Using Compiled Assemblies with Analysis Services Objects

As with the SQL Server 2008 RDBMS, you can write .NET assemblies in a .NET language, such as C# or Visual Basic .NET, which will extend the functionality available to a particular SSAS instance or database. You can also write assemblies as COM libraries. You can create types and functions using .NET languages, and then associate these compiled assemblies with SSAS objects. An example of this type of extension is a function that you write to perform some task that is common to your BI project. There are some examples on CodePlex (http://www.codeplex.com), including a project named Analysis Services Stored Procedure Project. One assembly from this CodePlex project is called Parallel. It contains two
functions: ParallelUnion and ParallelGenerate. These functions allow two or more set operations to be executed in parallel to improve query performance for calculation-heavy queries on multiprocessor servers. After you write an assembly, you must associate it with an SSAS server or database instance. To associate an assembly with the SSAS server, you can either use SSMS or BIDS. If you’re using BIDS, you associate an assembly with an SSAS database instance by right-clicking the Assembly folder in BIDS and then configuring the code access security permissions (for .NET assemblies only) and the security context information via the properties pane after you define the path to the assembly. In SSMS, the dialog box to associate and configure assemblies contains settings for you to configure the Code Access Security (CAS) and the security context (impersonation) for the assembly. Figure 7-13 shows the dialog box from SSMS.
Figure 7-13 Custom assemblies allow you to add custom logic to SSAS.
Note that four assemblies are associated by default with an SSAS server instance: ExcelMDX, System, VBAMDX, and VBAMDXINTERNAL. What is interesting is that the MDX core query library is implemented via these assemblies. The MDX function library bears a strong resemblance to the Microsoft Office Excel function library. This is, of course, by design because functions you use with SSAS objects are used for calculation. The difference is the structure of the target data source. Creating custom assemblies is an advanced topic. In all of our BI projects, we've used custom assemblies with only one client. We recommend that if you plan to implement this approach, you thoroughly review all the samples on the CodePlex site first. For more information, see the SQL Server Books Online topic "Assemblies (Analysis Services – Multidimensional Data)."
Building OLAP Cubes in BIDS We’re ready now to move to a core area of BIDS—the cube designer. We’ve got a couple of items to review before launching into building our first OLAP cube. We’ll use our two instances of BIDS to look at the development environment in two situations. You’ll remember that the first instance is a disconnected, blank environment and the second is working with an existing, connected cube. We’ll also talk about the uses of the Cube Wizard. Surprisingly, it has been designed to do more than simply assist you with the OLAP cube-building process. To understand this, we’ll start by right-clicking on the Cubes folder in Solution Explorer for the BIDS instance that is blank (the one that is disconnected). Choosing the New Cube option opens the Cube Wizard. The first page of the wizard is purely informational, and selecting Next takes you to the next page. Note that this page of the wizard, shown in Figure 7-14, has three options available for building an OLAP cube: ■■
Use Existing Tables
■■
Create An Empty Cube
■■
Generate Tables In The Data Source (with the additional option of basing this cube on any available source XMLA templates)
Figure 7-14 To create a cube based on existing tables, you must first create a DSV for that data source.
You might be wondering why the option that you’d probably want to use at this point to create a new cube—that is, Use Existing Tables—is grayed out and not available. This is because we have not yet defined a DSV in this project. As we mentioned, if you have administrative permissions on the source servers for your cube, it might not be obvious to you that you
need to create a DSV because you can just make any changes you want directly in that source data. These changes can include adding defined subsets of data, adding calculated columns, and so on, and they are usually implemented as relational views. As mentioned, DSVs exist to support SSAS developers who do not have permission to create new items directly in source data. Whether you do or don't have that permission, you should know that a DSV is required for you to build an OLAP cube that uses data from your defined data source. To define a DSV, right-click the Data Source Views folder in Solution Explorer, click New Data Source View, select an existing data source, add all the appropriate tables from the source, and then complete the wizard. This makes the tables and views available to your DSV. After you create a DSV and rerun the Cube Wizard, the Use Existing Tables option in the Cube Wizard becomes available. This is the usual process you take to create production cubes. Before we create a cube based on existing tables, however, let's first take a minute to understand what the other two options in the Cube Wizard do. You can use the Create An Empty Cube option in two different ways. First, if you create an empty cube and do not base it on a DSV, you can later associate only existing dimensions from another database with this new cube. These dimensions are called linked dimensions. Second, if you create an empty cube and do base it on a DSV, you can create new measures based on columns from fact tables referenced by the DSV. As with the first case, you can associate only existing dimensions with this newly defined cube. The purpose of the Create An Empty Cube option is for you to be able to create new cubes (by creating new measures) and then associate those measures with existing dimensions in a project. So why would you want to make more than one cube in a project? One reason is to perform quick prototyping of new cubes. We'll explore the answer to that question in more depth later. The Generate Tables In The Data Source option can also be used in two different ways. You can choose either not to use template files as a basis for your cube, or to base your cube on source templates. SSAS ships with sample source templates (based on the Adventure Works DW 2008 sample OLAP cube structure) for both editions: Enterprise and Standard. These templates contain the XMLA structural metadata that defines cube measures and dimensions. You can use this as a basis and then modify the metadata as needed for your particular BI project. These template files are located by default at C:\Program Files\Microsoft SQL Server\100\Tools\Templates\olap\1033\Cube Templates.

Note On x64 systems, substitute "Program Files (x86)" for "Program Files" in the path referenced above.
200
Part II
Microsoft SQL Server 2008 Analysis Services for Developers
Figure 7-15 shows one of the pages that the wizard presents if you choose to use a template. On the Define New Measures page of the Cube Wizard, you can select measures from the template to include in your cube, create new measures, or both. This wizard also includes a similar page that allows you to select existing dimensions, create new dimensions, or both. On the last page of the wizard, you are presented with the Generate Schema Now option. When selected, this option allows you to generate source RDBMS code so that you can quickly create a star schema structure into which original source data can be loaded through the ETL process. In the case of a SQL Server source, Transact-SQL data definition language (DDL) code is generated. If you select the Generate Schema Now option, the Generate Schema Wizard opens after you click Finish in the Cube Wizard. You'll then be presented with the option of generating the new schema into a new or existing DSV.
Figure 7-15 Using the Cube Wizard with an included template
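To make the generated-schema idea concrete, the DDL involved for a SQL Server target is plain Transact-SQL along the following lines. This is a minimal hand-written sketch, not actual wizard output; the table and column names are our own.

    CREATE TABLE dbo.DimDate (
        DateKey       int          NOT NULL PRIMARY KEY,  -- surrogate key
        FullDate      date         NOT NULL,
        MonthName     nvarchar(20) NOT NULL,
        CalendarYear  smallint     NOT NULL
    );

    CREATE TABLE dbo.DimProduct (
        ProductKey    int          NOT NULL PRIMARY KEY,  -- surrogate key
        ProductAltKey nvarchar(25) NOT NULL,              -- original source key
        ProductName   nvarchar(50) NOT NULL
    );

    CREATE TABLE dbo.FactSales (
        DateKey       int      NOT NULL REFERENCES dbo.DimDate (DateKey),
        ProductKey    int      NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        OrderQuantity smallint NOT NULL,
        SalesAmount   money    NOT NULL
    );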
So why would you select the Generate Tables In The Data Source option? It allows you to model a cube using BIDS without associating it with a data source. In other words, you can use BIDS as a design environment for OLAP modeling. As mentioned earlier in this book, using BIDS in this way assumes that you have a thorough understanding of OLAP modeling concepts. If you do have this understanding, using BIDS in this way can facilitate quick construction of empty prototype star schemas (or DSVs). We sometimes use this method (BIDS, create cube, with no associated DSV) to create empty star schema tables during the prototyping phase of our BI projects rather than using a more generic database modeling tool such as Visio. At this point in our exploration, we'd rather create a cube based on a data source. Be aware that the Adventure Works DW 2008 sample is modeled in a traditional star-schema-like way. We'll use this as a teaching tool because it shows the BIDS Cube Wizard in its best light. A
rule of thumb to facilitate rapid development is that you start with a star schema source, as much as is practical, and then deviate from it when business requirements justify doing so. We’ll follow this best practice in our learning path as well. Create a quick DSV in the disconnected instance by right-clicking on the data source view container, setting up a connection to the AdventureWorksDW2008 relational database, and then selecting all tables and views without making any adjustments. Next double-click the newly created DSV and review the tables and views that have been added. As we proceed on our OLAP cube creation journey, we’ll first review the sample OLAP cube and then build similar objects ourselves in our second BIDS instance.
Examining the Sample Cube in Adventure Works

To get started, double-click the Adventure Works cube in Solution Explorer to open the cube designer in BIDS. As you've seen previously for some other SSAS objects, such as roles, opening any BIDS designer reveals a wealth of tabs. We've shown the cube-related tabs in Figure 7-16. These tabs are named as follows: Cube Structure, Dimension Usage, Calculations, KPIs, Actions, Partitions, Aggregations, Perspectives, Translations, and Browser.
Figure 7-16 The available tabs in the cube designer
The only cube tab we’ve looked at previously is the Browser tab. You might recall from our discussions in an earlier chapter that the Browser tab options serve as a type of pivot table control. The Browser tab items allow you to view your OLAP cube in an end-user-like environment. For the remainder of this chapter, as well as for future chapters, we’ll explore each of these tabs in great detail. You’ll note also that there is an embedded toolbar below each tab. In addition, the designer surface below each tab contains shortcut (right-click) menus in many areas. You can, of course, always use the main menus in BIDS; however, we rarely do so in production, preferring to use the embedded toolbars and the internal shortcut menus. This might seem like a picky point, but we’ve found that using BIDS in this way really improves our productivity. Let’s take a closer look at the Cube Structure tab. (See Figure 7-17.) To the left, you’ll see a metadata browser, which is similar to the one you saw when executing MDX queries in SSMS. It includes a section for the cube measures and another one for the cube dimensions.
Figure 7-17 The BIDS Cube Structure tab contains a metadata browser.
Confusingly, the designer surface is labeled Data Source View. Weren't we just working in a separate designer for that object? Yes, we were. Here's the difference. In the previous DSV designer, you selected tables, columns, and relationships to be made available from source data for loading into an OLAP cube destination structure. In the Cube Structure tab's Data Source View area, you can review the results of that data load. You can also make changes to the measures and dimensions on the cube side. We'll give you a concrete example to help clarify. Rows from the source fact tables become measure values in the OLAP cube. Fact values can originate in one of two ways. The first way is a straight data load, which simply copies each row of data from the source fact table. The second way is a derived fact—a calculation that is applied at the time of cube load, written in a query language that the source RDBMS understands. So, if you're using SQL Server, a Transact-SQL expression can be used to create a calculated value. This type of action is defined in the DSV. The calculation is performed when the source data is loaded—that is, when it is copied and processed into the destination OLAP cube structure—and the resultant calculated values are stored on disk in the OLAP cube. You also have the option of creating a calculated measure on the OLAP cube side. This is defined using MDX, is calculated at OLAP query time, and is not stored in the OLAP cube. You use the DSV area on the Cube Structure tab to create regular measures (that is, measures based on any column in any fact table in the referenced DSV) or calculated measures.
Calculated measures are quite common in cubes. The SSAS engine is optimized to process them quickly at query time. They are used when you have measures that will be needed by a minority of users and when you want to keep space used on disk to a minimum. You’ll often have 50 or more calculated measures in a single production cube. Calculated measures are indicated in the metadata tree by a small, blue square with the function notation (that is, fx) on the measure icon. This is shown in Figure 7-18.
Figure 7-18 Calculated measures are created using MDX expressions and are indicated by the fx notation.
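For context, a calculated measure is defined on the Calculations tab as an MDX expression. A hedged sketch follows; the measure names [Sales Amount] and [Total Product Cost] exist in the Adventure Works sample, but verify them against your own cube before reusing this.

    // Evaluated at query time; nothing is added to cube storage
    CREATE MEMBER CURRENTCUBE.[Measures].[Gross Profit Margin]
     AS ([Measures].[Sales Amount] - [Measures].[Total Product Cost])
        / [Measures].[Sales Amount],
     FORMAT_STRING = 'Percent',
     VISIBLE = 1;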
When you’re working on the designer surface in the cube designer, if you want to make any structural changes to the underlying DSV, you can. You simply right-click on any table that you’ve added to this view, click Edit Data Source View to open the DSV editor, and make your changes there. So you can think of the data source view area in the cube designer as a viewer rather than an editor. This chained editing is common throughout BIDS. Because of this, it can be quite easy to get lost at first! We advise you to start out slowly. Once you get the hang of it, chained editing will make sense and save you time. You’ll also see the chained editing paradigm repeated in the lower half of the metadata browser, the Dimensions section. Here you can see the dimension names, hierarchy names, and level names. However, you cannot browse the dimensional data or edit the dimension structure. So what is the purpose of listing the dimension information here if you can’t make any structural changes? And just where do you make structural changes? The configuration options available to dimensions, hierarchies, and attributes (levels) on the Cube Structure tab are limited to specific options regarding how the cube will use that dimension, some of which have an effect on processing. We haven’t begun to talk about cube processing yet, so we’ll cover these options in detail later in this book. So where do you develop dimensions? You use the dimension editor. As you might have guessed by now, you can take a shortcut directly to the dimension editor from the Cube Structure tab. In the Dimensions section, click the link in the particular dimension you want to edit. In Figure 7-19, we’ve opened the Customer dimension to show the Edit Customer link, as well as the Customer Geography hierarchy and the Attributes container.
Figure 7-19 The Dimensions section on the Cube Structure tab contains links to the dimension editor.
Expand the Customer dimension, and click the Edit Customer link in the Dimensions section to open the dimension editor workspace in BIDS. We’ll now take a slight detour into the world of dimensions. After we complete this, we’ll return to the cube designer.
Understanding Dimensions

So far, we've addressed the underlying OLAP cube as a single star schema—that is, fact tables plus dimension tables. This is a bit of an oversimplification. Although it's possible to base an OLAP cube on such a simple structure, in our experience business requirements and source data often introduce complexities. One of those complexities is the need for distinct sets of permissions or processing settings at the cube level. If this is an unavoidable case (and one that is justified by business requirements), you can create multiple OLAP cubes. Because this situation is common, we usually begin our cube design with dimensions and then proceed to measures. Our goal is always to create a single cube. If that is not practical, the shared nature of dimensions is quite useful. Let's provide a business example to give some context to this discussion. Suppose that you have requirements to create OLAP solutions for two distinct end-user communities for a retail chain. These communities are financial analysts and store management. The complexity of what you choose to present will vary greatly for each group. Although you could create views of a single cube (called perspectives), you might also have different data update frequency requirements. For example, the analysts might require data current as of the previous day, and managers might require data current as of the previous hour. As a result of these
varying requirements (and, often, as a result of other requirements beyond these), you elect to create two separate cubes. Rather than starting with fact tables and measures, you should start by attempting to create common dimensions. Examples often include Time, Customer, and Product dimensions.

Tip As a rule of thumb, when designing and building cubes, start by building the dimensions.

Given this example, let's now examine the dimension editor in BIDS. We've opened the Customer dimension from the Adventure Works DW 2008 cube sample for our discussion. As with the cube designer, when you open the dimension editor, you see the tab and embedded toolbar structure that is common to BIDS. We show this in Figure 7-20. The tab names here are Dimension Structure, Attribute Relationships, Translations, and Browser. As with the cube designer, the only tab we've examined to this point in the process is the Browser tab.
Figure 7-20 The Dimension Structure metadata tab in the dimension editor.
The Dimension Structure tab contains three sections: Attributes, Hierarchies, and Data Source View. As with the Cube Structure tab, the Data Source View section on the Dimension Structure tab lets you view only the source table or tables that were used as a basis for creating this dimension. If you want to make any changes in those tables (for example, adding calculated columns) you right-click on any table in that section and then click Edit Data Source View, which opens that table in the main BIDS DSV editor. The Attributes section shows you a list of defined attributes for the Customer dimension. Note that these attributes mostly have the same names as the source columns from the Customer table. In this section, you can view and configure properties of dimensional attributes. As you build a dimension, you might notice squiggly lines underneath some attributes. This is the new Analysis Management Objects (AMO) design warning system in action. Although you can’t see color in a black-and-white book, these lines are blue when viewed on your screen. If you hover your mouse over them, the particular design rule that is being violated appears in a tooltip. As mentioned earlier, these warnings are for guidance only; you can choose to ignore them if you want. Microsoft added these warnings because many customers failed to follow best OLAP design practices when building cubes in SSAS 2005 and this resulted in cubes that had suboptimal performance in production environments. During our discussion of DSVs, we mentioned a new design approach in BIDS 2008—one based on exclusivity rather than inclusivity. This approach has also been applied to dimension
design. Although you’ll still use a wizard to create dimensions, that wizard will reference only selected source columns as attributes. In the past, the wizard automatically referenced all source columns and also attempted to auto-detect hierarchies (or summary groupings) of these attributes. You must now manually create all hierarchies and relationships between attributes in those hierarchies. Attribute hierarchies were discussed earlier; however, some information bears repeating here. Missing and improper hierarchy definitions caused poor performance in many cubes built using SSAS 2005.
Attribute Hierarchies

Rollup or dimension attribute hierarchy structure design should be driven by business requirements—that is, by building structures that reflect answers to the question, "How do you roll up your data?" Time is probably the simplest dimension to use as an example. After determining the lowest level of granularity needed—that is, days, hours, minutes, and so on—the next question to ask is, "How should the information be rolled up?" Again, using the Time dimension, the answer is expressed in terms of hours, days, weeks, months, and so on. Creating appropriate attribute hierarchies makes your cubes more usable in a number of ways. First, end users will understand the data presented and find it more useful. Second, if the data is properly designed and optimized (more about that shortly), cube query performance will be faster. Third, cube processing will be faster, which will make the cube available more frequently. SSAS supports two types of attribute hierarchies: navigational and natural. What do these terms mean, and what are the differences between them? Navigational hierarchies can be created between any attributes in a dimension. The data underlying these attributes need not have any relationship to one another. These hierarchies are created to make the end user's browsing more convenient. For example, you can design the Customer dimension in your cube to be browsed by gender and then by number of children. Natural hierarchies are also created to facilitate end-user browsing. The difference between these and navigational hierarchies is that in natural hierarchies the underlying data does have a hierarchical relationship based on common attribute values. These relationships are implemented through attribute relationships, which are discussed in the next section. An example of this is in our Date dimension—months have date values, quarters have month values, and so on. Because of the importance of this topic and because many customers weren't successful using BIDS 2005, Microsoft has redesigned many parts of the dimension editor. One place you'll see this is in the Hierarchies section of the Dimension Structure tab of the editor, which is shown in Figure 7-21. Notice that the Date dimension has four hierarchies defined, two of
which are visible in Figure 7-21: Fiscal and Calendar. Creating more than one date hierarchy is a common strategy in production cubes.
Figure 7-21 The dimension editor Hierarchies section lists all hierarchies for a particular dimension.
The areas where you work to create the different types of hierarchies have changed in BIDS 2008. There are two types of hierarchies: navigational (where you can relate any attributes) and natural (where you relate attributes that have an actual relationship in the source data, usually one-to-many). In BIDS 2005, you created and configured hierarchies in the Hierarchies section and attribute relationships in the Attributes section. In BIDS 2008, you still create hierarchies here by dragging and dropping attribute values from the Attributes section, and you can also rename hierarchies and configure some other properties. However, Microsoft has created a new attribute relationship designer in the dimension editor to help you visualize and build effective attribute relationships between values in your dimensional hierarchies.
Attribute Relationships

Before we examine the new attribute relationship designer, let's take a minute to define the term attribute relationship. We know that attributes represent aspects of dimensions. In the example we used earlier of the Date dimension (illustrated by Figure 7-21), we saw that we have values such as Date, Month Name, Fiscal Quarter, Calendar Quarter, and so on. Hierarchies are roll-up groupings of one or more attributes. Most often, measure data is aggregated (usually summed) in hierarchies. In other words, sales amount can be aggregated by day, then by month, and then by fiscal quarter. Measure data is loaded into the cube via rows in the fact table. These are loaded at the lowest level of granularity. In this example, that would be by day. It's important to understand that the SSAS query processing engine is designed to use or calculate aggregate values of measure data at intermediate levels in dimensional hierarchies. For example, it's possible that an MDX query to the Fiscal Quarter level for Sales Amount could use stored or calculated aggregations from the Month level.
If you’re creating natural hierarchies, the SSAS query engine can use intermediate aggregations if and only if you define the attribute relationships between the level members. These intermediate aggregations can significantly speed up MDX queries to the dimension. To that end, the new Attribute Relationships tab in the dimension editor lets you visualize and configure these important relationships correctly. Figure 7-22 shows this designer for the Date dimension.
Figure 7-22 The Attribute Relationships tab is new in BIDS 2008.
The main designer shows you the attribute relationship between the various levels in the defined hierarchies. The bottom left section lists all defined attributes for this dimension. The bottom right section lists all relationships. Attribute relationships have two possible states: flexible or rigid. Flexible relationships are indicated by an open arrow (outlined), and rigid relationships are indicated by a solid black arrow. The state of the attribute relationship affects how SSAS creates and stores aggregations when the dimension is processed. Dimension processing means that data is loaded from source locations into the dimension destination structure. If you define a relationship as rigid, previously calculated aggregations are retained during subsequent dimension processing. This will, of course, speed up dimension processing. You should only do that when you do not expect dimension data to change. Date information is a good example of rigid data, and you’ll note that all relationships have been defined as rigid in all hierarchies. In other words, if you never expect to update or delete values from source data, as would be the case in a date hierarchy between, for example, month and year names, you should define the attribute relationship as rigid. Inserting new data is not considered a change in this case, only updating or deleting data. On the other hand, if data could be updated or deleted—for example, in the case of customer surnames (women getting married and changing their surnames)—you should define the attribute relationship as flexible.
For more information about this topic, see the “Defining Attribute Relationships” topic in SQL Server Books Online. Modeling attribute relationship types correctly also relates back to dimensional data modeling. As you might recall, we discussed the importance of capturing business requirements regarding whether or not changes to dimension data should be allowed, captured, or both. As mentioned, an example of dimension data that is frequently set to changing or flexible is the Last Name attribute (for customers, employees, and so on). People, particularly women, change their last names for various reasons, such as marriage and divorce.
Translations The Translations tab allows you to provide localized strings for the dimension metadata. In Figure 7-23, we’ve collapsed the individual Attributes localizations so that you can see that you can provide localized labels for defined attribute hierarchy level names as well. We’ll talk a bit more about localization of both dimension and measure data in Chapter 9, “Processing Cubes and Dimensions.” Keep in mind that what you’re doing in this section is localizing metadata only—that is, labels. BIDS actually has some nifty built-in wizards to facilitate data localization (particularly of currency-based measures). We’ll cover that topic in the next chapter.
Figure 7-23 The Translations tab allows you to provide localized labels for dimension metadata.
Also, note that you can preview your dimension localization on the Browser tab by selecting one of the localization types in the Language drop-down menu. In our case, we’ve selected Spanish, as shown in Figure 7-24.
Figure 7-24 The dimension editor Browser tab allows you to preview any configured localizations.
Now that we’ve taken a first look at dimension building, we’ll return to our tour of the sample OLAP cube. Before we do so, you might want to review the rest of the dimensions used in the Adventure Works DW 2008 sample SSAS project. Nearly all attribute hierarchy relationship types are represented in the sample. Reviewing this sample will also give you perspective before we move to the next part of our explanation, where we’ll look at how the defined dimensions will be used in an OLAP cube. To do this, you double-click on the Adventure Works sample cube in the Cubes folder in Solution Explorer to open the cube designer. Then click the Dimension Usage tab. We’ll continue our tour there.
Using Dimensions After you’ve created all the dimensions (including defining hierarchies and attribute relationships) you need to meet your business requirements, you’ll begin to combine those dimensions with measures (derived from fact tables) to build OLAP cubes. Microsoft made a major change to the possible cube structure in SSAS 2005. In our experience, most customers haven’t fully grasped the possibilities of this change, so we’ll take a bit of time to explain it here. In classic dimensional source modeling for OLAP cubes, the star schema consists of exactly one fact table and many dimension tables. This was how early versions of Microsoft’s OLAP tools worked as well. In other words, OLAP developers were limited to basing their cubes on a single fact table. This design proved to be too rigid to be practical for many business scenarios. So starting with SSAS 2005, you can base a single cube on multiple fact tables that are related to multiple dimension tables. This is really a challenge to visualize! One way to think of it is as a series of flattened star schemas. Multiple dimension tables can be used by multiple cubes, with each cube possibly having multiple fact tables.
So how do you sort this out? Microsoft has provided you with the Dimension Usage tab in the cube designer, and we believe it’s really well designed for the task. Before we explore the options on the Dimension Usage tab, let’s talk a bit more about the concept of a measure group. You might recall from the Cube Structure tab shown in Figure 7-16 that measures are shown in measure groups. What exactly is a measure group?
Measure Groups A measure group is a container for one or more measures. Measure groups are used to relate groups of measures to particular dimensions in an OLAP cube. For this reason, all measures common to a group need to share a common grain—that is, the same set of dimensions at the same level. What that means is that if the measures Internet Sales Amount and Internet Order Quantity both need to expose measure values at the “day” grain level for the Date dimension, they can be placed in a common measure group. However, if Internet Sales Amount needs to be shown by hours and Internet Order Quantity needs to be shown only by days, you might not put them in the same measure group. Just to complicate the situation further, you could still put both measures in the same group if you hide the hourly value for the Internet Sales Amount measure! This would give them the same set of dimensions, at the same level. You might recall that a measure is created in one of three ways. It can be simply retrieved from a column in the fact table. Or it can be derived when the cube is loaded via a query to the data source (for example, for SQL Server via a Transact-SQL query). Or a measure can be calculated at cube query time via an MDX query. The last option is called a calculated measure. It’s common to have dozens or even hundreds of measures in a single cube. Rather than forcing all that data into a single table, SSAS supports the derivation of measures from multiple-source fact tables. Multiple-source fact tables are used mostly to facilitate easier cube maintenance (principally updating). For example, if SQL Server 2008 source tables are used, an administrator can take advantage of relational table partitioning to reduce the amount of data that has to be physically operated on during cube processing. In the case of the Adventure Works cube, you’ll see by reviewing the source DSV in the designer that seven fact tables are used as source tables. Three of these fact tables are created via Transact-SQL queries. (Similar to relational views, the icon in the metadata viewers indicates that the tables are calculated.) So there are four physical fact tables that underlie the 11 measure groups in the Adventure Works cube. Armed with this knowledge, let’s take a look at the Dimension Usage tab of the OLAP cube designer. A portion of this is shown in Figure 7-25. The first thing to notice is that at the intersection of a particular dimension and measure group, there are three possibilities. The first is that there is no relationship between that particular dimension’s members and the measures
in the particular measure group. This is indicated by a gray rectangle. An example of this is the Reseller dimension and the Internet Sales measure group. This makes business sense because at Adventure Works, resellers are not involved in Internet sales.
Figure 7-25 The Dimension Usage tab allows you to configure relationships between dimensions and measure groups.
The next possibility is that there is a regular or star-type relationship between the dimension and measure data. This is indicated by a white rectangle at the intersection point with the name of the dimension shown on the rectangle. An example of this is shown for the Date dimension and the Internet Sales measure group. The last possibility is that there is some type of relationship other than a standard star (or single dimension table source) between the dimension and the measure group data. This is indicated by a white rectangle with some sort of additional icon at the intersection point. An example of this is at the intersection of the Sales Reason dimension and the Internet Sales measure group. To examine or configure the type of relationship between the dimension data and the measure group data, you click on the rectangle at the intersection point. After you do so, a small gray build button appears on the right side of the rectangle. Click that to open the Define Relationship dialog box. The dialog box options vary depending on the type of relationship that you’ve selected. The type of relationship is described, and there is a graphic displayed to help you visualize the possible relationship types.
The possible types of relationships are as follows:
■■ No Relationship Shown by gray square
■■ Regular Star (or single dimension table) source
■■ Fact Source column from a fact table
■■ Referenced Snowflake (or multiple dimension tables) as source
■■ Many-to-Many Multiple source tables, both fact and dimension, as source
■■ Data Mining Data mining model as source data for this dimension
Notice in Figure 7-26 that the Select Relationship Type option is set to Regular. As mentioned, you also set the Granularity Attribute here. In this case, Date has been selected from the drop-down list. Note also that the particular columns that create the relation are referenced. In this case, DateKey is the new, unique primary key in the source DimDate dimension table and OrderDateKey is the foreign key in the particular source fact table.
Figure 7-26 The Define Relationship dialog box allows you to verify and configure relationships between dimensions and measures.
It’s interesting to note that after you click the Advanced button in the Define Relationship dialog box, you’ll be presented with an additional dialog box that enables you to configure null processing behavior for the attributes (levels) in the dimension. Note also that if you’ve defined a multiple column source key for the attribute, this is reflected in the Relationship section of this advanced dialog box as well, as shown in Figure 7-27.
Figure 7-27 The Measure Group Bindings dialog box allows you to define null processing behavior.
Your choices for controlling null processing behavior at the attribute level are as follows:
■■ Automatic (the default) Converts numeric nulls to 0 and string nulls to empty strings for OLAP cubes, and follows the configured UnknownMember property behavior for DM structures.
■■ Preserve Preserves the null value as null. (We do not recommend using this setting.)
■■ Error Attempts to load nulls generate an exception. If it’s a key value (or a primary key from the source table), the configured value of NullKeyNotAllowed determines behavior (possibilities are IgnoreError, ReportAndContinue, and ReportAndStop).
■■ UnknownMember Relies on two related property settings—the Unknown Member visibility property (set by default to None, but configurable to Visible or Hidden) and the UnknownMemberName property (set to the string Unknown by default).
■■ ZeroOrBlank Same as Automatic for OLAP cubes.
We would be remiss if we didn’t mention that it’s a much better practice to trap and eliminate all nulls during your ETL process so that you don’t have to consider the possibility of nulls while loading your dimensions and measures.
Beyond Star Dimensions We hope that you’ve considered our advice and will base the majority of your dimensions on a single source table so that you can use the simple star schema design (defined as Regular) described earlier. If, however, you have business-justified reasons for basing your dimensions on non-star designs, many of these varieties can be supported in SSAS. Let’s take a look at them, in priority order of use.
Snowflake Dimension The most common variation we see in production cubes is the snowflake design. As mentioned, this is usually implemented via a dimension that links to the fact table through another dimension. To establish this type of relationship on the Dimension Usage tab of the cube designer, you simply select the Referenced relationship type. After you do that, you need to configure the dialog box shown in Figure 7-28. In the example shown, we’re defining the relationship between the Geography dimension and the Reseller Sales measure group. To do this, we choose to use an intermediate dimension table (Reseller).
Figure 7-28 The Define Relationship dialog box allows you to define referenced relationships.
The most common business reason for using separate tables deals with the changeability and reuse of the data. In our example, geography information is static and will be reused by other dimensions. Reseller information is very dynamic. So the update behavior is quite different for these two source tables.
To create a referenced (or snowflake) dimension, you must model the two source tables with a common key. In this case, the primary key of the Geography table is used as a foreign key in the Reseller table. Of course, there must also be a key relationship between the Reseller dimension table and the Reseller Sales fact table. That attribute (key value) is not shown here. If you want to examine that relationship, you can either add all three tables to the Data Source View section of the Cube Structure tab or open the DSV for the cube. Finally, note that the Materialize option is selected, which is the default setting. This setting tells the SSAS instance to persist any aggregations (that are designed during processing) to disk when the dimension is processed. This option is on by default so that better MDX query performance can be realized when the dimension is queried.
Fact Dimension A fact dimension is based on an attribute or column from the fact table. The most common business case is modeled in the sample—that is, order number. You simply set the relationship type to Fact by selecting that value from the Select Relationship Type drop-down list in the Define Relationship dialog box, and then select the source column from the selected fact table. Although it’s easy to implement, this type of dimension can have a significant impact on your cube performance, as the following paragraphs explain. Usually fact tables are far larger (that is, they contain more rows) than dimension tables. The reason for this should be obvious. Let’s take the example of sales for a retail store. A successful business should have customers who make many repeat purchases, so there are relationships of one customer to many purchases, or one row in a (customers) dimension table to many rows in a (sales transactions) fact table. For this reason, anything that makes your source fact tables wider—that is, adds columns to add information about each order—should have a business justification in your model. As we mentioned earlier, it’s a best practice to model your fact tables narrowly for best load and query performance. To summarize, although you can easily use fact columns as dimensions, do so sparingly.
Many-to-Many Dimension The complex many-to-many option, which was added at the request of enterprise customers, extends the flexibility of a model but can be tricky to use. In particular, strict adherence to a source modeling pattern makes implementing this type much clearer. First, we’ll give you the business case from Adventure Works DW 2008 as a reference. Shown in Figure 7-29 is the configuration dialog box for this option.
Figure 7-29 The many-to-many relationship type is quite complex and should be used sparingly.
In the sample, the many-to-many relationship is used to model Sales Reasons for Internet Sales. It works nicely for this business scenario. To better understand the required modeling, we’ll show the three source tables involved in this relationship in the DSV for the cube: Internet Sales Facts, Internet Sales Reason Facts, and Sales Reason. It’s quite easy to see in Figure 7-30 that Internet Sales Reason Facts functions like a relational join (junction) table. It contains only keys—a primary key and foreign keys for both the Internet Sales Facts and Sales Reason tables. The join table (or intermediate fact table, as it’s called here) establishes a many-to-many relationship between sales reasons and particular instances of Internet sales. In plain English, the business case is such that there can be more than one sales reason for each Internet sale, and these sales reasons are to be selected from a finite set of possible reasons. We’ll cover the last type of dimension relationship, data mining, in Chapter 13, “Implementing Data Mining Structures.” Although we’ve not yet looked at all aspects of our sample cube, we do have enough information to switch back to our disconnected, blank instance and build our first simple cube. In doing that, we’ll finish this chapter. In the next chapter, we’ll return to reviewing our sample to cover the rest of the tabs in the BIDS cube designer: Calculations, KPIs, Actions, Partitions, Aggregations, Perspectives, and Translations. We’ll also cover the configuration of advanced property values for SSAS objects in the next chapter. We realize that by now you’re probably quite eager to build your first cube!
Figure 7-30 The many-to-many relationship type requires two source fact tables. One is a type of junction table.
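Before moving on, here is a quick way to see the many-to-many behavior from Figure 7-30 in query results. Because a single Internet sale can carry several sales reasons, the individual reason rows can legitimately sum to more than the [All] total. This sketch uses names from the Adventure Works sample; adjust them for your own cube.

    SELECT
        { [Measures].[Internet Sales Amount] } ON COLUMNS,
        -- AllMembers returns the [All] member plus each individual sales reason
        [Sales Reason].[Sales Reason].AllMembers ON ROWS
    FROM [Adventure Works];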
Building Your First OLAP Cube We’ll now work in the blank disconnected instance and create a cube based on the AdventureWorksDW2008 relational source star schema and data source view that you’ve already created. Although we took quite a bit of time to get to this point, you’ll probably be surprised at how easy creating a cube using BIDS actually is. The interface is really quite easy to use after you gain an understanding of OLAP concepts and modeling. This is why we’ve taken such a long time to get to this point. Probably the biggest issue we saw with customers using SSAS 2005 was overeagerness to start building coupled with a lack of OLAP knowledge. This often produced undesirable results. In fact, we were called on more than one occasion to fix deployed cubes. Most often these fixes were quite expensive. Usually, we started over from scratch. We’ll launch the Cube Wizard by right-clicking on the Cubes folder in Solution Explorer. You should choose the Use Existing Tables option and then select the DSV you created earlier.
Selecting Measure Groups As you work with the Cube Wizard pages, you need to select measure group tables from the list of tables included in the DSV first. If you’re unsure which tables to select, you can click the Suggest button on the Select Measure Group Tables page. The results of clicking that button are shown in Figure 7-31.
Figure 7-31 The Select Measure Group Tables page of the Cube Wizard contains a Suggest button.
Although the suggest process is fairly accurate, you’ll still want to review and adjust the results. In this case, all source tables whose names include the word Fact, as well as most of the views, were selected. We’ll clear the check boxes for all the selected views (vAssocSeqLineItems, vAssocSeqOrders, vDMPrep, and vTargetMail), ProspectiveBuyer, and the DimReseller table. Then we’ll proceed to the next page of the wizard. This page displays all possible measures from the previously selected measure group tables. You’ll want to scroll up and down to review the measures. In a production situation, you’d want to make sure that each selected measure was included as a result of business requirements. Consider our earlier discussion about the best practice of keeping the fact tables as narrow as possible so that the size of the table doesn’t become bloated and adversely affect cube query and cube processing times. For our sample, however, we’ll just accept the default, which selected everything, and proceed to the next step of the wizard.
Adding Dimensions On the next page of the wizard, BIDS lists required dimensions from the source DSV. These dimensions are related to the measures that you previously selected. If you attempt to clear the check box for one of the required dimensions, the wizard displays an error that explains which measure requires the particular dimension that you’ve cleared (so that you can click the Back button in the wizard and reselect that measure if you want). The wizard does not allow you to proceed until all required dimensions are selected. As mentioned previously, hierarchies are not automatically created. We did see an exception to this in that the product/subcategory/category snowflake dimension was recognized as a snowflake and created as such. Figure 7-32 shows some of the dimensions that the wizard suggested.
Figure 7-32 The Cube Wizard suggests dimensions based on the measures you previously selected.
On the last page of the wizard, you can review the metadata that will be created. It’s also convenient to remember that at any point you can click the Back button to make corrections or changes to the selections that you made previously. Give your cube a unique name, and click Finish to complete the wizard. We remind you that you’ve just generated a whole bunch of XMLA metadata. If you want to review any of the XMLA, right-click on any of the newly created objects (that is, cubes or dimensions) in Solution Explorer and then click View Code.
You’ll probably want to review the Dimension Usage tab of the cube designer. Do you remember how to open it? That’s right, just double-click the cube name in Solution Explorer, and then click the Dimension Usage tab. Take a look at how the Cube Wizard detected and configured the relationships between the dimensions and the measures. It’s quite accurate. You’ll want to review each configuration, though, particularly when you first begin to build cubes. Remember that the AdventureWorksDW2008 relational database source is modeled in a way that works very well with the Cube Wizard. This is particularly true in terms of relationships. The reason for this is naming conventions for key columns. The same name is used for the primary key column and the foreign key column, so the wizard can easily identify the correct columns to use. In cases where the key column names don’t match, such as between Dim Date (DateKey) and Fact Internet Sales (OrderDateKey), the wizard won’t automatically find the relationship. You should follow these design patterns as closely as you can when preparing your source systems. You’ll also want to make sure that you understand all the non–star dimension relationships that were automatically detected. In our sample, we see several referenced (snowflake) and one fact (table source) relationship in addition to a couple of other types. You’ll have to pay particular attention to correct detection and configuration of non–star dimensional relationships because the wizard doesn’t always automatically detect all these types of relationships. If, for some reason, you need to create an additional dimension, you do that by right-clicking the Dimensions folder in Solution Explorer. This launches the New Dimension Wizard. You should find this tool to be easy to use at this point. You select a source table, which must have a primary key, confirm any related tables that might be detected (snowflake), and then select the source columns that will become attributes in the new dimension. To add the new dimension to an existing cube, simply open the cube in the designer (on either the Cube Structure or Dimension Usage tab), click Add Cube Dimension on the nested toolbar, and then select the new dimension name from the dialog box. You might be anxious to see the results of your work. However, on the Browser tab of the cube designer, you’ll see an error after you click the Click Here For Detailed Information link. The error message will say either “The user, <username>, does not have access to Analysis Services Project x database,” or “The database does not exist.” Do you know which step we haven’t completed yet and why you’re seeing this error? You must build and deploy the cube before you can see the source data loaded into the new metadata structure that you just created. Although you might guess (correctly) that to build and deploy you could just right-click on the new cube you created in Solution Explorer, you should hold off on doing that—we’ve got more to tell you about configuring and refining the cube that you’ve just built. We’ll do that in the next few chapters.
Here are a couple of questions to get you thinking. Take a closer look at the cube and dimensions that you just built. Look closely at the dimension structure. If you open some of the dimensions that were created using the Cube Wizard, you’ll see that they look different than the dimensions in the Adventure Works sample cube that we looked at earlier. Something is missing in our new dimensions. Do you know what it is? The hierarchies haven’t been built. A good example of this is the Dim Date dimension. If you open this dimension in the dimension editor, there is only one attribute and no hierarchies, as shown in Figure 7-33. Do you know the steps to take to create a date dimension in your new cube that is structurally similar to the Date dimension from the Adventure Works DW 2008 sample?
Figure 7-33 The Dimension Structure tab contains no attribute hierarchies by default.
Did you guess that your first step is to add more attributes to the dimension? If so, that’s correct! An easy way to do that is to click on the source columns from the DimDate table and then drag those column names to the Attributes section of the designer. To then create hierarchies of those attributes, you click on the attribute names in the Attributes section and drag them to the Hierarchies section. When you’ve done this, the dimension might look like Figure 7-34. However, notice that we have one of those pesky blue squiggles underneath the hierarchy name. The pop-up warning text reads, “Attribute relationships do not exist between one or more levels of this hierarchy. This may result in decreased query performance.” Do you know how to fix this error?
Figure 7-34 The Dimension Structure tab displays an AMO design warning if no attribute relationships are detected.
If you guessed that you need to use the Attribute Relationships tab to configure the relationship, you are correct (again!). Interestingly, when you view the Attribute Relationships tab, you’ll see that the default configuration is to associate every attribute directly with the key column. To fix this, right-click on the node for Calendar Quarter on the designer surface and then click New Attribute Relationship. You’ll see a dialog box in which you can configure the relationship and then specify the type of relationship (flexible or rigid). Which type of relationship will you choose for these attributes? You’ll likely choose rigid here, as date information is usually not changeable. We have much more to do, but this has been a long chapter. Take a break, fuel up, and continue on with us in the next chapter to dig even deeper into the world of OLAP cubes.
Summary We’ve started our tour of BIDS. We’re certainly nowhere near finished yet. In this chapter, we looked at working with the BIDS SSAS templates. We examined the core objects: data sources, data source views, roles, and assemblies. Next we’ll look at the cube and dimension builders. We’ve got lots more to do, so we’ll move to advanced cube building in the next chapter. Following that, we’ll take a look at building a data mining structure. Along the way, we’ll also cover basic processing and deployment so that you can bring your shiny new objects to life on your SSAS server.
Chapter 8
Refining Cubes and Dimensions
Now that you’ve developed your first OLAP cube, it’s time to explore all the other goodies available in Microsoft Business Intelligence Development Studio (BIDS). You can make your base cube much more valuable by adding some of these capabilities. Be aware that none of them are required, but, most often, you’ll choose to use at least some of these powerful features because they add value for your end users. There’s a lot of information to cover here. In this chapter, we’ll look at calculated members, key performance indicators (KPIs), enabling writeback, and more. We’ll also look at adding objects using the Business Intelligence Wizard. Finally, we’ll look at advanced cube, measure, and dimension properties.
Refining Your First OLAP Cube As we get started, we’ll continue working through the OLAP cube designer tabs in BIDS. To do this, we’ll continue the pattern we started in the previous chapter. That is, we’ll open two instances of BIDS. In the first instance, we’ll work in connected mode, using the Adventure Works sample OLAP cube. In the second instance, we’ll work in offline (or disconnected) mode. For these advanced objects, we’ll first look at what has been provided in the Adventure Works sample cube, and then we’ll create these objects using BIDS. Although we’ll begin to work with the native OLAP query language, MDX, in this chapter, our approach will be to examine generated MDX rather than native query or expression writing. The reason for taking this approach is that, as mentioned, MDX is quite complex, and it’s a best practice to thoroughly exhaust the tools and wizards inside BIDS to generate MDX before you attempt to write MDX statements from scratch.
Note In Chapters 10 and 11, we cover MDX syntax, semantics, expressions, query authoring, and more. There we examine specifics and best practices for using the MDX language.
We review these additions in order of simple to complex because we find that, when first introduced to this material, people absorb it best in this fashion. To that end, we’ll start here with something that should be familiar, because we’ve already covered it with respect to dimensions—that is, translations.
Translations and Perspectives Translations for cube metadata function much like translations for dimension metadata. Of course, providing localized strings for the metadata is really only a small part of localization.
You must remember that here you’re translating only the object labels (that is, measure groups, measures, dimensions, and so on). When your project has localization requirements, these requirements usually also include requirements related to localizing the cube’s data. The requirement to localize currency (money) data is an often-used example. Because this particular translation requirement is such a common need, Microsoft provides the Currency Conversion Wizard to assist with this process. This powerful wizard (which is really more of a tool than a wizard) is encapsulated in another wizard. The metawizard is the Add Business Intelligence Wizard. We’ll review the use of this tool in general later in the chapter, as well as specifically looking at how the Currency Conversion Wizard works. To view, add, or change translations, you use the Translations tab in the OLAP cube designer, which is shown in Figure 8-1. Translation values can be provided for all cube objects. These include the cube, its measure groups, and measures, along with dimensions and other objects, such as actions, KPIs, and calculations.
Figure 8-1 The Translations tab of the OLAP cube designer allows you to provide localized strings for cube metadata.
Note If a cube doesn’t contain a translation for a user’s particular locale, the default language or translation view is presented to that user.
Perspectives are defined subsets of an OLAP cube. They’re somewhat analogous to relational views in an RDBMS. However, a perspective is like a view defined over the entire cube that exposes only specific measures and dimensions. Also, unlike working with relational views, you cannot assign specific permissions to defined perspectives. Instead, they inherit their security from the underlying cube. We find this object useful for most of our projects. We use perspectives to provide simplified, task-specific views (subsets) of an enterprise OLAP cube’s data. Perspectives are easy to create using BIDS. You simply select the items you want to include in your defined perspective. You can select any of these types of items: measure groups, measures, dimensions, dimensional hierarchies, dimensional attributes, KPIs, actions, or calculations. It’s important for you to verify that your selected client tools support viewing of cube data via defined perspectives. Not all client tools support this option. SSAS presents the perspective to the client tool as another cube. To use the perspective, instead of selecting the actual cube (Adventure Works, in the sample), you select the perspective (Direct Sales). To easily view, add, or change perspectives, you use the Perspectives tab in the cube designer. Figure 8-2 shows a partial list of the defined perspectives for the Adventure Works sample cube.
Figure 8-2 The Perspectives tab of the OLAP cube designer allows you to define subsets of cube data for particular user audience groups.
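Because SSAS presents a perspective to client tools as if it were another cube, an MDX query can simply name the perspective in its FROM clause. The sketch below assumes that the sample's Direct Sales perspective exposes the Internet Sales Amount measure and the Product Categories hierarchy; substitute objects that your perspective actually includes.

    SELECT
        { [Measures].[Internet Sales Amount] } ON COLUMNS,
        [Product].[Product Categories].[Category].Members ON ROWS
    -- The perspective name replaces the cube name in the FROM clause
    FROM [Direct Sales];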
Note It’s important to understand that perspectives are not a security mechanism. They are designed purely as a convenience for you to create simplified views of a particular cube. Security permissions assigned to the underlying objects—that is, cube and dimensions, and so on—are enforced when a user browses any defined perspective. In our experience, translations and perspectives are specialized options. In the real world, we’ve implemented these two features only when project specifications call for them. Some of our clients prefer to refrain from using perspectives entirely, while others quite like them. Translations are used when localization is part of the project. The next feature we cover, however, is one that nearly 100 percent of our clients have used.
Key Performance Indicators KPIs are core metrics and measurements related to the most important business analytics. In our experience, we’ve often heard them referred to as “the one (unified) view of the truth for a business.” KPIs are often displayed to end users via graphics on summary pages, such as dashboards or scorecards. Although you set up KPIs to return results as numbers—that is, 1 is good, 0 is OK, and –1 is bad—you generally display these results as graphics (such as a traffic-light graphic with red, yellow, or green selected, or as different colors or types of arrows). The returned number values aren’t as compelling and immediate as the graphics to most users. The KPIs tab in the OLAP cube designer in BIDS has the built-in capacity to display several types of graphics instead of numbers. The KPIs tab includes both a design area and a preview (or browse) area. An important consideration when including KPIs in your OLAP cube is whether or not your selected end-user client applications support the display of KPIs. Both Microsoft Office Excel 2007 and Microsoft Office SharePoint Portal Server 2007 support the display of SSAS OLAP cube KPIs. The reason we make this distinction is that both Excel and SharePoint Portal Server support display of KPIs from multiple sources—OLAP cubes, or Excel or SharePoint Portal Server.
Open the sample Adventure Works cube in BIDS, and click on the KPIs tab. We’ll first use the KPI browser to get an idea of what the sample KPIs are measuring and how they might appear in a client dashboard. The default view of KPIs is the design view. To open the browser view, click the tenth button from the left (the KPI icon with magnifying glass on it) on the embedded toolbar, as shown in Figure 8-3.
Figure 8-3 The BIDS KPI designer includes a KPI viewer.
KPIs consist of four definitions—value, goal, status, and trend—for each core metric. These metrics are defined for a particular measure. Recall that each measure is associated with one particular measure group. The information for these four items is defined using MDX. At this point, we’ll simply examine generated MDX. (In future chapters, we’ll write MDX from scratch.) Following is a list of definitions and example statements for the most frequently used items in KPIs (a hedged sketch of these expressions follows the list):
■■ Value MDX statement that returns the actual value of the metric. For a KPI called Revenue, this is defined as [Measures].[Sales Amount].
■■ Goal MDX statement that returns the target value of the metric. For a KPI called Revenue, this is defined using an MDX Case…When…Then expression.
■■ Status MDX statement that returns Value – Goal as 1 (good), 0 (OK), or –1 (bad). Again, this is defined using an MDX Case…When…Then expression.
■■ Trend MDX statement that returns Value – Goal over time as 1, 0, or –1. Similar to both the Goal and Status values, this also uses an MDX Case expression to define its value. (This statement is optional.)
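Here is that hedged sketch for a hypothetical Revenue KPI (these are not the sample's exact definitions). It assumes a [Measures].[Sales Amount] measure and a [Date].[Fiscal] hierarchy with a [Fiscal Year] level, and it sets the goal at 10 percent growth over the prior fiscal year.

    -- Value expression
    [Measures].[Sales Amount]

    -- Goal expression: 10 percent over the same member in the prior fiscal year
    1.10 *
    ( [Measures].[Sales Amount],
      ParallelPeriod
      (
          [Date].[Fiscal].[Fiscal Year],
          1,
          [Date].[Fiscal].CurrentMember
      )
    )

    -- Status expression: compare the KPI's value to its goal
    Case
        When KPIValue( "Revenue" ) / KPIGoal( "Revenue" ) >= 1    Then 1
        When KPIValue( "Revenue" ) / KPIGoal( "Revenue" ) >= 0.85 Then 0
        Else -1
    End

The Status expression uses the built-in KPIValue and KPIGoal functions so that the value and goal logic is written only once and reused.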
Figure 8-4 shows the KPIs defined in the Adventure Works sample OLAP cube. Note that in addition to showing the actual amounts for the Value and Goal fields, the browser shows a graphical icon to represent state for the Status and Trend fields. Also, this browser allows you to set filters to further test your KPIs. In the example in Figure 8-4, we’ve added a filter to show results for customers who live in Australia. You can also nest KPIs—that is, create parent and child KPIs. If you do this, you can assign a Weight value to the nested KPI. This value shows the child’s percentage of the total contribution to the parent KPI value. An example from this sample is the Operating Profit KPI, which rolls up to be part of the Net Income KPI. In addition to the sample KPIs that are part of the Adventure Works cube, BIDS includes many templates for the more common types of KPIs. To see these, click the Form View button on the embedded toolbar to switch back to design view. Note that, as with other objects, this designer includes both a metadata viewer and a designer surface. If you take a closer look at the metadata browser, you can see that it lists the existing KPIs and includes three sections on the bottom: Metadata, Functions, and Templates. The templates are specific to KPIs. We’ve opened the metadata browser (shown in Figure 8-5) to show the MDX functions that are specific to KPI-building tasks.
Figure 8-4 The BIDS KPI browser allows you to see the KPIs that you design.
Figure 8-5 The BIDS KPI designer includes a metadata browser.
First, look at the included KPI samples and templates. You’ll notice that they’re easy to customize for your particular environment because they both include a sufficient number of comments and are set up as MDX templates. What we mean by this is that they have double chevrons (<< some value >>) to indicate replaceable parameter values. Keep in mind, though, that because of the naming idiosyncrasies of MDX, the more effective way to place object names into any MDX statement is by dragging and dropping the particular object from the metadata browser. As you work with the design area, you’ll probably be happy to discover
that it color-codes the MDX—for example, all built-in functions are brown, comment code is green, and so on. Also, the designer includes basic IntelliSense for MDX functions. In Figure 8-6, you can see the template for the Net Income KPI. To add this KPI to the cube, you can double-click on the Net Income template under the Financial folder in Templates.
Figure 8-6 The BIDS KPI designer includes an MDX syntax color-coding function and IntelliSense, and it supports commenting.
Follow these steps to customize the KPI in our example:
1. Give the KPI a unique name.
2. Select a measure group to associate the KPI with.
3. Define a Value expression in MDX. (A template is provided for us.)
4. Define a Goal expression in MDX. Commented suggestions are provided.
5. Define a Status expression in MDX. Patterned sample MDX code is provided.
6. Optionally, define a Trend expression in MDX. As with the Status expression, patterned sample MDX code is provided.
The Value, Goal, and Status expressions are self-explanatory. The templated code included for the Trend value needs more explanation, and it’s shown in the following code sample. This sample uses the MDX function ParallelPeriod to get a value from the same area of the time hierarchy for a different time period, such as “this fiscal week last year,” to support the trend. The ParallelPeriod function works as follows: The function returns a value from a prior period in the same relative position in the time dimensional hierarchy as the specified value. The three arguments are a dimension level argument (for example, DimTime.CalendarTime.Years), a numeric value to say how many periods back the parallel period is, and a specific value or member (for example, DimTime.CalendarTime.CurrentMember). ParallelPeriod is one of hundreds of powerful built-in functions. CurrentMember is another built-in function used for the calculation of the trend value. It’s simpler to understand. As you’d probably guess, it retrieves the currently selected member value.

    /* The periodicity of this trend comparison is defined by the level
       at which the ParallelPeriod is evaluated. */
    IIf
    (
        KPIValue( "Net Income" ) >
        (
            KPIValue( "Net Income" ),
            ParallelPeriod
            (
                [<<Time Dimension Name>>].[<<Time Hierarchy Name>>].[<<Time Year Level Name>>],
                1,
                [<<Time Dimension Name>>].[<<Time Hierarchy Name>>].CurrentMember
            )
        ),
        1,
        -1
    )
As we start to examine KPIs, you are getting a taste of the MDX query language. At this point, we prefer to introduce you to KPIs conceptually. In the chapters dedicated to MDX (Chapters 10 and 11), we’ll provide you with several complete KPI examples with the complete MDX syntax and a fuller explanation of that syntax. One last point is important when you’re adding KPIs. When designing your BI solution KPIs, you need to know whether to choose server-based KPIs, client-based KPIs, or a combination of both. If your user tools support the display of Analysis Services KPIs, server-based KPIs are most commonly used because they are created once (in the cube) and reused by all users. Server-based KPIs present a uniform view of business data and metrics—a view that is most often preferred to using client-based KPIs. If your requirements call for extensive dashboards that include KPIs, you might want to investigate products such as Microsoft PerformancePoint Server, which are designed to facilitate quick-and-easy creation of such dashboards. We cover client tools in general in Part IV. It’s also often a requirement to provide not only summarized KPIs on a dashboard for end users, but also to enable drillthrough back to detailed data behind the rollup. In the next section, you’ll see how to enable drillthrough (and other types of) actions for an OLAP cube.
Note New to Microsoft SQL Server 2008 is the ability to programmatically create KPIs. We cover this in detail in Chapter 11.
Actions Actions are custom-defined activities that can be added to a cube and are often presented as options when an end user right-clicks in an area of a cube browser. A critical consideration is whether the client applications that you’ve selected for your BI solution support action invocations. Of course, if you’re creating your own client, such as by implementing a custom Windows Forms or Web Forms application, you can implement actions in that client in any way you choose. Common implementations we’ve seen include custom menus, custom shortcut menus, or both. The built-in cube browser in BIDS does support actions. This is convenient for testing purposes. To view, add, or change actions, you use the Actions tab in the OLAP cube designer. There are three possible types of actions: regular, drillthrough, and reporting. Actions are added to SSAS cubes and targeted at a particular section—that is, an entire dimension, a particular hierarchy, a particular measure group, and so on. Here is a brief explanation of the three action types:
■■ Regular action This type of action enables end users to right-click on either cube metadata or data and to perform a subsequent activity by clicking a shortcut menu command. This command passes the data value of the cell clicked (as a parameter) to one of the following action types: Dataset, Proprietary, Rowset, Statement, or URL. What is returned to the end user depends on which action type has been set up in BIDS. For example, if a URL action has been configured, a URL is returned to the client, which the client can then use to open a Web page. Rowset actions return rowsets to the client and Dataset actions return datasets. Statement actions allow the execution of an OLE DB statement.
■■ Reporting action This type of action allows end users to right-click on a cell in the cube browser and pass the value of the location clicked as a parameter to SQL Server Reporting Services (SSRS). Activating the action causes SSRS to start, using the custom URL; this URL includes the cell value and other properties (for example, the format of the report, either HTML, Excel, or PDF). An interesting property is the invocation type. Options for this property are interactive (or user-initiated), batch (or by a batch processing command), or on open (or application-initiated). This property is available for any action type and is a suggestion to the client application on how the action should be handled.
■■ Drillthrough action This type of action enables end users to see some of the detailed source information behind the value of a particular measure—for example, for this discount amount, what are the values for x, y, z dimensional attributes. As mentioned, this type of action is used frequently in conjunction with KPIs.
As with the KPI designer, the actions designer in BIDS includes an intelligent metadata browser. This contains a list of all actions created for the cube, cube metadata, functions, and action templates. The configuration required for the various action types depends on which type was selected. For example, for regular actions, you must give your action a unique name, set the target type, select the target object, set the action type, and provide an expression for the action. The Target Type option requires a bit more explanation. This is the type of object that the end user clicks to invoke the actions. You can select from the following list when configuring the Target Type option: Attribute Members, Cells, Cube, Dimension Members, Hierarchy, Hierarchy Members, Level, or Level Members. In Figure 8-7, the City Map action contains information for the Additional Properties options of Invocation (which allows you to select Batch, Interactive, or On Open from the drop-down list), Description (which provides a text box into which you can type in a description), and Caption (which has a text box for you to type a caption into). Note that the caption information is written in MDX in this example and the last optional parameter, Caption Is MDX, has been set to True.
Figure 8-7 When configuring regular actions in BIDS, you must specify the target type, object, condition, and action content.
You can also optionally supply one or more conditions that are required to be met before the action is invoked. An example of a condition for our particular sample action would be an MDX expression specifying that target values (in this case, cities) must originate from countries in North America only. Of course, Figure 8-7 does not display some key information—specifically, that of the Action Content section. The Action Content section provides a Type drop-down list, from which you can select Dataset, Proprietary, Rowset, Statement, or URL. As mentioned, the Action Content parameter specifies the return type of the action result. In our sample, shown in Figure 8-8, the action result is set to URL.
So, for our sample, end users must right-click on a member associated with the Geography dimension and City attribute to invoke this particular action. As the Description text box in Figure 8-7 states, the result will display a map for the selected city. If you look at the action expression in Figure 8-8, you can see that it constructs a URL using a concatenated string that is built by using the MDX function CurrentMember.Name and the conditional Case…When…Then expression.
Figure 8-8 The Action Content section for a regular cube action contains an MDX statement to return a result of the type specified in the Type parameter.
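As a simplified, hypothetical sketch of that pattern (it is not the sample's exact expression, and the mapping URL is a placeholder), a URL-type action expression can concatenate member names into a string, with a Case expression choosing the format:

    // Hypothetical URL-type action expression; the action's target is City members
    Case
        When [Geography].[Country].CurrentMember.Name = "United States"
        Then "http://maps.example.com/?q=" +
             [Geography].[City].CurrentMember.Name + ", " +
             [Geography].[State-Province].CurrentMember.Name
        Else "http://maps.example.com/?q=" +
             [Geography].[City].CurrentMember.Name + ", " +
             [Geography].[Country].CurrentMember.Name
    End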
The next type of action is a reporting action. This action functions similarly to a regular action with a URL target. A particular area of the cube is clicked and then metadata is passed as a parameter to an SSRS report URL. In the Adventure Works sample, the reporting action is named Sales Reason Comparisons. Figure 8-9 shows the configuration parameters. Many of these parameters—such as Target Type, Target Object, and Condition—are identical to those for regular actions. The set of parameters starting with the Report Server section is unique to this action type. As you can see in Figure 8-9, you need to specify the following SSRS-related options: Server Name, Report Path, and Report Format. The Report Format drop-down list includes the following choices: HTML5, HTML3, Excel, and PDF. You can optionally supply additional parameters in the ProductCategory row. Finally, in the Additional Properties section, you have
options to specify invocation type, a description, and other information, just as you did for regular actions with URL targets.
Figure 8-9 A reporting cube action associates cube metadata with an SSRS report URL.
Another consideration if you’re planning to use reporting actions is determining the type of credentials and number of hops that will have to be navigated between the physical servers on which you’ve installed your production SQL Server Analysis Services (SSAS) and SSRS instances. We cover the topic of SSRS security configuration (and integration with SSAS in general) in greater detail in Part IV. At this point, you should at least consider what authentication type (for example, Windows, Forms, or custom) users will use to connect to Reporting Services. When you implement custom actions, in addition to selecting and configuring authentication type, you’ll usually also create one or more levels of authorization (such as roles or groups). We typically see custom actions limited to a subset of the BI project’s end users (such as the analyst community). The last type of action available on the Actions tab is the drillthrough action. To understand what this action enables, look at the sample cube in the BIDS cube browser. Figure 8-10 shows a view of the cube that includes the results of right-clicking on a data cell (that is, a measure value) and clicking Drillthrough. Drillthrough refers to an ability to make source data that is included in the cube available (at the level of a particular column or group of columns from the DSV). Drillthrough does not mean enabling end users to drill back to the source systems. Of course, if you plan to use drillthrough, you must also verify that all end-user tools support its invocation. Excel 2007 supports drillthrough. SSRS doesn’t support actions.
You can set up an SSRS report to look like it supports drillthrough, but you have to build this functionality (that is, the links between reports) specifically for each report.
Note It’s easy to confuse drillthrough with drill down when designing client applications. Drillthrough is a capability of OLAP cubes that can be enabled in BIDS. Drillthrough allows end users to see more detail about the particular selected item. Drill down is similar but not the same. Drill down refers to a more general client reporting capability that allows summary-level detail to be expandable (or linkable) to more detailed line-by-line information. An example of drill down is a sales amount summary value for some top-level periods, such as for an entire year, for an entire division, and so on. Drill down allows end users to look at row-by-row, detailed sales amounts for days, months, individual divisions, and so on. Drillthrough, in contrast, enables end users to see additional attributes—such as net sales and net overhead—that contributed to the total sales number.
Figure 8-10 A drillthrough cube action allows clients to return the supporting detail data for a given cell.
Another consideration when you’re configuring the drillthrough action is whether to enable only columns that are part of the business requirements. More included columns equals more query processing overhead. We’ve seen situations where excessive use of drillthrough resulted in poor performance in production cubes. There are a few other oddities related to drillthrough action configurations.
Unlike the process for creating regular actions and reporting actions, when you’re creating drillthrough actions you need to specify a target measure group (rather than object and level). Figure 8-11 shows an example of the syntax used to create a drillthrough action targeted at the Reseller Sales measure group. Note that you can optionally configure the maximum number of rows returned via the drillthrough query. For the reasons just mentioned, we recommend that you configure this value based on requirements and load testing.
Figure 8-11 Drillthrough actions are targeted at particular measure groups.
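When a user invokes the action, the client tool in essence sends SSAS a DRILLTHROUGH statement (you can also issue one yourself in SQL Server Management Studio). The following is a minimal sketch against the Adventure Works sample; the measure, dimension, and attribute names come from that sample, while the member key and MAXROWS value are illustrative, and a real action returns the columns configured for it:

DRILLTHROUGH MAXROWS 50
SELECT ( [Measures].[Reseller Sales Amount],
         [Date].[Calendar Year].&[2004] ) ON 0
FROM [Adventure Works]
RETURN [$Reseller].[Reseller], [$Product].[Product]

The statement returns the detail rows behind the single cell identified on the axis, limited to 50 rows and to the two columns listed in the RETURN clause.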
Note Sometimes our clients are confused about two features: drillthrough and writeback. Just to clarify, drillthrough is the ability to view selected source data behind a particular measure value. Writeback is the ability to view and change source data. Writeback requires significantly more configuration and is discussed later in this chapter.

A final consideration when implementing drillthrough actions is that they require drillthrough permission to be granted to the roles whose members you want to be able to use them. These permissions are set at the level of the cube. Figure 8-12 shows the role designer in BIDS at the Cubes tab. To set up drillthrough permissions for a particular role's members, simply enable it via this tab.
Figure 8-12 Drillthrough actions require role permission for drillthrough at the level of the cube.
As we did with KPIs, we'll now review an item that we most often add to a customer's OLAP cubes based on needs that surfaced during the business requirements gathering phases of the project. This type of object is more complex than the ones covered so far and itself comes in several varieties. It's called a calculation in BIDS.
Calculations (MDX Scripts or Calculated Members)

The next area to use to enhance your OLAP cube using the BIDS cube designer is the Calculations tab. This tab allows you to add three types of objects to your cube: calculated members, named sets, and script commands. These objects are all defined using MDX. We'll take a light tour of this area in this section, with the intent of getting you comfortable reading the calculations included in the Adventure Works sample. As mentioned, in a couple of subsequent chapters, we'll examine the science of authoring native MDX expressions and queries. We understand that you might be quite familiar with MDX and eager to code natively; however, we caution you that MDX is deceptively simple. Please heed our advice: learn to read it first, and then work on learning to write it.

Note We've had many years of experience with this tricky language and still don't consider ourselves to be experts. Why is MDX so difficult? The reason is not so much the language itself; MDX's structure is (loosely) based on Transact-SQL, a language many of you are quite familiar with. Rather, the difficulty lies in the complexity of the OLAP store that you're querying. An n-dimensional structure is nearly impossible to visualize, which makes writing accurate queries and expressions very challenging.
To get started, we’ll open the Calculations tab for the Adventure Works sample in BIDS. The default view is shown in Figure 8-13. Notice that the tab is implemented in a similar way to others in BIDS (KPIs, Actions, and so on) in that there is a tree view on the upper left side that lists all objects. This is the Script Organizer section. There you can see the three types of objects, with cube icons indicating top-level queries, sheet-of-paper icons indicating script commands, and blue calculator icons indicating calculated members. Adventure Works also contains named sets; these are indicated with black curly braces.
Figure 8-13 The Calculations tab in BIDS provides you with a guided calculated member designer.
Below the Script Organizer section, you'll find the Calculation Tools section, which includes the Metadata, Functions, and Templates tabs. The functions and templates shown are filtered to reflect the object types on this tab (calculated members and so on). The default view of the right side of the designer displays the selected object in a guided interface. In Figure 8-13, we've selected a calculated member called [Internet Gross Profit Margin]. You can see from Figure 8-13 that you must complete the following information to define a calculated member:
■■ Name  Must be unique. If the name uses embedded spaces, it must be surrounded by square brackets ([ ]).
■■ Parent Properties  This section includes options for specifying the names of the hierarchy and member where the new member will be defined. Most often this will be the Measures hierarchy. The Measures hierarchy is flat (that is, it has only one level), so if you're defining your calculated member here (and most often you will be), there is no parent member because there is only one level in this dimensional hierarchy. In Chapters 10 and 11, which cover MDX in detail, we explain situations when you might want to define calculated members in hierarchies other than measures.
■■ Expression  This is the MDX expression that defines how the value will be calculated. In the simple example shown in Figure 8-13, the MDX expression is very straightforward—that is, (Sales – Cost) / Sales = Gross Profit. In the real world, these calculations are often much more complex. Fortunately, MDX includes a rich function library to make these complex expressions easier to write. At this point, you can think of this statement as somewhat analogous to a cell formula in an Excel workbook. This is an oversimplification, but it will get you started. (A sketch of the corresponding script statement appears after this list.)
■■ Additional Properties  This section is optional. The options in this section include specifying the format type, visibility (where you can specify that you want to build calculated members using other calculated members and that sometimes you want to hide the intermediate measures to reduce complexity or to meet security requirements), and more. The Color Expressions and Font Expressions sections function similarly to conditional formatting in Excel, except that you write the expressions in MDX.
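In script form, the designer entries for a calculated member like this one boil down to a single CREATE MEMBER statement. The following is a minimal sketch, assuming the Adventure Works measures [Internet Sales Amount] and [Internet Total Product Cost]; the exact properties the sample sets may differ:

CREATE MEMBER CURRENTCUBE.[Measures].[Internet Gross Profit Margin]
 AS ( [Measures].[Internet Sales Amount] - [Measures].[Internet Total Product Cost] )
    / [Measures].[Internet Sales Amount],
FORMAT_STRING = "Percent",
VISIBLE = 1;

The name, parent hierarchy (Measures), expression, and additional properties from the guided designer map directly to the pieces of this statement.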
You might be surprised to see that the Adventure Works sample cube includes more than 20 calculated members, as well as a number of named sets and script commands. It's important that you understand exactly how each of these object types is evaluated when you're deciding whether or not to include them. Table 8-1 summarizes this information.

Table 8-1 Types of Calculated Objects for OLAP Cubes

Calculated members
  Advantages: Calculated at query time; results are not stored; does not add to cube processing time or disk storage. The SSAS engine is optimized to execute them; cubes can support hundreds of calculated members.
  Disadvantages: Although this approach is fast, they're not as fast to query as data that is stored (that is, part of a fact table). Must be defined by writing MDX expressions.

Named sets
  Advantages: Can make other MDX queries that refer to named sets more readable. Easy to understand, similar to relational views.
  Disadvantages: Must be defined by writing MDX queries; query syntax must retrieve the correct subset and do so efficiently. Must be retrieved in the correct order if multiple named sets are used as a basis for other MDX queries.

Script commands
  Advantages: Very powerful, completely customizable, can limit execution scope very selectively.
  Disadvantages: Very easy to make errors when using them. Complex to write, debug, and maintain. If not written efficiently, can cause performance issues. Must monitor member intersections if combining from non-measures hierarchies.
Before we leave our (very brief) introduction to calculations in BIDS, we’ll ask you to take another look at the Calculations tab. Click the twelfth button from the left on the embedded toolbar, named Script View. Clicking this button changes the guided designer to a single, all-up MDX script view that contains the MDX statements that define each object (whether it is a named set, calculated member, or other type of object) in a single file. Be sure to keep in mind that when you’re adding MDX calculated objects to your OLAP cube, the order in which you add them is the order in which the script creates them. This is because you could potentially be creating objects in multiple dimensions; if the objects are created in an order other than what you intended, inappropriate overwriting could occur.
In fact, by reading the MDX and the included comments, you’ll see several places in the Adventure Works sample where particular objects are deliberately positioned at a particular point in the execution order. In Figure 8-14, this information is thoroughly commented. As you complete your initial review of the included MDX calculations, you’ll want to remember this important point and be aware that in addition to the physical order of code objects, you can also use MDX keywords to control the overwrite behavior. We cover this topic in more detail in Chapters 10 and 11.
Figure 8-14 The Calculations tab in BIDS provides a complete view of all scripted objects in the order they are executed.
As you begin to explore the MDX code window, you might see that there is a syntax validate button on the embedded toolbar. Oddly, it uses the classic spelling-checker icon—ABC plus a check mark. This seems rather inappropriate given the complexity of MDX! Also, if you have development experience coding with other languages using Visual Studio, you might wonder whether you’ll find any of the usual assistance in the integrated development environment when you begin to write MDX code manually. Fortunately, you’ll find several familiar features. For example, if you click in the left margin in the view shown in Figure 8-14, you can set breakpoints in your MDX calculations code. We’ve found this to be a very valuable tool when attempting to find and fix errors. We’ll explore this and other coding aspects of MDX in future chapters. Note Chapters 10 and 11 are devoted to a deeper look at MDX syntax. There we’ll look at both MDX expression and query syntax so that you can better understand, edit, or write MDX statements natively if your business needs call for that.
New in SQL Server 2008 are dynamic named sets; the set values are evaluated at each call to the set. This behavior differs from that of named sets in SQL Server 2005, where named sets were evaluated only once, at the time of set creation. The new flexibility introduced in SQL Server 2008 makes named sets more usable. We’ll take a close look at dynamic named set syntax in Chapter 11.
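To illustrate the difference, here is a minimal sketch of both forms. The set names are ours, the dimension, level, and measure names come from the Adventure Works sample, and the 2004 member key is illustrative; only the DYNAMIC form is re-evaluated in the context of each query that references it:

CREATE SET CURRENTCUBE.[Top 25 Resellers 2004]
 AS TopCount( [Reseller].[Reseller].[Reseller].Members, 25,
              ( [Measures].[Reseller Sales Amount],
                [Date].[Calendar Year].&[2004] ) );

CREATE DYNAMIC SET CURRENTCUBE.[Top 25 Resellers]
 AS TopCount( [Reseller].[Reseller].[Reseller].Members, 25,
              [Measures].[Reseller Sales Amount] );

The static set is evaluated once, when the calculation script runs; the dynamic set's top 25 can change with the slicer of each query that uses it.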
Using Cube and Dimension Properties

Generally, there are a couple of ways to set advanced configurable SSAS object properties (that is, cube and dimension properties) using BIDS. One way is via the standard pattern used in Visual Studio: open the object of interest in Solution Explorer, and then open the Properties window by pressing F4 or by right-clicking on the appropriate object and choosing Properties. The Properties window displays (or can be set to display) the viewable and configurable object-specific properties.

Tip If you right-click on an object in Solution Explorer and choose Properties, you'll see only the container's properties (the file or database object that the actual code is stored in). To see the actual object properties, you must open its designer in BIDS and select the appropriate object in the designer to see all the configurable properties.
Before we drill too deeply into advanced property configurations, we’ll explore yet another way to configure object properties: by using the Business Intelligence Wizard. This wizard is accessed by clicking what is usually the first button on the embedded toolbar on any SSAS object tab. This is shown in Figure 8-15. This option is available in both the cube and dimension designers.
Figure 8-15 The Business Intelligence Wizard is available on the embedded toolbars in BIDS.
After you open the Business Intelligence Wizard, you'll see on the first page that there are eight possible enhancements. Following is a list of the available options, which are also shown in Figure 8-16:
■■ Define Time Intelligence
■■ Define Account Intelligence
■■ Define Dimension Intelligence
■■ Specify A Unary Operator
■■ Create A Custom Member Formula
■■ Specify Attribute Ordering
■■ Define Semiadditive Behavior
■■ Define Currency Conversion
Figure 8-16 The Business Intelligence Wizard in the BIDS cube designer allows you to make advanced property configuration changes easily.
What types of changes do these enhancements make to cubes and dimensions? They fall into two categories: advanced property configurations and MDX calculations, or some combination of both types of enhancements. We’ll start our tour with advanced property configurations because the most common scenarios have been encapsulated for your convenience in this wizard. We find this wizard to be a smart and useful tool; it has helped us to be more productive in a number of ways. One feature we particularly like is that the suggestions the wizard presents on the final page are in the format in which they’ll be applied—that is, MDX script, property configuration changes, and so on. On the final page, you can either confirm and apply the changes by clicking Finish, cancel your changes by clicking Cancel, or make changes to the enhancements by clicking Back and reconfiguring the pages for the particular enhancement that you’re working with. Note If you invoke the Business Intelligence Wizard from inside a Dimension Editor window, it displays a subset of options (rather than what is shown in Figure 8-16). The wizard displays only enhancements that are applicable to dimension objects.
Time Intelligence

Using the Define Time Intelligence option allows you to select a couple of values and then have the wizard generate the MDX script to create a calculated member to add the new time view. The values you must select are the target time hierarchy, the MDX time function, and the targeted measure or measures. In our example, we've selected Date/Fiscal, Year To Date, and Internet Sales Amount, respectively, for these values. In Figure 8-17, you can see one possible result when using the Adventure Works sample. After you review and accept the changes by clicking Finish, this MDX script is added to the Calculations tab of the cube designer. There you can further (manually) edit the MDX script if desired.
Figure 8-17 The Business Intelligence Wizard in the BIDS cube designer allows you to add calculated members based on generated MDX.
We frequently use this wizard to generate basic calculated members that add custom time views based on business requirements. We also have used this wizard with developers who are new to MDX as a tool for teaching them time-based MDX functions.
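The wizard's actual output is fairly elaborate (it scopes the calculation to a new date-calculations attribute that it adds to the Date dimension), but a hand-written sketch of the simplest equivalent, using the Adventure Works fiscal hierarchy and the measure named above, looks roughly like this:

CREATE MEMBER CURRENTCUBE.[Measures].[Internet Sales Amount YTD] AS NULL;

SCOPE ( [Measures].[Internet Sales Amount YTD] );
    THIS = Aggregate(
        PeriodsToDate( [Date].[Fiscal].[Fiscal Year],
                       [Date].[Fiscal].CurrentMember ),
        [Measures].[Internet Sales Amount] );
END SCOPE;

PeriodsToDate builds the set of fiscal periods from the start of the current fiscal year up to the current member, and Aggregate rolls the measure up over that set.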
SCOPE Keyword

Notice in the preceding script that the calculated member definition in MDX includes a SCOPE…END SCOPE keyword wrapper. Why is that? What does the SCOPE keyword do? It defines a subset of a cube to which a calculation is applied. This subset is called a subcube. Three MDX terms are used to manage scope: CALCULATE, SCOPE, and THIS. The simplest way to understand how these keywords work is to examine the example presented in SQL Server Books Online:

/* Calculate the entire cube first. */
CALCULATE;
/* This SCOPE statement defines the current subcube */
SCOPE([Customer].&[Redmond].MEMBERS, [Measures].[Amount], *);
    /* This expression sets the value of the Amount measure */
    THIS = [Measures].[Amount] * 1.1;
END SCOPE;
As you can see by this code snippet, the CALCULATE keyword applies some type of custom calculation to the cells in the entire cube. The SCOPE statement defines a subset (or subcube) of the entire cube that calculations will be performed against. Finally, the THIS expression applies a calculation to the defined subcube. Note For more information, see the SQL Server Books Online topic “Managing Scope and Context (MDX)” at http://msdn.microsoft.com/en-us/library/ms145515.aspx.
Account Intelligence and Unary Operator Definitions

Using the Define Account Intelligence option (which is available only in the Enterprise edition of SSAS) allows you to select a couple of values and have the wizard correctly configure advanced cube or dimension properties so that standard accounting attributes, such as Chart Of Accounts or Account Type, can be associated with dimension members. To use this option, you must first select the dimension to which the new attributes will be applied. For our example, we'll select the Account dimension from Adventure Works. On the next page of the wizard, you need to select source attributes in the selected dimension that will be assigned to standard account values. Also, note that these attributes are set to be semiadditive by default. This means that the default aggregation (sum) will be overridden to reflect standard aggregation based on account type—for example, profit = sales – costs. These options are shown in Figure 8-18.
Figure 8-18 The Business Intelligence Wizard in the BIDS cube designer allows you to map advanced properties to dimension attributes.
For the wizard to correctly interpret the semiadditive aggregation behavior, you also have to map source attributes to built-in account types, such as asset, balance, and so on. You do this on the next page of the wizard. The mapping is generated automatically by the wizard and can be altered if necessary. The last step in the wizard is to review the suggested results and confirm by clicking Finish. This is shown in Figure 8-19.

As you can see by reviewing the results of the wizard, what you've done is configure two advanced dimension member properties for all selected attributes of the targeted dimension, which is Account in our sample. The two properties are AggregationFunction and Alias. AggregationFunction applies a particular MDX function that overrides the default SUM aggregation for Account dimension member attributes based on a standard naming strategy—for example, the LastNonEmpty function for the Flow member. Note also that the AggregationFunction property value for the Amount measure of the Financial Reporting measure group has been set to the value ByAccount.

The Account Intelligence feature is specialized for scenarios in which your business requirements include modeling charts of accounts. This is, of course, a rather common scenario. So if your business intelligence (BI) project includes this requirement, be sure to take advantage of the wizard's quick mapping capabilities. You can further modify any of the configured properties by opening the dimension and attribute property sheets, locating the properties (such as AggregationFunction), and updating the property values to the desired configuration value from the property sheet itself.
Figure 8-19 The Business Intelligence Wizard in the BIDS cube designer allows you to define custom aggregation behavior.
The Specify A Unary Operator feature is used in situations similar to those for the Account Intelligence feature—that is, where your business requirements include modeling charts of accounts. We've used the former feature when the modeled source data is financial but sits in a more unstructured or nonstandard type of hierarchy. Another way to understand this difference is to keep in mind that unary operators are used when you do not have source attributes that map cleanly to the standard types listed in the Business Intelligence Wizard. You can see the Specify A Unary Operator page of the Business Intelligence Wizard in Figure 8-20.

We'll use the Account dimension from Adventure Works to demonstrate the use of the Specify A Unary Operator feature. Note that on the second page of the wizard the source data must be modeled in a particular way to use this feature. In Figure 8-21, you'll see that the source dimension must include a key attribute, a parent attribute, and a column that contains the unary operator (which will define the aggregation behavior for the associated attribute). When the wizard completes, the UnaryOperatorColumn property value is set to the column (attribute) value that you specified previously in the wizard.
Figure 8-20 The Business Intelligence Wizard in the BIDS cube designer allows you to define unary operators to override default aggregation behavior.
Figure 8-21 Using the Specify A Unary Operator enhancement requires a particular structure for the source data.
Note Usual unary operator values are the following: +, –, or ~. The final value listed, the tilde, indicates to SSAS that the value should be ignored in the aggregation. Taking a look at the source data used in creating the Account dimension can help you to understand the use of a unary operator in defining a custom hierarchy and rollup for accounting scenarios. You can see in Figure 8-21 that the first value, Balance Sheet, should be ignored. The next 22 values (from Assets to Other Assets) should be summed. Line 25 (Liabilities And Owners Equity) should be subtracted. You can also see by looking at the sample data that the hierarchy in the data is defined by the value provided in the ParentAccount column.
Other Wizard Options

Using the Define Semiadditive Behavior option allows you to define the aggregation method that you need to satisfy the project's business requirements. This option is even more flexible than the Account Intelligence or Unary Operator options that we just looked at. Using this option, you can simply override the automatically detected aggregation type for a particular attribute value in any selected hierarchy. BIDS applies the types during cube creation, using the attribute value names as a guide. Any changes that you make in the wizard update the attribute property named AggregationFunction. You can select any one value from the drop-down list. These values are as follows: AverageOfChildren, ByAccount, Count, DistinctCount, FirstChild, FirstNonEmpty, LastChild, LastNonEmpty, Max, Min, None, and Sum. We remind you that all of these wizard-generated enhancements are available only in the Enterprise edition of SSAS.

Using the Define Dimension Intelligence option allows you to map standard business types to dimension attributes. You use this option when your end-user applications are capable of adding more functionality based on these standard types. In the first step of the wizard, you select the dimension to which you want to apply the changes, and on the next page of the wizard you map the attribute values to standard business types. The wizard suggests matches as well. After you complete the mapping, the wizard displays a confirmation page showing the suggested changes, and you can confirm by clicking Finish, or you can click Back and make any changes. Confirming results in the Type property for the mapped attributes being set to the values that you previously selected.

Using the Specify Attribute Ordering option allows you to revise the default dimensional attribute ordering property value. The possible values are (by) Name or (by) Key. As with other options, you select the dimension and then update the dimension attributes that you want to change. After you've made your selections, the final wizard page reflects the changes in the attribute ordering property value. This property is named OrderBy.
Using the Create A Custom Member Formula option allows you to identify any source column as a custom rollup column. That will result in replacing the default aggregation for any defined attributes in the selected dimension with the values in the source column. It’s the most flexible method of overriding the default sum aggregation. To use this option, you’ll first select the dimension with which you want to work. Next you’ll map at least one attribute to its source column. The wizard configures the attribute property CustomRollupColumn to the value of the source column that you selected.
Currency Conversions

Using the Define Currency Conversion option allows you to associate source data with the currency conversions needed to satisfy your business requirements. Running the wizard results in a generated MDX script and generated property value configurations. As with the results of selecting the Define Time Intelligence option, the generated MDX script for the Define Currency Conversion option creates a calculated member for the requested conversions. This script is much more complex than the one that the Define Time Intelligence option generates.

There are some prerequisite structures you must implement prior to running the wizard. They include at least one currency dimension, at least one time dimension, and at least one rate measure group. For more detail on the required structure for those dimensions, see the SQL Server Books Online topic "Currency Conversions." Also, as shown in Figure 8-22, we've selected the source tables involved in currency conversion (at least for Internet Sales) and created a data source view (DSV) in BIDS so that you can visualize the relationship between the source data. Note the indirect relationship between the Internet Sales Facts and Currency tables via the key relationship in the Currency Rate Facts table in Figure 8-22. The Currency Rate Facts table relates the Date table to the currency so that the value of a currency on a particular date can be retrieved.

Start the Business Intelligence Wizard and select Define Currency Conversion on the first page. On the next page of the wizard, you'll be asked to make three choices. First, you select the measure group that contains the exchange rate. Using our Adventure Works sample cube, that measure group is named Exchange Rates.
Figure 8-22 Using the Define Currency Conversion option requires a particular set of source tables in the dimensions that are affected.
Next you select the pivot (or translating) currency, and then select the method you've used to define the exchange rate. Figure 8-23 shows this wizard page. Continuing our example, we've defined the pivot currency as US dollars. We've further specified that the exchange rates have been entered as x Australian dollars per each US dollar.

On the Select Members page of the wizard, you select which members of the measures dimension you'd like to have converted. As you configure this option, the exchange rate measure values that you can select from are based on the table columns that are available from the measure group that you just selected. In our example, the measure group is Exchange Rates, the source table is Fact Currency Rate, and the source columns from that table are AverageRate and EndOfDayRate. You can see these options in Figure 8-24. Note that you can select different exchange rate measure values for each selected measure. If you take a minute to consider the possibilities, you'll understand that this wizard is powerful. For example, it can accommodate sophisticated business scenarios, such as the following: freight cost values should be translated based on end-of-day currency conversion rates, while Internet sales values should be translated based on average currency conversion rates.
Figure 8-23 On the Set Currency Conversion Options page, you select the pivot and value currencies.
Figure 8-24 On the Select Members page, you define the members and method for currency translation.
You’ll also note by looking at Figure 8-24 that you can apply currency conversions to members of dimensions other than the measures dimension. This functionality is particularly useful if your requirements include modeling accounting source data that is expressed in multiple currencies.
On the next page of the wizard, you're presented with three options for performing the currency localization: many-to-many, many-to-one, or one-to-many. Although there is a description of each option on the wizard page, we'll summarize what the options do here:
■■ Many-to-many  The source currencies are stored in their local (or original) formats and translated, using the pivot currency, into multiple destination (or reporting) formats. An example would be to translate currencies A, B, and C using the pivot currency (US dollars) into multiple destination currencies D, E, and F.
■■ Many-to-one  The source currencies are stored in their local (or original) formats and translated using the pivot currency. All sources use the pivot currency value as the destination (or reporting) value. An example would be to translate currencies A, B, and C all into US dollars.
■■ One-to-many  The source currency is stored using the pivot currency and translated into many reporting currencies. An example would be that all sources store values as US dollars and translate US dollars into multiple destination currencies.
After you select the cardinality, you have two more choices in this wizard. You must identify whether a column in the fact table or an attribute value in a dimension should be used to identify localized currency values. The last step is to select your reporting (or destination) currencies. The wizard now has enough information to generate an MDX script to execute your currency calculation. As mentioned, the wizard will also configure a couple of property values, such as Destination Currency Code (type). You can see the property changes and the MDX script in the final step of the wizard. If you take a minute to review the resultant MDX script, you can see that it’s relatively complex. It includes the MDX SCOPE keyword, along with several other MDX functions. You might also remember that after you click Finish to confirm the creation of the currency conversion, this script is added to the Calculations tab of the cube designer so that you can review or update it as needed.
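The generated script is essentially a set of nested SCOPE assignments. The following is a much-simplified sketch of the general shape of a many-to-one conversion, not the wizard's actual output; it assumes a [Currency] dimension whose [Currency] attribute contains a [US Dollar] member, and it uses the Average Rate measure (expressed as local units per one US dollar, as in our example) to translate leaf-level amounts into the pivot currency:

SCOPE ( { [Measures].[Internet Sales Amount] } );
    SCOPE ( Leaves([Date]),
            Except( [Currency].[Currency].[Currency].Members,
                    { [Currency].[Currency].[US Dollar] } ) );
        /* Convert each local-currency amount into US dollars. */
        THIS = [Measures].[Internet Sales Amount] / [Measures].[Average Rate];
    END SCOPE;
END SCOPE;

Even in this reduced form, you can see why the prerequisites listed earlier (a currency dimension, a time dimension, and a rate measure group) matter: the assignment needs all three to locate the right rate for each leaf-level date.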
Advanced Cube and Dimension Properties

After you've completed your OLAP cube by adding translations, perspectives, actions, KPIs, and calculations and adding BI with the wizard, you're ready to build and deploy it. During development, you usually just use all the default cube processing options so that you can quickly test and view the results. However, when you move to production, you'll probably want to use at least some of the myriad possible configuration settings. The reason for this is that during development you'll often just use a small subset of data when processing your cube. In this situation, cube processing times will probably be short (usually in minutes or even seconds). Also, probably no one other than the developers will be accessing the cube, so if the cube is unavailable for browsing during frequent (test) processing, few people will be concerned about this.
This situation will change when you deploy your cube to production. There you could be working with massive amounts of data (some clients have cubes in the multi-TB range), as well as hundreds or even millions of end users who expect nearly constant access to the cubes. In these situations, you tune the dimension and cube processing settings. For this reason, we turn our attention to how to do just that in the next chapter. Tip We suggest that you download a free tool from CodePlex called BIDS Helper, which can be found at http://www.codeplex.com/bidshelper. This tool contains a number of useful utilities that we use to help us with advanced cube design techniques. The utilities include the following: Aggregation Manager, Calculation Helpers, Column Usage Reports, Delete Unused Aggregations, Deploy Aggregation Designs, Deploy MDX Script, Dimension Data Type Discrepancy Check, Dimension Health Check, Dimension Optimization Report, Measure Group Health Check, NonDefault Properties Report, Printer Friendly Dimension Usage, Printer Friendly Aggregations, Tri-State Perspectives, Similar Aggregations, Smart Diff, Show Extra Properties, Update Estimated Counts, Validate Aggregations, and Visualize Attribute Lattice.
Summary We’ve covered quite a bit of the functionality available for building OLAP cubes in BIDS, and we did this in two (albeit long) chapters! Did you notice that we talked about concepts for six chapters and implementation for only two chapters? The fact that BIDS is easy to use if you fully understand OLAP concepts before you start developing your BI solutions is a key point to remember from reading this book. We’ve repeatedly seen projects in trouble because of the (natural) tendency to just open BIDS and get on with it. If we’ve done our job, we’ve given you enough information to develop a very useful OLAP cube by now. We still have more to cover, though. In the next chapter, we’ll dive into the world of cube processing. After that, we’ll look at MDX in detail. Still later, we’ll get to data mining. You have covered a good bit of ground already, and the basics of OLAP are under your belt at this point!
Chapter 9
Processing Cubes and Dimensions

Now that you have developed and refined an OLAP cube, it's time to learn how to build and process Microsoft SQL Server 2008 Analysis Services (SSAS) objects and deploy them from the Business Intelligence Development Studio (BIDS) environment to the SSAS server. To enable you to understand what steps are involved in cube and dimension processing, we'll first define the two different types of information that you'll work with in creating OLAP cubes: metadata and cube data. Then we'll explain aggregations and examine the role that XMLA plays in creating OLAP cubes and dimensions. We'll close the chapter with a detailed look at the possible cube data storage modes: multidimensional OLAP (MOLAP), hybrid OLAP (HOLAP), relational OLAP (ROLAP), or a combination of all three.
Building, Processing, and Deploying OLAP Cubes

After you've completed your OLAP cube by optionally adding translations, perspectives, actions, KPIs, calculations, and business intelligence logic, you're ready to build and deploy it. During development, you typically use the default cube processing options so that you can quickly test and view the results. However, when you move to production, you probably want to use at least some of the myriad possible configuration settings. The reason for this is that during development you'll often just use a small subset of data when processing your cube. In this situation, cube processing times will probably be short (usually measured in minutes or even seconds). Also, probably no one other than the developers will be accessing the cube, so if the cube is unavailable for browsing during frequent test full cube processing, then few people will be concerned.

The situation changes when you deploy your cube to production. There you could be working with massive amounts of data (some clients have cubes in the multiterabyte range) and hundreds or even millions of users who expect nearly constant access to the cubes. To prepare for these situations, you must tune the dimension and cube processing settings.

BIDS SSAS database projects contain one or more cubes. Each cube references one or more dimensions. Dimensions are often shared by one or more cubes in the same project. The type of object or objects you choose to process—that is, all objects in a particular solution, a particular cube, all cubes, a particular dimension, all dimensions, and so on—will vary in production environments depending on your business requirements. The most typical scenario we encounter is some sort of cube process (usually an update that adds only new records) that is run on a nightly basis. This type of cube update automatically includes all dimensions associated with that cube (or those cubes). Because the available options for processing individual dimensions are similar to those available for a cube—for example, Process Full or Process Data—we'll focus in this chapter on cube processing options. Be aware that most of the options we present are available for processing one or more dimensions as well. The most common business reason we've encountered for processing single dimensions separately from entire cubes is size—that is, the number of rows in source tables and the frequency of updates in dimension tables. An example of this is near real-time reporting, by customer, of shipping activity for a major global shipping company.

Your choices for deploying and processing changes depend on how you're working in BIDS. If you're working in a disconnected environment with a new cube or dimension object, you have the following choices for the solution: build, rebuild, deploy, or process. If you're working with a connected object, you need only to process cubes or dimensions. In the latter case, there is no need to build or deploy because the object or objects you're working on have already been built and deployed to the SSAS service at least once prior to your current work session.
Differentiating Data and Metadata

As we begin our discussion of SSAS object processing, we need to review whether the information that makes up the objects is considered data or metadata. We are reviewing this concept because it is important that you understand this difference when you are making cube or dimension processing mode choices. The simplest way to grasp this concept is as follows: the data, or rows, in the fact table are data; everything else is metadata. For most developers, it's easy to understand that the XMLA that defines the dimensions, levels, hierarchies, and cube structure is metadata. What is trickier to grasp is that the rows of information in the dimension tables are also metadata. For example, in a Customer dimension, the name of the table, the names of the columns, the values in the columns, and so on are metadata. These partially define the structure of the cube. But how big is the side (or dimension) of the cube that is defined by the Customer dimension? This can only be determined by the number of rows of data in the customer source table. For this reason, these rows are part of the metadata for the cube.

Another consideration is that in most cubes the physical size (that is, number of rows) of the fact table is larger by an order of magnitude than the size of any one dimension table. To understand this concept, think of the example of customers and the sales amount for each one. In most businesses, repeat customers cause the customer source table rows to have a one-to-many relationship with the sales (item instance) rows in the fact table. Fact tables can be huge, which is why they can be divided into physical sections called partitions. We'll explore logical and physical partitioning later in this chapter. At this point, it's important that you understand that the data (rows) in the fact table have different processing options available than the rows in the dimension tables—the latter being considered metadata by SSAS.
Working in a Disconnected Environment

We'll start by taking a look at the disconnected case. You can build or rebuild an SSAS project by right-clicking on the project in the Solution Explorer window or by selecting the Build menu on the toolbar. What does build or rebuild do? In the BIDS environment, build means that the XMLA metadata you've generated using the tools, designers, and wizards is validated against a series of rules and schemas.

There are two types of error conditions in BIDS. One is indicated by a blue squiggly line and means that a violation of the Analysis Management Objects (AMO) best practice design guidelines has occurred. AMO design guidelines are a new feature in SQL Server 2008. The other type of build error is indicated by a red squiggly line. This type of error indicates a fatal error, and your solution will not build successfully until you correct any and all of these errors. To view or change the default error conditions, right-click on the solution name and then click Edit Database. You see a list of all defined error conditions on the Warnings tab, as shown in Figure 9-1.
Figure 9-1 The Warnings tab in BIDS lists all defined design error conditions.
Blue errors are guidelines only. You can correct these errors or choose to ignore them, and your solution will still build successfully. As we noted, red errors are fatal errors. As with other types of fatal build errors (for example, those in .NET), Microsoft provides tooltips and detailed error messages to help you correct them. When you attempt to build, both types of errors are displayed in the Error List window. If you click a specific error in the error list, the particular designer in BIDS where you can fix the error will open, such as a particular dimension in the dimension designer. A successful build results when all the information can be validated against the rules and schemas. There is no compile step in BIDS.

Note The Rebuild option validates only the metadata that has changed since the last successful build. Use it when you want to more quickly validate (or build) changes you've made to an existing project.
After you’ve successfully built your project in a disconnected BIDS instance, you have two additional steps: process and deploy. The Deploy option is available only when you rightclick the solution name in Solution Explorer. If you select Deploy, the process step is run automatically with the default processing options. When you select this option, all objects (cubes, dimensions) are deployed to the server and then processed. We’ll get into details on exactly what happens during the process step shortly. Deployment progress is reflected in a window with that same name. At this point, we’ll summarize the deploy step by saying that all metadata is sent to the server and data is prepared (processed) and then loaded into a particular SSAS instance. This action results in objects—that is, dimensions and cubes—being created, and then those objects are loaded with data. The name and location of the target SSAS server and database instance is configured in the project’s property pages. Note the order in which the objects are processed and loaded—all dimensions are loaded first. After all dimensions are successfully loaded, the measure group information is loaded. You should also note that each measure group is loaded separately. It’s also important to see that in the Deployment Progress window each step includes a start and end time. In addition to viewing this information in the Deployment Progress window, you can also save this deployment information to a log file for future analysis. You can capture the processing information to a log file by enabling the Log\Flight Recorder\Enabled option in SSMS under the Analysis Server properties. This option captures activities, except queries, on SSAS. To enable query activity logging, you can either enable a SQL Server Profiler trace or enable Query Logging on SSAS by changing the Log\QueryLog server properties using SSMS. It is only after you’ve successfully built, processed, and deployed your cube that you can browse it in the BIDS cube browser. Of course, end users will access the cube via various client tools, either Microsoft Office Excel or a custom client. Subsequent processing will overwrite objects on the server. Understand that, during processing, dimensions or entire cubes can be unavailable for end user browsing. The different process modes—such as full, incremental, and so on—determine this. From a high level, the Full Process option causes the underlying object (cube or dimension) to be unavailable for browsing, because this type of process is re-creating the object structure, along with repopulating its data. Because of this limitation, there are many ways for you to tune dimension and cube processing. We will examine the varying levels of granularity available for object processing in more detail as we progress through this chapter. During development, it’s typical to simply deploy prototype and pilot cubes quickly using the default settings. As mentioned, this type of activity usually completes within a couple of minutes at most. During default processing, objects are overwritten and the cube is not available for browsing. As mentioned, during production, you’ll want to have much finer control over SSAS object processing and deployment. To help you understand your configuration options, we’ll first take a closer look at what SSAS considers metadata and what it considers to be data. The reason for this is that each of these items has its own set of processing and deployment settings.
Metadata and data have separate, configurable processing settings. Before we leave the subject of metadata and data, we have to consider one additional possible type of data that we might have to contend with that determines how to process our cube. This last type of data is called aggregations.
Working in a Connected Environment

As mentioned, when working with SSAS databases in BIDS in connected mode, the Build, Rebuild, and Deploy options are not available. Your only choice is to process the changes that you've made to the XMLA (metadata). Why is this? You'll recall that the Build command causes SSAS to validate the information you've updated in BIDS against the internal schema to make sure that all updates are valid. When you're working in connected mode, this validation occurs in real time. That is, if you attempt to make an invalid change to the object structure, either a blue squiggly line (warning) or a red squiggly line (error) appears at the location of the invalid update after you save the file or files involved. There is no deploy step available when you are working in connected mode because you're working with the live metadata files. Obviously, when working in this mode it's quite important to refrain from making breaking changes to live objects. We use connected mode only on development servers for this reason. The process step is still available in connected mode, because if you want to make a change to the data rather than the metadata, you must elect to process that data by executing a Process command (for the cube or for the dimension). We'll be taking a closer look at the available processing mode options later in this chapter.
Understanding Aggregations

As we begin to understand the implications of cube processing options, let's explore a bit more about the concept of an OLAP aggregation. What exactly is an aggregation? It's a preaggregated (usually summed) stored value. Remember that data is loaded into the cube from the rows in the fact table. These rows are loaded into the source fact table from the various source systems at the level of granularity (or detail) defined in the grain statements. For example, it's typical to load the fact rows at some time granularity. For some clients, we've loaded at the day level—that is, sales per day; for others, we've loaded at the minute level.

In some ways, an OLAP aggregation is similar to an index on a calculated column of a relational table—that is, the index causes the results of the calculations to be stored on disk, rather than the engine having to calculate them each time they are requested. The difference, of course, is that OLAP cubes are multidimensional. So another way to think of an aggregation is as a stored, saved intersection of aggregated fact table values. For the purposes of processing, aggregations are considered data (rather than metadata). So, the data in a cube includes the source fact table rows and any defined aggregations.
Another aspect of aggregations is that the SSAS query engine can use intermediate aggregations to process queries. For example, suppose that sales data is loaded into the cube from the fact table at the day level, and you can add aggregations at the month level to sum the sales amounts. A query to the year level could then use the month-level aggregations rather than having to retrieve the row-level (daily) data from the cube. Because of this, including appropriate aggregations is extremely important to optimal production query performance. Of course, aggregations are not free. Adding aggregations requires more storage space on disk and increases cube processing time. In fact, creating too many aggregations (which is known as overaggregation) is a common mistake that we see. Overaggregation results in long cube processing times and higher disk storage requirements without producing noticeable query response performance gains. Because of the importance of appropriate aggregation, Microsoft has added a new tab to the cube designer in SQL Server 2008 to help you to view and refine aggregations. This tab is shown in Figure 9-2.
Figure 9-2 The BIDS cube designer includes a new Aggregations tab.
Figure 9-2 shows the default aggregation design for the Adventure Works cube. By default, no aggregations are designed when you create a new BI project. You might be wondering why. The reason is that the SSAS query engine is highly optimized and the underlying cube structure is designed for fast querying, so it can be acceptable to deploy production cubes with no aggregations. You'll remember that cube processing times increase when you add aggregations, so the default of 0 (zero) aggregations keeps cube processing time as fast as it can possibly be. For some smaller clients, we have found that this default setting results in acceptable query performance with very fast cube processing times. Of course, there is data latency in this scenario because new data is introduced into the cube only when it is processed. We've seen many clients who work with the default processing settings because they are simple and nearly automatic (that is, no custom configuration is required), and because the data update cycle fits with their business requirements. Typically, the cube is updated with new data once nightly when no one is connecting to it because the business runs only during a working day (that is, from 9 A.M. to 5 P.M.).
Even though small-sized cubes can work with no aggregations, we find that the majority of our clients choose to add some amount of aggregation to their cubes. Some of these scenarios include the following:
■■ Medium-sized cubes—20 GB or more—with complex designs, complex MDX queries, or a large number of users executing queries simultaneously
■■ Very large, or even huge, cubes—defined as 100 GB or more
■■ Demand for cube availability 24 hours a day, 7 days a week—leaving very small maintenance windows
■■ Requirement for minimal data latency—in hours, minutes, or even seconds
For more conceptual information about OLAP aggregations, see the SQL Server Books Online topic "Aggregations and Aggregation Designs." Closely related to the topic of aggregations is that of partitions. You'll need to understand the concepts and implementation of cube partitions before exploring the mechanics of aggregation design, so we'll take a closer look at partitioning next and then return to creating, implementing, and tuning custom aggregation designs later in this chapter (in the "Implementing Aggregations" section).
Partitioning

A partition is a logical and physical segmentation of the data in a cube. Recall that OLAP data is defined as detail rows and any associated aggregations for data retrieved from a particular source fact table. Partitions are created to make management and maintenance of cube data easier. Importantly, partitions can be processed and queried in parallel. Here are some more specific examples:
■■ Scalability  Partitions can be located in different physical locations, even on different physical servers.
■■ Availability  Configuring processing settings on a per-partition basis can speed up cube processing. Cubes can be unavailable for browsing during processing, depending on the type of processing that is being done. (We'll say more about this topic later in this chapter.)
■■ Reducing storage space  Each partition can have a unique aggregation design. You'll recall that aggregations are stored on disk, so you might choose one aggregation design for a partition to minimize disk usage, and a different design on another partition to maximize performance.
■■ Simplifying backup  In many scenarios, only new data is loaded (that is, no changes are allowed to existing data). In this situation, often the partition with the latest data is backed up more frequently than partitions containing historical data.
If you look at the disconnected example of the Adventure Works cube designer Partitions tab shown in Figure 9-3, you can see that we have one or more partitions for each measure group included in the cube. By default, a single partition is created for each measure group. Each of the sample cube partitions is assigned a storage type of MOLAP (for multidimensional OLAP storage, which we’ll cover next) and has zero associated aggregations. We’ll work with the Internet Sales measure group to examine the existing four partitions for this measure group in this sample cube.
Figure 9-3 The Partitions tab shows existing OLAP cube partitions
To examine a partition for a cube, you'll need to click the Source column on the Partitions tab of the cube designer. Doing this reveals a build (…) button. When you click it, the Partition Source dialog box opens. There, you see that the binding type is set to Query Binding. If you wanted to create a new partition, you could bind to a query or to a table. Reviewing the value of the first partition's query binding, you see that the query splits the source data by using a WHERE clause in the Transact-SQL query. This is shown in Figure 9-4.

Partitions are defined in a language that is understood by the source system. For example, if you're using SQL Server as the source for fact table data, the partition query is written in Transact-SQL. The most common method of slicing or partitioning source fact tables is by filtering (using a WHERE condition) on a time-based column. We usually partition on a week or month value. This results in either 52 partitions (for weeks) or 12 partitions (for months) for each year's worth of data stored in the fact table. Of course, you can use any partitioning scheme that suits your business needs. After you've split an original partition, you can define additional new partitions. You can easily see which partitions are defined on a query (rather than on an entire table) on the Partitions tab of BIDS. The source value shows the query used rather than the table name, as you can see in Figure 9-5.
Figure 9-4 Transact-SQL query with the WHERE clause
Figure 9-5 The Source column of a partition reflects the query upon which its definition is based.
To define additional partitions, click the New Partition link under the particular partition that you want to split. Doing this will start the Partition Wizard. In the first step of the Partition Wizard, you are asked to select the source measure group and source table on which to define the partition. Next you enter the query to restrict the source rows. Note Verify each query you use to define a partition. Although the wizard validates syntax, it does not check for duplicate data. If your query does not cleanly split the source data, you run the risk of loading your cube with duplicate data from the source fact table.
In the next step, you select the processing and physical location for the partition. The available options are shown in Figure 9-6. In the Processing Location section, you can choose the Current Server Instance option (the default) or the Remote Analysis Services Data Source option. The Storage Location section enables you to choose where the partition will be stored. You can select either the Default Server Location option (and specify a default location in the accompanying text box) or the Specified Folder option (and, again, specify the location in the accompanying text box). Any nondefault folder has to be set up as an approved storage folder on the server ahead of time.
Figure 9-6 When defining OLAP partitions, you can change the physical storage location of the new partition.
After you’ve defined the partitioning query to restrict the rows and you’ve picked the processing and storage locations, in this same wizard you can then design aggregation schemes for the new partition. As mentioned, we’ll return to the subject of custom aggregation design later in this chapter. You can also optionally process and deploy the new partition at the completion of the wizard. The next step in the custom processing configuration options can be one of two. You can begin to work with the three possible storage modes for data and metadata: MOLAP, HOLAP, or ROLAP. Or you can design aggregations. We’ll select the former as the next phase of our tour and move on to an explanation of physical storage mode options.
Choosing Storage Modes: MOLAP, HOLAP, and ROLAP
The next step in cube processing customization is to define the storage method for a particular partition. You'll recall that, by default, each fact table creates exactly one partition with a storage type of MOLAP. There are three possible types of storage for a particular partition: MOLAP, HOLAP, and ROLAP.
MOLAP
In multidimensional OLAP (MOLAP), a copy of the fact table rows (or facts) is stored in a format native to SSAS. MOLAP is not a one-for-one storage option. Because of the efficient storage mechanisms used for cubes, storage requirements are approximately 10 to 20 percent of the size of the original data. For example, if you have 1 GB in your fact table, plan for around 200 MB of storage on SSAS. Regardless of the high level of efficiency when using MOLAP, be aware that you are choosing to make a copy of all source data. In addition, any and all aggregations that you design are stored in the native SSAS format. The more aggregations that are designed, the greater the processing time for the partition and the more physical storage space is needed. We occasionally use storage options other than MOLAP specifically to reduce partition processing times.

MOLAP is the default storage option because it results in the fastest query performance. The reason for this is that the SSAS query engine is optimized to read data from a multidimensional source store. We find that it's typical to use MOLAP for the majority of the partitions in a cube, and we usually add some aggregations to each partition. We'll discuss adding aggregations in the next section of this chapter.
HOLAP
Hybrid OLAP (HOLAP) does not make a copy of the source fact table rows in SSAS. It reads this information from the star schema source. Any aggregations that you add are stored in the native SSAS format in the storage location defined for the SSAS instance. This storage mode results in a reduction of the storage space needed. This option is often used for partitions that contain infrequently queried historical data. Because aggregations are stored in the native format and result in fast query response times, it's typical to design a slightly larger number of aggregations in this scenario.
ROLAP
Relational OLAP (ROLAP) does not make a copy of the facts on SSAS. It reads this information from the star schema source. Any aggregations that are designed are written back to tables on the same star schema source system. Query performance is significantly slower than that of partitions using MOLAP or HOLAP; however, particular business scenarios can be well
served by using ROLAP partitions. We'll discuss these situations in more detail later in this chapter. These include the following:
■■ Huge amounts of source data, such as cubes that are many TBs in size
■■ Need for near real-time data—for example, latency in seconds
■■ Need for near 100 percent cube availability—for example, downtime because of processing limited to minutes or seconds
Figure 9-7 shows a conceptual view of ROLAP aggregations. Note that additional tables are created in the OLTP source RDBMS. Note also that the column names reflect the positions in the hierarchy that are being aggregated and the type of aggregation performed. Also, it is interesting to consider that these aggregation tables can be queried in Transact-SQL, rather than MDX, because they are stored in a relational format.
Figure 9-7 Conceptual view of ROLAP aggregations, shown in aggTable1
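Because ROLAP aggregations live in ordinary relational tables, you can inspect them with a plain Transact-SQL query. The table and column names below are purely hypothetical (SSAS generates its own names when it writes aggregations back to the source), so treat this only as an illustration of the idea, not as the actual schema.

-- Hypothetical ROLAP aggregation table; SSAS generates the real names.
SELECT CalendarYear, ProductCategory, SalesAmount
FROM dbo.aggTable1
ORDER BY CalendarYear, ProductCategory;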
OLTP Table Partitioning
If your star schema source data is stored in the Enterprise edition of SQL Server 2005 or later, you can use relational table partitioning on your star schema source tables.

Note RDBMS systems other than SQL Server also support relational table partitioning. If your source system or systems support this type of partitioning, you should consider implementing this feature as part of your BI project for easier maintainability and management.
An OLTP table partitioning strategy can complement any partitioning you choose using SSAS (that is, cube partitions), or you can choose to partition only on the relational side. You'll need to decide which type (or types) of partitioning is appropriate for your BI solution.

Table partitioning is the ability to place data from the same table on different physical locations (disks) while having that data continue to appear to originate from the same logical table from the end user's perspective. This simplifies management of very large databases (VLDBs)—in particular, management of very large tables. The large tables we're concerned about here are, of course, fact tables. It's not uncommon for fact tables to contain millions or tens of millions of rows; in fact, support for especially huge (over four billion rows) fact tables is one of the reasons to use the BIGINT data type for fact table keys. Relational table partitioning can simplify administrative tasks and general management of these often large, or even huge, data sources. For example, backups can be performed much more efficiently on table partitions than on entire (huge) fact tables.

Although relational table partitioning is relatively simple, several steps are involved in implementing it. Here's a conceptual overview of the technique (a Transact-SQL sketch of the final steps follows this list):

1. Identify the tables that are the best candidates for partitioning. For OLAP projects, as mentioned, this will generally be the fact tables.
2. Identify the value (or column) to be used for partitioning. This is usually a date field. A constraint must be implemented on this column of the tables that will participate in partitioning.
3. Implement the physical architecture needed to support partitioning—that is, install the physical disks.
4. Create filegroups in the database for each of the new physical disks or arrays.
5. Create .ndf files (secondary database files) for the SQL Server 2005 (or later) database that contains the tables to be partitioned, and associate these .ndf files with the filegroups you created in step 4.
6. Create a partition function. Doing this creates the buckets into which the sections of the table will be distributed. The sections are most often created by date range—that is, from xxx to yyy date, usually monthly or annually.
7. Create a partition scheme. Doing this associates the buckets you created previously with a list of filegroups, one filegroup for each time period, such as month or year.
8. Create the table (usually the fact table) on the partition scheme that you created earlier. Doing this splits the table into the buckets you've created.

Note If your source data is stored in the Enterprise edition of SQL Server 2008 and on a server with multiple CPUs, you can also take advantage of enhancements in parallel processing of fact table partitions that are the result of changes in the query optimizer.
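Here is a minimal Transact-SQL sketch of steps 6 through 8, assuming the filegroups from steps 4 and 5 have already been created and using illustrative names throughout (the partition function, scheme, filegroup, table, and column names are not from any particular sample database).

-- Step 6: boundary values define monthly buckets (RANGE RIGHT puts each
-- boundary date into the partition to its right).
CREATE PARTITION FUNCTION pfOrderDateMonthly (datetime)
AS RANGE RIGHT FOR VALUES ('20080101', '20080201', '20080301');

-- Step 7: map the buckets to filegroups, one per period (one extra filegroup
-- is needed for rows earlier than the first boundary).
CREATE PARTITION SCHEME psOrderDateMonthly
AS PARTITION pfOrderDateMonthly
TO (FGHistory, FG200801, FG200802, FG200803);

-- Step 8: create the fact table on the partition scheme.
CREATE TABLE dbo.FactInternetSales
(
    SalesKey    bigint   NOT NULL,
    OrderDate   datetime NOT NULL,
    ProductKey  int      NOT NULL,
    SalesAmount money    NOT NULL
)
ON psOrderDateMonthly (OrderDate);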
Other OLAP Partition Configurations
One other consideration in the world of partitions returns us to SSAS cube partitions. Here it is possible to define cube partitions as local (the default) or remote. You define this location when you create the partition using the Partition Wizard shown previously in Figure 9-6, or you can manually configure the partition properties after you create it by using its Properties dialog box.

The primary reason to consider using remote partitions is to do a kind of load balancing in the SSAS environment. You use remote partitions to implement load balancing in situations where your primary SSAS server is stressed, usually because a large number of users are executing complex queries. By using remote partitions, you can split the processing work across multiple physical servers. There are also other things you must consider when using remote partitions. Remote partitions can use MOLAP, HOLAP, or ROLAP storage, and they store some information on both the local server and the remote servers. If you're using remote MOLAP, data and aggregations for the remote partition are stored on the remote server. If you're using remote HOLAP, aggregations for the remote partition are stored on the remote server while data is read from the OLTP source. If you're using remote ROLAP, nothing is stored on the remote server; both data and aggregations are read from the OLTP source.

Before we review the details of how and why to change a partition from the default storage mode of MOLAP to HOLAP or ROLAP, let's return to the topic of aggregations. A key driver of storage mode change is processing time, which is affected by both the quantity of source fact table rows and the quantity of defined aggregations. We find there is a need to balance query response performance against processing time. Most clients prefer the fastest query response time, even if that means processing introduces a bit of latency—in other words, MOLAP with some added aggregations. Now that you understand storage modes, we'll return to aggregations. In addition to determining how source data is stored, the storage mode determines the storage location and type for any aggregations associated with a partition.
Implementing Aggregations
Why is it so important to add the correct amount of aggregations to your cube's partitions? As stated previously, it might not be. Keep in mind that some SSAS cubes do not require aggregations to function acceptably. Similar to the idea that small OLTP databases need no relational indexes, if your cube is quite small (under 5 GB) and you have a small number of end users (100 or fewer), you might not have to add any aggregations at all. Also, adding aggregations to a particular partition is done for the same reason that you add indexes to an RDBMS—to speed up query response times. The process of adding these aggregations is only marginally similar to the process of adding indexes to an RDBMS, however. BIDS
includes tools, wizards, and a new Aggregations tab in the cube designer to give you power and control over the aggregation design process. However, we find that many customers who have relational database backgrounds are quite confused by the aggregation design process. Here are some key points to remember:
■■ The core reason to add aggregations to a partition is to improve query response time. Although you could tune the MDX source query, you're more likely to add aggregations. The reason is that the cost of aggregations is relatively small because they're usually quick and easy to add, particularly if you're using MOLAP storage.
■■ Do not overaggregate. For MOLAP, 20 to 30 percent aggregation is usually sufficient. Heavy aggregation—for example, over 50 percent—can result in unacceptably long partition processing times. Remember that the SSAS query engine makes use of intermediate aggregations to get query results. For example, for facts loaded at the day level, aggregated at the month level, and queried at the year level, the month-level aggregations are used to answer the query request.
■■ Use the aggregation tools and wizards prior to manually adding aggregations. If source MDX queries are not improved by adding aggregations recommended by the tools, consider rewriting the MDX queries prior to adding aggregations manually.
■■ Consider the following facts when adding aggregations to a cube: aggregations increase cube processing times, and aggregations increase the storage space required for the cube on disk.
■■ The storage type affects the amount of aggregations you'll add. You'll need to add the smallest percentage of aggregations for MOLAP storage because the source data is available to the SSAS query engine in the native multidimensional format. HOLAP storage usually requires the largest percentage of aggregations. This is done to preclude the need for the SSAS query engine to retrieve data from the RDBMS source system.
Note The new AMO design warnings tool generates warnings when you attempt to build an OLAP cube that includes nonstandard aggregation designs.

In the next few sections, we look closely at the following wizards and tools that help you design appropriate aggregations: the Aggregation Design Wizard, the Usage-Based Optimization Wizard, SQL Server Profiler, and the Advanced view of the aggregations designer.
Aggregation Design Wizard
The Aggregation Design Wizard is available in BIDS (and SQL Server Management Studio). You access the wizard by clicking the measure group that contains the partition (or partitions) you want to work with and then clicking the Design Aggregations button on the toolbar on the Aggregations tab in BIDS. This is shown in Figure 9-8. Doing this opens the
Aggregation Design Wizard, which asks you to select one or more partitions from those defined in the measure group that you originally selected.
Figure 9-8 The new Aggregations tab in BIDS allows you to define aggregation schemes for your OLAP cube partitions.
In the next step of the wizard, you review the default assigned settings for aggregations for each dimension’s attributes. You have four options: Default, Full, None, and Unrestricted. We recommend leaving this setting at Default unless you have a specific business reason to change it. An example of a business reason is a dimensional attribute that is browsed rarely by a small number of users. In this case, you might choose the None option for a particular attribute. In particular, we caution you to avoid the Full setting. In fact, if you select it, a warning dialog box is displayed that cautions you about the potential overhead incurred by using this setting. If you choose Default, a default rule is applied when you further configure the aggregation designer. This step, which you complete in the Review Aggregation Usage page of the Aggregation Design Wizard, is shown in Figure 9-9 for the Internet Sales partition from the measure group with the same name.
Figure 9-9 The Aggregation Design Wizard allows you to configure the level of desired aggregation for individual dimensional attributes.
In the next step of this wizard, you have to either enter the number of rows in the partition or click the Count button to have SSAS count the rows. You need to do this because the default aggregation rule uses the number of rows as one of the variables in calculating the suggested aggregations. You are provided the option of manually entering the number of rows so that you can avoid generating the count query and thereby reduce the overhead on the source database server. After you complete this step and click Next, the wizard presents you with options for designing the aggregations for the selected partition. You can choose one of the following four options:
■■ Estimated Storage Reaches With this option, you fill in a number, in MB or GB, and SSAS designs aggregations that require up to that size limit for storage on disk.
■■ Performance Gain Reaches With this option, you fill in a percentage increase in query performance speed and SSAS designs aggregations until that threshold is met.
■■ I Click Stop When you choose this option, the wizard stops adding aggregations when you click Stop.
■■ Do Not Design Aggregations If you select this option, the wizard does not design any aggregations.
After you select an option and click Start, the wizard presents you with an active chart that shows you the number of aggregations and the storage space needed for those aggregations. If you select the Performance Gain Reaches option on the Set Aggregation Options page, a good starting value for you to choose is 20 percent. As discussed earlier, you should refrain from overaggregating the SSAS cube. Overaggregating is defined as having more than 50 percent aggregations. You can see the results of a sample execution in Figure 9-10. For the sample, we selected the first option, Estimated Storage Reaches. Note that the results tell you the number of aggregations and the storage space needed for these aggregations. You can click Stop to halt the aggregation design process. After the wizard completes its recommendations and you click Next, you can choose whether to process the partition using those recommendations immediately or save the results for later processing. This is an important choice because the aggregations that you’ve designed won’t be created until the cube is processed. You can also apply the newly created aggregation design to other partitions in the cube. You do this by clicking the Assign Aggregation Design button (the third button from the left) on the toolbar on the Aggregations tab in BIDS.
Figure 9-10 The Aggregation Design Wizard allows you to choose from four options to design aggregations.
Usage-Based Optimization Wizard
The Usage-Based Optimization Wizard works by saving actual queries sent to the Analysis Services database. The saved queries are based on parameter values that you specify, such as start and end time, user name, and so on. The wizard then uses an algorithm to figure out which aggregations will best improve the performance of the queries that are run and that fall within the configured parameters. Because query performance is determined as much by the selected (and filtered) queries coming from your client applications as it is by the data, using the Usage-Based Optimization Wizard effectively is an intelligent approach. By using this tool, you are causing SSAS to create aggregations specifically for the particular queries, rather than just using the blanket approach that the Aggregation Design Wizard uses.

There are three SSAS properties you must configure prior to running the wizard. The first is called QueryLogConnectionString. You set this value to the connection string of the database where you'd like to store the query log table. The data stored in this table will be retrieved by the Usage-Based Optimization Wizard. (This process is similar to the use of a trace table by the Database Engine Tuning Advisor for OLTP relational index optimization.) To set this property, open SQL Server Management Studio (SSMS), right-click the SSAS instance, and then select Properties. On the General page, locate the property in the list labeled Log\QueryLog\QueryLogConnectionString. Click the build (...) button in the value column for this property, and specify a connection string to a database. If you use Windows Authentication for this connection, be aware that the connection will be made under the credentials of the SSAS service account.
The second property is the CreateQueryLogTable property. Set this to True to have SSAS create a table that logs queries. This table will be used to provide queries to the wizard. (This process is similar to using a trace table to provide queries to SQL Server Profiler for relational database query tuning.) You can optionally change the default name of the query log table in the database you previously defined. This value is set to OlapQueryLog by default and can be changed by setting the QueryLogTableName property.

The third property to set is QueryLogSampling. The default is to capture only 1 out of every 10 queries. You'll probably want to set this to 1 so that every query within the defined parameter set is captured. However, just as with SQL Server Profiler, capturing these queries incurs some overhead on the server, so be cautious about sampling every query on your production servers. You can configure all of these properties by using the Properties window for SSAS inside SSMS.

You can run the wizard by connecting to SSAS in SSMS, right-clicking a cube partition in Object Explorer, and then selecting Usage Based Optimization. It can also be run from the Aggregations tab in the cube designer in BIDS. After you start the wizard, select which partitions you want to work with, and then ask SSAS to design aggregations based on any combination of the following parameter values: beginning date, ending date, specific user or users, and quantity of frequent queries by percentage of total. After you've configured these settings, the Usage-Based Optimization Wizard presents you with a sample list of queries to select from. You then select which of these queries you'd like SSAS to design aggregations for. As you complete the configuration and run the wizard, it produces a list of suggested aggregations. These can be implemented immediately or saved as a script for later execution.
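Because the query log is just a relational table, you can also browse it yourself before (or instead of) running the wizard, for example to see which users and objects generate the longest-running queries. The query below assumes the default table name (OlapQueryLog) and the columns SSAS creates for it by default; verify the actual column names in your own query log database.

-- Longest captured queries first.
SELECT TOP (50)
    MSOLAP_User,
    MSOLAP_ObjectPath,
    Duration,
    StartTime
FROM dbo.OlapQueryLog
ORDER BY Duration DESC;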
SQL Server Profiler
Another interesting tool you can use to help you design aggregations more intelligently is SQL Server Profiler. In Chapter 4, "Physical Architecture in Business Intelligence Solutions," we looked at SQL Server Profiler's ability to capture activity on the SSAS server. There we saw that you could configure the capture, which is called a trace, to collect only specific types of information, and that you can filter for MDX queries and other such items.

In the context of aggregation design, SQL Server Profiler has a couple of uses. The first is to help you go beyond the results of the Usage-Based Optimization Wizard by creating a trace that captures MDX queries and then filtering the results to find the problematic (that is, long-running) queries. To improve the response times for such queries, you can either attempt to rewrite the MDX statement (or expression) or design specific aggregations. In most situations, adding aggregations produces improved query results with less effort than rewriting the MDX query.
If you select Show All Events on the Events Selection tab of the Trace Properties dialog box, you’ll see that you have a number of advanced capture options in the Query Processing area. These are shown in Figure 9-11. Note that included in these options is Get Data From Aggregation. Selecting this option is a very granular way for you to verify that particular MDX queries are using particular aggregations.
Figure 9-11 Selecting the Show All Events view displays aggregation-specific events.
Note As with RDBMS query tuning, OLAP query tuning is a time-consuming process. We've presented the information in this chapter in the order in which you should use it—that is, use the wizards and tools prior to undertaking manual query tuning. Also, we caution that query tuning can only enhance performance—it cannot compensate for poor (or nonexistent) star schema source design. In most production situations, we don't need to use any of the advanced procedures we discuss here because performance is acceptable without any of these additional processes.

In SQL Server 2008, several enhancements to speed up query execution have been introduced. Some of these enhancements are completely automatic if your queries use the feature—such as more efficient subspace calculations. In other words, SSAS 2008 divides the space to separate calculated members, regular members, and empty space. Then it can better evaluate cells that need to be included in calculations. Other types of internal optimizations require meeting a certain set of prerequisites.
After you’ve identified problematic queries and possible aggregations to be added to your OLAP cube’s partitions, how do you do create these specific aggregations? This is best done via a new capability on the BIDS cube designer Aggregations tab.
Aggregation Designer: Advanced View
To access the capability to manually create either aggregation designs (collections of aggregations) or individual aggregations, you switch to the advanced view in the aggregations designer in BIDS by clicking the Advanced View button (the fifth button from the left) on the toolbar. After selecting a measure group and either creating a new aggregation design or selecting an existing one, you can create aggregations one by one by clicking the New Aggregation button on the same toolbar. Figure 9-12 shows the toolbar of this designer in the advanced view.
Figure 9-12 The advanced view of the BIDS cube aggregations designer allows you to create individual aggregations.
In this view, you’ll create aggregation designs. These are groups of individual aggregations. You can also add, copy, or delete individual aggregations. After you’ve grouped your newly created aggregations into aggregation designs, you can then assign these named aggregation designs to one or more cube partitions. The advanced view provides you with a very granular view of the aggregations that you have designed for particular attributes. We will use these advanced options as part of query tuning for expensive and frequently accessed queries. In addition to designing aggregations at the attribute level, you also have the option of configuring four advanced properties for each attribute. The advanced designer and the Properties sheet for the selected attribute are where you would make these configurations.
Although this advanced aggregation designer is really powerful and flexible, we again remind you that, based on our experience, only advanced developers will use it to improve the performance of a few queries by adding aggregations to affected dimensional attributes.
Implementing Advanced Storage with MOLAP, HOLAP, or ROLAP
When you want to modify the type of storage for a particular cube partition, simply click that partition on the Partitions tab in the BIDS cube designer. (You can also create new partitions for an existing cube by using this tab.) Then click the Storage Settings link below the selected partition. This opens a dialog box that allows you to adjust the storage setting either by sliding the slider bar or by setting custom storage options, which you access by clicking the Options button. Figure 9-13 shows the default setting, MOLAP. As we've discussed, for many partitions the default of MOLAP results in the best query performance with acceptable cube availability (because of appropriate partition processing times). You should change this option only if you have a particular business reason (as discussed earlier in this chapter) for doing so.
Figure 9-13 You configure storage options for each partition in BIDS by using the Measure Group Storage Settings dialog box.
Although the slider provides a brief explanation of the other settings, you probably need a more complete explanation to effectively select something other than the default. Note that the proactive caching feature is enabled for all storage modes other than the default (simple MOLAP). We'll cover proactive caching in the next section of this chapter. Here's an explanation of the impact of each setting in the Measure Group Storage Settings dialog box:
■■ MOLAP (default) Source data (fact table rows) is copied from the star schema to the SSAS instance as MOLAP data. Source metadata (which includes cube and dimension structure and dimension data) and aggregations are copied (for dimension data) or generated (for all other metadata and aggregations). The results are stored in MOLAP format on SSAS, and proactive caching is not used.
■■ MOLAP (nondefault) Source data is copied. Metadata and aggregations are stored in MOLAP format on SSAS. Proactive caching is enabled. This includes scheduled, automatic, and medium- and low-latency MOLAP.
■■ HOLAP Source data is not copied, metadata and aggregations are stored in MOLAP format on SSAS, and proactive caching is enabled.
■■ ROLAP This option is for cubes. Source data is not copied. Metadata is stored in MOLAP format on SSAS. Aggregations are stored in the star schema database. For dimensions, metadata is not copied; it is simply read from the star schema database table or tables. Proactive caching is enabled.
Because proactive caching is invoked by default (although you can turn it off) for all changes (from the default to some other setting) to partition storage settings, we’ll take a closer look at exactly what proactive caching is and how it can work in your BI project.
Proactive Caching Wouldn’t it be terrific if your BI solution allowed end users to access data with all the query response speed and flexibility of SSAS, yet also allowed them to use a solution that didn’t require the typical latency (often one business day) between the OLTP source data and OLAP data? That’s the concept behind proactive caching. Think of configuring proactive caching as the method by which you manage the MOLAP cache. What is the MOLAP cache? It’s an in-memory storage location created automatically by SSAS. The cache includes actual data and, sometimes, aggregations. This information is placed in the cache area after MDX queries are executed against the SSAS cubes. Figure 9-14 shows a conceptual rendering of the MOLAP cache.
Figure 9-14 Proactive caching settings allow you to manage the update/rebuild frequency of the MOLAP cache.
Note One of the reasons queries to the SSAS store are so much faster than queries to RDBMS query engines is that the former uses a MOLAP structure and cache. When you are considering configuring manual settings to manage cache refreshes, you can use the MSAS 2008:Cache object in the Microsoft Windows Performance Monitor to measure which queries are being answered by cache hits and which are being answered by disk calls.

Because every businessperson will tell you that it's preferable to have minimal data latency, why wouldn't you use proactive caching in every BI solution? Proactive caching occurs in near real time, but not exactly real time. And, importantly, the nearer you configure the MOLAP cache refreshes to real time, the more overhead you add to both the SSAS and OLTP source systems. These considerations are why SSAS has six options for you to choose from when configuring proactive caching using the Measure Group Storage Settings dialog box (using the slider). In addition to the slider configuration tool, you have still more finely grained control available in the Custom Options dialog box, accessed by clicking the Options button in the Measure Group Storage Settings dialog box. Or you can gain even more control by manually configuring the individual partition property values by using the Properties dialog box. Of course, if you want to configure different storage and caching settings for a subset of a cube, you must first define multiple partitions for the fact tables upon which the cube is based.

Note Proactive caching is not for every BI solution. Using it effectively necessitates that you either read your OLTP data directly as the source for your cube or read a replicated copy of your data. Another option is to read your OLTP data using the snapshot isolation level available in SQL Server 2005 or later. To use any of these options, your data source must be very clean. If you need to do cleansing, validation, or consolidation during extract, transform, and load (ETL) processing, proactive caching is not the best choice for your solution.
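If you go the snapshot isolation route, the database-level switch is a single statement, sketched below with an illustrative database name. (The SSAS data source must also be configured to use snapshot isolation for its reads to benefit from it.)

-- Enable snapshot isolation on the relational source (SQL Server 2005 or later).
-- The database name is illustrative.
ALTER DATABASE AdventureWorksDW SET ALLOW_SNAPSHOT_ISOLATION ON;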
Let’s start with a more complete explanation of the choices available in the Measure Group Storage Settings dialog box (shown in Figure 9-13) as they relate to proactive caching settings. The first choice you’ll make is whether to use MOLAP, HOLAP, or ROLAP data storage for your cube. In most cases, because of the superior query performance, you’ll select some version of MOLAP. The proactive caching configuration choices for MOLAP are as follows: ■■
Scheduled MOLAP When you select this setting, the MOLAP cache is updated according to a schedule (whether the source data changes or not). The default is once daily. This sets the rebuild interval to one day. This default setting is the one that we use for the majority of our projects.
■■
Automatic MOLAP When you select this setting, the cache is updated whenever the source data changes. It configures the silence interval to 10 seconds and sets a 10-minute silence override interval. We’ll say more about these advanced properties shortly.
■■
Medium-Latency MOLAP With this setting, the outdated caches are dropped periodically. (The default is a latency period of four hours.). The cache is updated when data changes. (The defaults are a silence interval of 10 seconds and a 10-minute silence override interval.)
■■
Low-Latency MOLAP With this setting, outdated caches are dropped periodically. (The default is a latency period of 30 minutes.) The cache is updated when data changes. (The defaults are a silence interval of 10 seconds and a 10-minute silence override interval.)
Tip To understand the silence interval property, ask the following question: “How long should the cache wait to refresh itself if there are no changes to the source data?” To understand the silence override interval property, ask the following question: “What is the maximum amount of time after a notification (of source data being updated) is received that the cache should wait to start rebuilding itself?”
If you select HOLAP or ROLAP, proactive caching settings are as follows:
■■ Real-Time HOLAP If you choose this setting, outdated caches are dropped immediately—that is, the latency period is configured as 0 (zero) seconds. The cache is updated when data changes. (The defaults are a silence interval of 0 (zero) seconds and no silence override interval.)
■■ Real-Time ROLAP With this setting, the cube is always in ROLAP mode, and all updates to the source data are immediately reflected in the query results. The latency period is set to 0 (zero) seconds.
As mentioned, if you’d like even finer-grained control over the proactive caching settings for a partition, click the Options button in the Measure Group Storage Settings dialog box. You
then can manually adjust the cache settings, options, and notification values. Figure 9-15 shows these options.
Figure 9-15 The Storage Options dialog box for proactive caching
Notification Settings for Proactive Caching
You can adjust the notification settings (regarding data changes in the base OLTP store) by using the Notifications tab of the Storage Options dialog box. There are three types of notifications available in this dialog box:
■■ SQL Server If you use this option with a named query (or the partition uses a query to get a slice), you need to specify tracking tables in the relational source database. If you go directly to a source table, trace events are used instead. This last option also requires that the service account for SSAS has dbo permissions on the SQL Server database that contains the tracking table.
■■ Client Initiated Just as with the SQL Server option, you need to specify tracking tables in the relational source database. This option is used when notification of changes will be sent from a client application to SSAS.
■■ Scheduled Polling If you use this option, you need to specify the polling interval time value, indicate whether you want to enable incremental updates, and add at least one polling query. Each polling query is also associated with a particular tracking table. Polling queries allow more control over the cache update process; a sketch of a typical polling query follows this list.
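A polling query is simply a statement that returns a single value SSAS can compare from one poll to the next; when the value changes, SSAS knows the underlying data has changed. The following is a minimal sketch against an illustrative fact table; a MAX over an audit column or a row count are both common choices.

-- Returns one value; a change between polls signals that new data has arrived.
SELECT MAX(LastUpdated)
FROM dbo.FactInternetSales;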
Fine-Tuning Proactive Caching
Finally, here's the most specific way to set proactive caching settings—use the Properties dialog box for a particular measure group, as shown in Figure 9-16.
Figure 9-16 Setting proactive caching properties through the Properties dialog box for a measure group
Here’s a summary of the settings available for proactive caching: ■■
AggregationStorage tions only).
■■
Enabled
■■
ForceRebuildInterval This setting is a time value. It indicates the maximum amount of time to rebuild the cache whether the source data has changed or not. The default is –1, which equals infinity.
■■
Latency This setting is a time value. It indicates the maximum amount of time to wait to rebuild the cube. The default is –1, which equals infinity.
■■
OnlineMode You can choose either Immediate or OnCacheComplete. This setting indicates whether a new cache will be available immediately or only after it has been completely rebuilt.
■■
SilenceInterval This setting is a time value. It indicates the maximum amount of time for which the source data has no transactions before the cache is rebuilt. The default is –1, which equals infinity.
■■
SilenceOverrideInterval This setting is a time value. It indicates the maximum amount of time to wait after a data change notification in the source data to rebuild the cache and override the SilenceInterval value. The default is –1, which equals infinity.
You can choose either Regular or MOLAP Only (applies to parti-
You can choose either True or False (turns proactive caching on or off).
Proactive caching is a powerful new capability that you might find invaluable in enhancing the usability of your BI solution. As we mentioned earlier, the key consideration when deciding whether to use proactive caching is the quality of your source data. It must be pristine for this feature to be practical. In the real world, we’ve yet to find a client who has met this important precondition.
284
Part II
Microsoft SQL Server 2008 Analysis Services for Developers
We turn next to another approach to cube storage and processing. This is called a ROLAP dimension.
ROLAP Dimensions
Recall that ROLAP partition-mode storage means that source data (fact table rows) is not copied to the SSAS destination. Another characteristic of ROLAP partition storage is that aggregations are written back to relational tables in the source schema. The primary reason to use ROLAP partition storage is to avoid consuming lots of disk space to store seldom-queried historical data. Queries to ROLAP partitions execute significantly more slowly because the data is in relational, rather than multidimensional, format.

With these considerations noted, ROLAP dimensions are used in a couple of situations: when dimensional metadata changes rapidly (nearly constantly), and when dimensions are huge. Huge dimensions are those that contain millions or even billions of members. An example is dimensions used by FedEx that track each customer worldwide for an indefinite time period. The storage limits for SQL Server tables (that is, the maximum number of rows) are still larger than those for SSAS dimensions. An example of rapidly changing dimension data is a dimension that contains employee information for a fast food restaurant chain. The restaurant chain might have a high employee turnover rate, as is typical in the fast food industry. However, it might be a business requirement to be able to retrieve the most current employee name from the Employee dimension at all times, with no latency. This type of requirement might lead you to choose a ROLAP dimension.

Tip Despite the fact that you might have business situations that warrant using ROLAP dimensions, we encourage you to test to make sure that your infrastructure (that is, hardware and software) will provide adequate performance given the anticipated load. Although performance in SSAS 2008 has been improved over previous versions, some of our customers still find it too slow when using ROLAP dimensions for production cubes. If you're considering this option, be sure to test with a production level of data before you deploy this configuration into a production environment.
Like so many advanced storage features, ROLAP dimensions require the Enterprise edition of SSAS. Because you typically use this feature only for dimensions with millions of members, the dimensional attribute values are not copied to and stored on SSAS. Rather, they are retrieved directly from the relational source table or tables. To set a dimension as a ROLAP dimension, open the dimension editor in BIDS, and in the Properties window for that dimension change the StorageMode property from the default MOLAP to ROLAP. As mentioned in the introduction to this section, although ROLAP dimensions increase the flexibility of your cube, we've not seen them used frequently in production BI solutions. The
reason is that any queries to the relational source will always be significantly slower than queries to MOLAP data or metadata.
Linking
A couple of other configuration options and features you might consider as you're preparing your cubes for processing are linked objects and writeback. We'll also review error-handling settings (in the upcoming "Cube and Dimension Processing Options" section) because they are important to configure according to your business requirements, and their configuration values affect cube processing times. Let's start with linked objects.

Linked objects are SSAS objects—for example, measure groups or dimensions from a different SSAS database (Analysis Services 2008 or 2005)—that you want to associate with the SSAS database you are currently working on. Linked objects can also include KPIs, actions, and calculations. The linked objects option can be used to overcome the SSAS 2008 limit of basing a cube on a single data source view. It also gives you a kind of scalability because you can use multiple servers to supply data for queries.

The ability to use linked objects in SSAS is disabled by default. If you want to use this option, you need to enable it by connecting to SSAS in SSMS, right-clicking the SSAS server instance, selecting Properties, and then enabling linking. The properties you need to enable are Feature\LinkToOtherInstanceEnabled and Feature\LinkFromOtherInstanceEnabled. After you've done that, you can use the Linked Object Wizard in BIDS, which you access by right-clicking the Dimensions folder in BIDS Solution Explorer and then clicking New Linked Dimension. You'll next have to select the data source from which you want to link objects. Then you'll move to the Select Objects page of the Linked Object Wizard, where you select which objects from the linked database you want to include in the current cube. If objects have duplicate names—that is, dimensions in the original SSAS database and the linked instance have the same name—the linked object names will be altered to make them unique (by adding an ordinal number, starting with 1, to the linked dimension name).

As with many other advanced configuration options, you should have a solid business reason to use linking because it adds complexity to your BI solution. Also, you should test performance during the pilot phase with production levels of data to ensure that query response times are within targeted ranges.
Writeback
Writeback is the ability to store "what if" changes to dimension or measure data in a change table (for measures) or in an original source table (for dimensions). With writeback, the delta (or change value) is stored, so if the value changes from an original value of 150 to a new value of 200, the value 50 is stored in the writeback table. If you are using the Enterprise
edition of SSAS 2008, you can enable writeback for a dimension or for a partition if certain conditions are met. To enable writeback for a dimension, you set the WriteEnabled property value of that dimension to True. You can also use the Add Business Intelligence Wizard to enable writeback for a dimension. You cannot enable writeback for a subset of a dimension—that is, for individual attributes. Writeback is an all-or-nothing option for a particular dimension.

An important consideration with writeback dimensions is to verify that your selected client applications support writeback. Also, you must confirm that allowing writeback is consistent with your project's business requirements. Another consideration is that end users who should be able to write to a dimension must be specifically granted read/write permissions on that write-enabled dimension in their SSAS security roles. In our experience, writeback is not commonly enabled for BI projects. One business case where it might be worth using is when a cube is used for financial forecasting, particularly in "what if" scenarios.

Note Writeback is not supported for dimensions of the following types: Referenced (snowflake), Fact (degenerate), Many-to-Many, or Linked. The dimension table must be a single table, directly related to the fact table. As mentioned, you can write-enable only an entire dimension; there is no mechanism to write-enable only specific attributes of a dimension.
You can also enable writeback for measure group partitions whose measures all use the SUM aggregate function. An example of this from the Adventure Works sample cube is the Sales Targets measure group, which includes one named partition called Sales_Quotas. To enable writeback on a partition, navigate to the partition on the Partitions tab of the BIDS cube designer, right-click the partition, and then choose the Writeback Settings menu option. Doing this opens the Enable Writeback dialog box, shown in Figure 9-17, where you configure the writeback table name, the data source, and the storage mode. Note that the option to store writeback information in the more efficient and faster-to-query MOLAP mode is new to SQL Server 2008. This results in a significant performance improvement over the previous version of SSAS, which allowed only ROLAP storage of writeback-enabled measure group partitions. As mentioned, if you intend to enable writeback for measure group partitions, you must enable read/write access for the entire cube rather than for a particular measure group partition in the SSAS security role interface. We recommend you verify that this is in line with your project's overall security requirements before you enable measure group writeback.
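To make the delta behavior concrete, the sketch below queries a hypothetical writeback table. The table and column names are invented for illustration only (SSAS generates its own names and layout when you enable writeback), but the key point holds: after changing a quota from 150 to 200, the stored row holds the delta of 50, not the new value.

-- Hypothetical writeback table layout, for illustration only.
SELECT SalesQuotaDelta,   -- would contain 50, not 200
       EmployeeKey,
       DateKey,
       AuditUser,
       AuditTime
FROM dbo.WriteTable_Sales_Quotas;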
Figure 9-17 The Enable Writeback dialog box allows you to configure writeback options for qualifying measure group partitions.
Cube and Dimension Processing Options
Now that we've covered storage, aggregations, partitions, and caching, we're (finally!) ready to review cube and dimension processing option types. Dimensions must be completely and correctly processed either prior to or at the beginning of a cube process. The best way to understand this is to keep in mind that dimensional data is the metadata, or the structure, of the cube itself. So the metadata must be available before the data can be loaded into the cube.

During development, you will most often do a full process on your cube whenever you need to view the results of a change that you've made. This option completely erases and rebuilds all data and metadata. For some customers, this simple method of updating the cube can be used in production as well. What is happening here, of course, is a complete overwrite on rebuild. This is practical only for the smallest cubes—those that are a couple of GBs in size at a maximum. Most of the time, you'll choose to use the more granular processing options after you move your cube to a production environment, which will result in shorter processing times and more consistent cube availability.

The first aspect of processing that you'll have to determine for production cubes is whether you'll choose to separate processing of dimensions from the cube. Our real-world experience is that about 50 percent of the time we advise clients to process dimensions before processing cubes. The choice to separate the process is usually due to dimension processing complexity or size.
In this section, we’ll review the process for processing using BIDS. In future chapters, we’ll discuss automating these cube and dimension refreshes using SSIS packages. We’ll start with cube processing options. To process a cube in BIDS, right-click on the cube name in Solution Explorer in BIDS and then select Process. You’ll see the dialog box shown in Figure 9-18.
Figure 9-18 To process a cube, right-click on the cube name in BIDS and then select the process type and other options in the Process Cube dialog box.
You can also process cubes and dimensions from SSMS by using this same process. Following is a more complete explanation of the available process options for both cubes and dimensions. Some options are available only for cubes or only for dimensions; we note that in the following list:
■■ Default With this option, SSAS detects the current state of the cube or dimension and then does whatever type of processing (that is, full or incremental) is needed to return the cube or dimension to a completely processed state.
■■ Full With this option, SSAS completely reprocesses the cube or dimension. In the case of a cube, this reprocessing includes all the objects contained within it—for example, dimensions. Full processing is required when a structural change has been made to a cube or dimension. An example of when Full processing is required for a dimension is when an attribute hierarchy is added, deleted, or renamed. The cube is not available for browsing during a Full process.
■■ Data If you select this option, SSAS processes data only and does not build any aggregations or indexes. SSAS indexes are not the same thing as relational indexes; they are generated and used by SSAS internally during the aggregation process.
■■ Unprocess If you select this option, SSAS drops the data in the cube or dimension. If there are any lower-level dependent objects—for example, dimensions in a cube—those objects are dropped as well. This option is often used during the development phase of a BI project to quickly clear out erroneous results.
■■ Index With this option, SSAS creates or rebuilds indexes for all processed partitions. This option results in no operation on unprocessed objects.
■■ Structure (cubes only) With this option, SSAS processes the cube and any contained dimensions, but it does not process any mining models.
■■ Incremental (cubes only) With this option, SSAS adds newly available fact data and processes only the affected partitions. This is the most common option used in day-to-day production.
■■ Update (dimensions only) If you select this option, SSAS forces an update of dimension attribute values. Any new dimension members are added, and attributes of existing members are updated.
Note Aggregation processing behavior in dimensions depends on the RelationshipType property of the attribute relationship. If this property is set to the default value (Flexible), aggregations are dropped and re-created on an incremental process of the cube or an update of the dimension. If it is set to the nondefault value (Rigid), aggregations are retained for cube/dimension incremental updates. Also, if you set the ProcessingMode property for a dimension to LazyAggregations, flexible aggregations are reprocessed as a background task and end users can browse the cube while this processing is occurring.

An optimization step you can take to reduce processing times for your dimensions is to turn off the AttributeHierarchyOptimizedState property for dimensional attributes that are viewed only infrequently by end users.

Tip To identify infrequently queried attributes, you can either use a capture of queries from SQL Server Profiler or read the contents of the query log table used by the Usage-Based Optimization Wizard.
To adjust the AttributeHierarchyOptimizedState property, open the Properties dialog box for the particular dimension attribute and then set the property value to NotOptimized. Setting the value to NotOptimized causes SSAS to skip creating the supplementary indexes (which are created by default) for this particular attribute during dimension or cube processing. This can result in slower query times, so change this setting only for rarely browsed attributes.

The final consideration when processing cubes and dimensions is whether you'll need to adjust any of the processing options. You access these options by clicking the Change Settings button in the Process Cube dialog box. Clicking this button displays the Change Settings dialog box, which is shown in Figure 9-19. This dialog box contains two tabs: Processing Options and Dimension Key Errors.
Figure 9-19 The Dimension Key Errors tab in the Change Settings dialog box allows you to specify custom error behavior responses when processing a cube.
On the Processing Options tab, you can set the following values:
■■ Parallel Processing or Sequential Processing (if parallel, the maximum number of parallel tasks must be specified)
■■ Single or multiple transactions (for sequential processing)
■■ Writeback Table (either Use Existing, Create, or Create Always)
■■ Process Affected Objects (either Off or On)
The Dimension Key Errors tab, shown in Figure 9-19, allows you to configure the behavior of errors during processing. By reviewing this tab, you can see that you can either use the default error configuration or set a custom error configuration. When using a custom error configuration, you can specify actions to take based on the following settings:
■■ The Key Error Action drop-down list enables you to specify a response to key errors. Figure 9-19 shows the Convert To Unknown option selected in the drop-down list.
■■ The Processing Error Limit section enables you to specify a limit for the number of errors that are allowed during processing. Figure 9-19 shows the Number Of Errors option set to 0 (zero) and the On Error Action drop-down list with the Stop Processing item selected. This configuration stops processing on the first error.
■■ The Specific Error Conditions section includes the Key Not Found, Duplicate Key, Null Key Converted To Unknown, and Null Key Not Allowed options.
■■ The Error Log Path text box allows you to specify the path for logging errors.
Although you have probably been processing test cubes for a while prior to reading this chapter, you should now have a bit more insight into what actually happens when you run the process action. As we've seen, when you execute the Process command on a cube or dimension, the step-by-step output of processing is shown in the Process Progress dialog box. In production, you usually automate the cube/dimension processing via SSIS packages, using the cube or dimension processing tasks that are included as part of the SSIS control flow tasks.

As we mentioned previously in this chapter, you can choose to process one or more dimensions rather than processing an entire OLAP cube. Cube processing automatically includes associated dimension processing. In fact, processing a cube executes in the following order: dimension (or dimensions) processing and then cube processing. It should be clear by this point why the processing is done in this order. The metadata of the cube includes the dimension source data. For example, for a Customer dimension, the source rows for each customer in the customer table create the structure of the cube. In other words, the number of (in our example, customer) source rows loaded during dimension processing determines the size of the cube container. You can visualize this as one of the sides of the cube—that is, the more source (customer) rows there are, the longer the length of that particular side will be. Of course, this is not a perfect analogy because cubes are n-dimensional, and most people we know can't visualize anything larger than a four-dimensional cube. (If you're wondering how to visualize a four-dimensional cube, think of a three-dimensional cube moving through time and space.)

Because the source rows for a dimension are metadata for an OLAP cube, the cube dimensions must be successfully processed prior to loading cube data (which is data from any underlying fact tables). When you select a particular type of cube processing—that is, Full, Incremental, Update, and so on—the related dimensions and then the cube are processed using that method. We find that complete cube processing has been the most common real-world scenario, so we focused on fully explaining all the options available in that approach in this chapter. As we mentioned previously, we have occasionally encountered more granular processing requirements related to one or more dimensions. SSAS does support individual dimension processing using the processing options described in the list earlier in this section.
Summary
Although there are myriad storage and processing options, we find that for most projects the default storage method of MOLAP works just fine. However, partitions do not create themselves automatically, nor do aggregations. Intelligent application of both of these features can make your cubes much faster to query, while still keeping processing times to an acceptably fast level. If you choose to implement any of the advanced features, such as proactive caching or ROLAP dimensions, be sure you test both query response times and cube processing times during development with production-load levels of data. In the next two chapters, we'll look at the world of data mining. After that, we've got a bit of material on the ETL process using SSIS to share with you. Then we'll explore the world of client tools, which include not only SSRS, but also Office SharePoint Server 2007, PerformancePoint Server, and more.
Chapter 10
Introduction to MDX

In this chapter and the next one, we turn our attention to Multidimensional Expressions (MDX) programming. MDX is the query language for OLAP data warehouses (cubes). Generally speaking, MDX is to OLAP databases as Transact-SQL is to Microsoft SQL Server relational databases. OLAP applications use MDX to retrieve data from OLAP cubes and to create stored and reusable calculations or result sets. MDX queries usually comprise several items:
■■ Language statements (for example, SELECT, FROM, and WHERE)
■■ OLAP dimensions (a geography, product, or date hierarchy)
■■ Measures (for example, dollar sales or cost of goods)
■■ Other MDX functions (Sum, Avg, Filter, Rank, and ParallelPeriod)
■■ Sets (ordered collections of members)
We take a closer look at all of these items, mostly by examining progressively more complex MDX queries and statements. Although MDX initially appears similar to Transact-SQL, there are a number of significant differences between the two query languages. This chapter covers the fundamentals of MDX and provides brief code samples for most of the popular MDX language features. The next chapter provides some richer examples of how you can leverage MDX in your business intelligence (BI) solution. In this chapter, we first take a closer look at core MDX syntax and then discuss several commonly used MDX functions. Unlike the rest of this book, this chapter and the next one focus on language syntax. We feel that you'll learn best by being able to try out successively more complex MDX queries. As with Transact-SQL, MDX query performance depends not only on writing efficient data access queries, but also on that code making efficient use of the internal query processing mechanisms. Although the focus of this chapter is on understanding and writing efficient code, we also introduce some core concepts related to query processing architecture in SSAS here.
The Importance of MDX
So far in this book, we haven't emphasized MDX syntax. Although we have implemented some BI solutions that included only minimal manual MDX coding, we find that understanding the MDX language is important for building a successful BI project. One reason for this is that tools, such as SQL Server Management Studio (SSMS) and Business Intelligence Development Studio (BIDS), that expose OLAP cubes for reporting/dashboard purposes
contain visual designers to create output. Most of these designers (for example, the designers in SSRS and PerformancePoint Server) generate MDX code. Why then is it still important, even critical sometimes, for you to know MDX? Although most tools do an adequate job of allowing users to produce simple output from OLAP cubes, these tools can't automatically generate MDX queries for every possible combination of options that your project's business requirements might demand of certain reports. Here are some examples of real-world problems for which you might need to write manual MDX queries:
■■ To create a certain complex sort order for output
■■ To rank items in a report by some criteria
■■ To display hierarchical data in a way that the design tool doesn't fully support
■■ To create trend-based measures in a report, such as dollars sold for the same time period of the previous year or dollar sales to date
■■ To create KPIs with complex logic that compares the KPI value of one period to the KPI value of another period
Even when OLAP reporting tools provide designers that automatically generate MDX code, you might still find yourself needing to modify or rewrite the code. Even if you need to add only one line of code or just tweak a generated MDX expression, you still need to know how and where to do it. We believe that successful OLAP developers need to learn both the fundamentals of MDX and the more advanced features of the language.

In this chapter, we focus on MDX queries that produce a result set. When you're working with OLAP cubes, you'll use both MDX queries and MDX expressions. To understand the difference between an MDX query and an MDX expression, you can think of a Microsoft Office Excel spreadsheet. An OLAP cube MDX expression produces a calculated cell as a result. In this way, it's similar to an Excel cell formula—it calculates new values and adds them to the cube output that is displayed. The difference is that these calculations are automatically applied to multiple cells in the cube as specified in the scope of the MDX expression. An OLAP cube MDX query produces a result set, which is called a cellset. This cellset is normally displayed in a matrix-type output. This is similar to applying a filter to an Excel workbook: the filter produces new output, which is some reduced subset of the original data.

Note We find that when developers with a Transact-SQL background initially see MDX syntax, they conclude that MDX will be similar to another dialect of SQL. Although that conclusion is understandable, it's counterproductive. We find that developers who can put aside comparisons between Transact-SQL and MDX learn MDX more quickly, because there isn't always a direct Transact-SQL equivalent for each MDX concept.
Writing Your First MDX Queries
We'll begin by entering queries directly into the query editor window in SSMS. To do this, start SSMS, and connect to your SQL Server Analysis Services (SSAS) instance and the Adventure Works DW sample. Next, select the OLAP cube in the SSMS Object Explorer, and then click the New Query button to open a new MDX query editor window. You'll write new queries for two main reasons: to generate particular types of reports, or as the result of directly manipulating cube data in a client interface, such as a pivot table that supports direct query generation. A basic MDX query contains a SELECT statement, a definition for the COLUMNS axis, a definition for the ROWS axis, and the source cube. A simple but meaningful query we can write against Adventure Works is shown here as well as in Figure 10-1:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Customer].[State-Province] ON ROWS
FROM [Adventure Works]
Figure 10-1 Simple MDX query
You must specify the COLUMNS axis value first and the ROWS axis value second in an MDX query. You can alternatively use the axis position—that is, COLUMNS is axis(0) and ROWS is axis(1)—in your query. It's helpful to understand that you can return a single member or a set of members on each axis. In this case, we've written a simple query—one that returns a single member on both axes. In the FROM clause, you'll reference a cube name or the name of what is called a dimension cube. The latter name type is prefaced with a dollar sign ($). We'll talk more about this in a later section.

Tip It is common in MDX query writing to capitalize MDX statement keywords and to use camel casing (that is, PeriodsToDate) for MDX functions. This is a best practice and makes your code more readable for other developers. MDX itself (other than when performing string matches) is not case sensitive. MDX is also not space sensitive (other than for string matches), so it's common to place keywords on a new line, again for readability.
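To make the axis-position point concrete, here is the first query rewritten with axis numbers in place of the COLUMNS and ROWS keywords (a quick sketch against the same Adventure Works sample; both forms should return the same cellset):

SELECT
[Measures].[Internet Sales Amount] ON AXIS(0), // equivalent to ON COLUMNS
[Customer].[State-Province] ON AXIS(1)         // equivalent to ON ROWS
FROM [Adventure Works]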
MDX Object Names
Individual object names are surrounded by brackets. The syntax rules require square brackets around object names in only the following circumstances:
■■ Object names contain embedded spaces.
■■ Object names are the same as MDX keywords.
■■ Object names begin with a number rather than a string.
■■ Object names contain embedded symbols.
Also, object names that are member names can be either explicitly named—for example, [Customer].[State-Province].California, CA—or referenced by the key value—for example, [Customer].[State-Province].&[CA]&[US]. The ampersand (&) is used with key values to identify the dimension member. The key can be a multipart key, as in the preceding example (where CA is one value in the key and US is the other). A query will always run if you choose to include all object names in square brackets, and we consider this to be a best syntax practice. Note that dimensions and their related objects (that is, members, levels, and hierarchies) are separated by a period.

Object names are also called tuples. Here is the definition of a tuple from SQL Server Books Online: A tuple uniquely identifies a cell, based on a combination of attribute members that consist of an attribute from every attribute hierarchy in the cube. You do not need to explicitly include the attribute member from every attribute hierarchy. If a member from an attribute hierarchy is not explicitly listed, then the default member for that attribute hierarchy is the attribute member implicitly included in the tuple.

If your query consists of a single tuple, the delimiter used to designate a tuple—parentheses—is optional. However, if your query contains multiple tuples from more than one dimension, you must separate each tuple by a comma and enclose the group of tuples in parentheses. The order of the members in the tuple does not matter; you're uniquely identifying a cell in the cube simply by listing the values and then enclosing your list in parentheses.
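To illustrate the two naming styles, here are two minimal queries against the Adventure Works sample, one referencing a member by name and one by key (run them separately; we're assuming the member caption in your copy of the sample is California and that the multipart key matches the example above):

// member referenced by name
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Customer].[State-Province].[California] ON ROWS
FROM [Adventure Works]

// the same member referenced by its key
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Customer].[State-Province].&[CA]&[US] ON ROWS
FROM [Adventure Works]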
Other Elements of MDX Syntax
Here are a few other basic things to know about MDX:
■■ Single-line code comments are created with the double forward slash (//) or double hyphen (--) delimiters, just as they are in Transact-SQL.
■■ Multiline comments repeat this syntax for each line of the comment: /* line 1 */ . . . /* line 2 */ . . . /* line n */ (see the short example after this list).
■■ Operators—for example, the plus sign (+), minus sign (–), forward slash (/), and others—work the same way that they do in Transact-SQL. However, the asterisk (*) can have a special meaning, which will be discussed later in this chapter. Operator precedence is also the same in MDX as it is in Transact-SQL. Using angle brackets (< >) means "not equal to" in MDX.
■■ MDX contains reserved keywords. You should avoid naming objects using these words. If used in a query, they must be delimited with square brackets. See the SQL Server Books Online topic "MDX Reserved Words" for a complete list.
■■ MDX contains functions. These functions perform set-creation operations, hierarchy navigation, numeric or time-based calculations, and more. We'll take a closer look at many functions throughout these two chapters.
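Because comment syntax is easiest to see in context, here is a trivially commented query (a sketch only; the comments have no effect on the result):

// Internet Sales Amount broken out by state
-- the next lines are the query itself
/* a multiline comment
   spanning two lines */
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Customer].[State-Province].Members ON ROWS
FROM [Adventure Works]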
If you make a syntax error in your MDX query, the query tool in SSMS shows you a red squiggly line under the code in error. Executing a query that contains an error displays some information about the error in the Messages pane, although the error information is much less detailed than what you might be used to from Transact-SQL.

The query shown earlier in Figure 10-1 generates an aggregated, one-column result of a little over $29 million. Although that result might be helpful if you simply want to know the total of Internet sales for the entire OLAP cube, you'll usually need to "slice and dice" the Adventure Works cube by using different conditions. This chapter demonstrates how to break down the cube by different criteria. First, let's produce a list of Internet sales by state. (State is one of the Adventure Works OLAP Customer dimension levels.) Our query, shown in the following code sample, lists the states on the ROWS axis:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Customer].[State-Province].Members ON ROWS
FROM [Adventure Works]
You can type this query into a query window (such as the one in SSMS), or you can use the designer to drag and drop measures and dimension elements from the cube list into the query window. For better productivity, we recommend that you drag and drop object names from the metadata listing into the position in the query text where you want to insert them. Object names can become quite long and cumbersome to type, particularly when you’re including objects that are part of a deep hierarchy, such as Customer->Customer Geography->Country->State->City, and so on. In Figures 10-2 and 10-3, we show the metadata available for the Adventure Works sample cube. In Figure 10-2, we expanded the Measures metadata for the Adventure Works sample cube, showing the contents of the Internet Sales measure group. This group includes two types of measures: stored measures and calculated members. Stored measures are indicated by a bar chart icon. The first stored measure shown is Internet Extended Amount. Calculated members
are indicated by a bar chart plus a calculator icon. The first calculated member shown is Growth In Customer Base. From an MDX query perspective, both types of measures are queried in an equivalent way.
Figure 10-2 Available measures from the Adventure Works cube
Tip Queries to stored measures generally return results more quickly than queries to calculated members. We recommend that you test queries to calculated members under production-load levels during prototyping. Based on the results, you might want to convert some calculated members to stored measures. You will have this option available only with some types of calculated members. For example, ratios typically have to be calculated dynamically to return accurate results. Of course, converting calculated members to stored measures increases the cube storage space and processing times, so you must base your design on business requirements.

In Figure 10-3, we've opened the Metadata tab to expose the items contained in a particular dimension—in this case, Customer. These items include all of the attribute hierarchy display folders, dimension members, dimension levels, and dimension hierarchies. Recall that all dimensions have members and at least one level. It's optional for dimensions to have hierarchies and display folders. If a dimension has a defined hierarchy, to query objects in that
dimension, you must specify the particular hierarchy as part of the object name. Hierarchy display folders are for display only; they are not used in MDX queries.
Figure 10-3 Available dimensions from the Adventure Works cube
The number of dots next to the level names—in our example, Country, State-Province, and so on—indicates the position in the hierarchy, with smaller numbers indicating a higher position in the hierarchy. The grouping of dots in the shape of a pyramid—in our case, next to Customer Geography—indicates that the object is a dimensional hierarchy.
MDX Core Functions
Figure 10-4 shows the results after we changed the query to return a set of members on the ROWS axis. We did this by specifically listing the name of a dimension level in the MDX query (State-Province) and by using the MDX Members function. The Members function returns a list of all members in the State-Province dimension level, plus an All Customers total. The All member is included in every dimension by default unless it has been specifically disabled or hidden. The All member is also the default returned member when none is specified. The default member can be changed in the design of a dimension or by associating specific default members with particular security groups—that is, members of the WestRegion security group will see the West Region as the default returned member, members of the EastRegion security group will see the East Region as the default returned member, and so on.
Figure 10-4 Result set for all members of the Customer/State dimension
If we simply wanted the states, without a total row (or the All member), we could change Members to Children as shown here: SELECT [Measures].[Internet Sales Amount] ON COLUMNS, [Customer].[State-Province].Children ON ROWS FROM [Adventure Works]
Note that some states or provinces contain null values, indicating there is no value for that cell. You can filter out the null values with the NON EMPTY keyword, as shown in the following code. The query results are shown in Figure 10-5. Adding the NON EMPTY keyword also makes your queries more efficient because the results returned are more compact.

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
NON EMPTY [Customer].[State-Province].Children ON ROWS
FROM [Adventure Works]
Figure 10-5 Results of querying with NON EMPTY
Chapter 10
Introduction to MDX
301
Of course, you’ll sometimes want more than one column to appear in a result set. If you want to include multiple columns, place them inside curly braces ({ }). Using curly braces explicitly creates a set in MDX terminology. A set consists of one or more tuples with the same dimensionality—here, two measures. You can create sets explicitly, by listing the tuples as we’ve done here, or you can use an MDX function to generate a set in a query. Explicit set creation is shown in the following code, and the query result is shown in Figure 10-6:

SELECT { [Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY [Customer].[State-Province].Children ON ROWS
FROM [Adventure Works]
Figure 10-6 The results of retrieving multiple columns with a NON EMPTY query
You might be wondering, “But what if I want to break out dollar sales and gross profit by the Product dimension’s Category level as well as the Customer dimension’s State level?” You simply join the two dimensions together (creating a tuple with multiple members) with parentheses on the ROWS axis of the MDX query, as shown in the following code. Unlike creating a set by using curly braces, as you just did on the COLUMNS axis, here you’re simply asking the MDX query processor to return more than one set of members on the ROWS axis by using the comma delimiter and the parentheses to group the sets of members. The query results are shown in Figure 10-7.

SELECT { [Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY ( [Customer].[State-Province].Children, [Product].[Category].Children ) ON ROWS
FROM [Adventure Works]
Figure 10-7 Results of creating a tuple
Alternatively, you could use the asterisk (*) symbol to join the two dimensions. The updated query is shown here, and the query results are in Figure 10-8: SELECT { [Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS, NON EMPTY [Customer].[State-Province].Children * [Product].[Category].Children ON ROWS FROM [Adventure Works]
Figure 10-8 Result set with two columns and two dimension sets
Up to this point, the result sets have appeared in the order specified in the cube—that is, the dimensional attribute order configured in the original metadata. You can sort the results with the Order function. The Order function takes three parameters: the set of members you want to display, the measure you’re sorting on, and the sort order itself. So if you want to
sort Customer State-Province on the Internet Sales Amount in descending order, you’d write the following query, which produces the output shown in Figure 10-9: SELECT {
[Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS, NON EMPTY Order( [Customer].[State-Province].Children, [Measures].[Internet Sales Amount],DESC) ON ROWS FROM [Adventure Works]
Figure 10-9 Sorted results can be obtained by using the Order function.
If you want to include an additional value on the ROWS axis, such as Product Category, you would write the following query. The results are shown in Figure 10-10. SELECT {
[Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS, NON EMPTY Order( [Product].[Category].Children * [Customer].[State-Province].Children , [Measures].[Internet Sales Amount],DESC) ON ROWS FROM [Adventure Works]
Figure 10-10 Sorting can be done on multiple dimension sets.
However, when you generate the result set with this query, you’ll see the State-Province members sorted by Internet Sales Amount, but within the dimension sort order of the Product Category members. This is the equivalent of the Transact-SQL construction ORDER BY Product Category, Sales Amount DESC. If you want to sort from high to low on every Product Category/Customer State-Province combination, regardless of the dimension order, you use the BDESC keyword instead of the DESC keyword. The BDESC keyword effectively breaks any dimension or hierarchy definition and sorts purely based on the measure.

You can retrieve specific dimension members by including them in braces, which, as you’ll recall from the earlier discussion in this chapter, explicitly creates a set. For instance, if you want to retrieve the sales amount for Caps, Cleaners, Fenders, and Gloves, you write the following query:

SELECT {[Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS,
{ [Product].[SubCategory].[Caps] ,
[Product].[SubCategory].[Cleaners] ,
[Product].[SubCategory].[Fenders] ,
[Product].[SubCategory].[Gloves] } ON ROWS
FROM [Adventure Works]
The results are shown in Figure 10-11.
Figure 10-11 Retrieving specific dimension members
If Caps, Cleaners, Fenders, and Gloves are consecutively defined in the dimension level, you can use the colon (:) symbol to retrieve those two members and all the members between them, just as you use the colon in Excel to define a range, as shown here:

SELECT {[Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS,
{ [Product].[SubCategory].[Caps] : [Product].[SubCategory].[Gloves] } ON ROWS
FROM [Adventure Works]
The : symbol is used mostly for date ranges. Here is an example that displays dates from a date range. The results are shown in Figure 10-12. SELECT {[Date].[Fiscal].[February 2002] : [Date].[Fiscal].[May 2002]} * { [Measures].[Internet Sales Amount], [Measures].[Internet Tax Amount]} ON COLUMNS, NON EMPTY [Product].[SubCategory].CHILDREN ON ROWS FROM [Adventure Works]
Figure 10-12 Results of using the colon to include a date range
Finally, you can place a WHERE clause at the end of a query to further “slice” a result set. In the following example, you won’t see references to “2005” or “Marketing” in the result set; however, the aggregations for sales amount and gross profit will include only data from fiscal year 2005 and where the Sales Reason includes Marketing. The query results are shown in Figure 10-13. Although the WHERE clause acts as a kind of filter for the result set, MDX has a separate Filter function that performs a different type of filtering. We’ll cover that in the next section.
[Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS, NON EMPTY Order( [Product].[Category].Children * [Customer].[State-Province].Children , [Measures].[Internet Sales Amount],BDESC) ON ROWS FROM [Adventure Works] WHERE ([Date].[Fiscal Year].[FY 2005], [Sales Reason].[Marketing])
Figure 10-13 Results of adding a WHERE clause
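One related capability worth knowing about: SSAS 2005 and later also accept a set in the WHERE clause, which lets you slice by several members of the same hierarchy at once. The following sketch (reusing the hierarchy names above) aggregates two fiscal years together:

SELECT { [Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS,
NON EMPTY [Product].[Category].Children ON ROWS
FROM [Adventure Works]
WHERE ( { [Date].[Fiscal Year].[FY 2003], [Date].[Fiscal Year].[FY 2004] } )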
Filtering MDX Result Sets
MDX provides many different options for filtering result sets. The specific filter option you’ll use depends on the type of filter you want to execute. Once again, seeing specific examples can go a long way toward understanding the language. For example, if you want to return only product subcategories with a total Internet Gross Profit of at least $1 million and Internet sales of at least $10 million, you can use the Filter function. In the following example, you simply wrap the values on ROWS inside a Filter function and then specify a filter expression. You can connect filter expressions with the AND/OR operators, as shown here:

SELECT { [Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS,
Filter([Product].[SubCategory].Children ,
[Measures].[Internet Gross Profit] > 1000000 AND
[Measures].[Internet Sales Amount] > 10000000) ON ROWS
FROM [Adventure Works]
The results are shown in Figure 10-14.
Figure 10-14 Results of using the Filter function
To help you understand the difference between using the WHERE keyword and the Filter function, we’ll combine both in a single query, as shown here. The results are shown in Figure 10-15.

SELECT { [Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS,
Filter([Product].[SubCategory].Children ,
[Measures].[Internet Gross Profit] > 1000 AND
[Measures].[Internet Sales Amount] > 10000) ON ROWS
FROM [Adventure Works]
WHERE ( [Customer].[Customer Geography].[Country].[Canada], [Promotion].[Promotion Type].[New Product] )
Figure 10-15 Results of using the WHERE keyword and the Filter function
At this point, you might be wondering, “In what situations should I use WHERE, and when should I use Filter?” Good question! The answer is that you should use Filter when you’re filtering results against a measure (for example, dollar sales greater than a certain amount, average price less than a certain amount, and so on). If you want to filter on specific
dimension members, you’re best off placing them in a WHERE statement. The one exception to this is if you want to filter on some substring of a dimension member. For example, suppose you want to filter on product subcategories that have the word BIKE in the description. You can use the InStr function inside a filter—and you must also drill down to the CurrentMember.Name property of each dimension member, as shown in the following code. The CurrentMember function returns information about the currently selected member. The returned object exposes several properties, such as Name and Value, which allow you to specify exactly what type of information you want to return about that member. The results of this query are shown in Figure 10-16.

SELECT { [Measures].[Internet Sales Amount], [Measures].[Internet Gross Profit] } ON COLUMNS,
Filter([Product].[SubCategory].Children ,
[Measures].[Internet Gross Profit] > 1000 AND
[Measures].[Internet Sales Amount] > 10000 AND
InStr([Product].[SubCategory].CurrentMember.Name, 'BIKE') > 0) ON ROWS
FROM [Adventure Works]
Figure 10-16 Results of using the InStr function
Calculated Members and Named Sets
Although OLAP cubes contain summarized and synthesized subsets of your raw corporate data, your business requirements might dictate that you add more aggregations. These are called calculated members because you most often create them by writing MDX expressions that reference existing members in the measures dimension. For instance, if you want to create a calculated member Profit Per Unit (Internet Gross Profit divided by Internet Order Quantity) and use that as the sort definition in a subsequent MDX query, you can use the WITH MEMBER statement to create a new calculation on the fly. The following query creates a calculated member that not only appears on the COLUMNS axis, but is also used for the Order function. The query results are shown in Figure 10-17. Note also that you do not need to separate multiple calculated members with commas. Also in this code example, we introduce the FORMAT_STRING cell property, which allows you to apply predefined format types to cell values returned from a query. In our case, we want to return results formatted as currency.
WITH MEMBER [Measures].[Profit Per Unit] AS [Measures].[Internet Gross Profit] / [Measures].[Internet Order Quantity], FORMAT_STRING = 'Currency' MEMBER [Measures].[Profit Currency] AS [Measures].[Internet Gross Profit], format_string = 'CURRENCY' SELECT { [Measures].[Internet Order Quantity], [Measures].[Profit Currency], [Measures].[Profit Per Unit] } ON COLUMNS, NON EMPTY Order( [Product].[Product].Children , [Measures].[Profit Per Unit],DESC) ON ROWS FROM [Adventure Works]
Figure 10-17 WITH MEMBER results
In addition to creating calculated members, you might also want to create named sets. As we learned in Chapter 8, “Refining Cubes and Dimensions,” named sets are simply aliases for groups of dimension members. You can create named sets in an OLAP cube using the BIDS interface for OLAP cubes—specifically, the Calculations tab. We look at the MDX code used to programmatically create named sets in the next query. There are some enhancements to named sets in SQL Server 2008 that we’ll cover in more detail in the next chapter. The core syntax is similar to that used to create calculated members—that is, CREATE SET…AS or WITH SET…AS—to create a session-specific or query-specific named set. In the following code, we enhance the preceding query by adding the Filter function to the second calculated member and placing it in a named set. So, in addition to ordering the results (which we do by using the Order function), we also filter the values in the calculated set in this query. In MDX, the SET keyword allows you to define and create a named subset of data from the source. SET differs from the Members function in that the latter restricts you to returning one or more values from a single dimension, whereas the former allows you to return values from one or more dimensions. We’ll use the SET keyword in this query to define a set of products ordered by Profit Per Unit and filtered to include only products where the Profit Per Unit is less than 100.
The query results are shown in Figure 10-18. WITH MEMBER [Measures].[Profit Per Unit] AS [Measures].[Internet Gross Profit] / [Measures].[Internet Order Quantity], FORMAT_STRING = 'Currency' MEMBER [Measures].[Profit Currency] AS [Measures].[Internet Gross Profit], FORMAT_STRING = 'Currency' SET [OrderedFilteredProducts] AS Filter( Order( [Product].[Product].Children , [Measures].[Profit Per Unit],DESC), [Measures].[Profit Per Unit] < 100) SELECT { [Measures].[Internet Order Quantity], [Measures].[Profit Currency], [Measures].[Profit Per Unit] } ON COLUMNS, NON EMPTY [OrderedFilteredProducts] ON ROWS FROM [Adventure Works]
Figure 10-18 Results of the enhanced WITH MEMBER query
Creating Objects by Using Scripts
There are several places in MDX where you can use a new object created by a script. These objects can be persistent or temporary. The general rule is that using WITH creates a temporary object—such as a temporary calculated member or named set—whereas CREATE creates a persistent object—such as a calculated member or named set. You should choose the appropriate type of object creation based on the need for reuse of the new object—WITH creates query-specific objects only. Another way to think of this is to consider that WITH creates the objects in the context of a specific query only, whereas CREATE creates objects for the duration of a user’s session. In the next chapter, we take a closer look at the syntax for and use of persistent objects.
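As a quick preview (a sketch only; the set name is ours, and the full syntax is covered in the next chapter), a session-scoped named set is created with CREATE and can then be referenced by any query you run later in the same session:

CREATE SESSION SET [Adventure Works].[SessionTop10Products] AS
TopCount([Product].[Product].Children, 10, [Measures].[Internet Sales Amount]);

-- run separately, later in the same session
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[SessionTop10Products] ON ROWS
FROM [Adventure Works]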
The TopCount Function
The next group of MDX functions we’ll examine is straightforward. It includes TopCount, which is useful, for example, if you want to retrieve the top 10 products by Internet Sales Amount. The TopCount function takes three parameters: the data you want to retrieve and display, the number of members, and the measure to be used for the sort criteria. So, in essence, a TopCount value of 10 on Internet Sales Amount is saying, “Give me the 10 products that sold the most, in descending order.” TopCount (and many other MDX functions) works best with named sets. Like calculated members, named sets allow you to abstract the definition of what is to be displayed down the left column (that is, rows) of a query result set (or across the top—columns—if need be). In this example, we create a named set that retrieves the top 10 best-selling products, and then we use that named set on the ROWS axis. The results are shown in Figure 10-19.

WITH SET [Top10Products] AS
TopCount([Product].[Product].Children, 10, [Measures].[Internet Sales Amount])
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Top10Products] ON ROWS
FROM [Adventure Works]
Figure 10-19 Results of using TopCount
As we saw previously when we sorted on multiple columns, you can also use TopCount to retrieve the top 10 combinations of products and cities, as shown here. The results are shown in Figure 10-20.

WITH SET [Top10Products] AS
TopCount( [Product].[Product].Children * [Customer].[City].Children , 10, [Measures].[Internet Sales Amount])
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Top10Products] ON ROWS
FROM [Adventure Works]
Figure 10-20 Multiple results on the ROWS axis using TopCount
Not surprisingly, there is a BottomCount statement, which allows you to retrieve the lowest combination of results. Additionally, there are TopPercent and TopSum statements (and their Bottom counterparts). TopPercent allows you to (for example) retrieve the products that represent the top 10 percent (or 15 percent, or whatever you specify) of sales. TopSum allows you to (for example) retrieve the highest-selling products that represent the first $1 million of sales (or some other criteria).

This next query is a bit more complex. Suppose we want the top five states by sales, and then for each state, the top five best-selling products underneath. We can use the MDX Generate function to perform the outer query (for the states) and the inner query (for the products), and then use State.CurrentMember to join the two. The Generate function produces a new set as a result, based on the arguments that you specify for it. Generate derives the new set by applying the set defined in the second argument to each member of the set defined in the first argument. It returns this joined set, and it eliminates duplicate members by default. Here is an example of this type of query:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
Generate (
TopCount ( [Customer].[State-Province].Children, 5 , [Internet Sales Amount]),
({[Customer].[State-Province].CurrentMember},
TopCount([Product].[SubCategory].Children, 5, [Internet Sales Amount] )), ALL
) ON ROWS
FROM [Adventure Works]
The results are shown in Figure 10-21.
Figure 10-21 Results of using Generate with TopCount
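Before moving on to ranking, here is a minimal sketch of the TopPercent variant mentioned above; the 80 percent threshold is just an illustration, and the member names come from the same Adventure Works sample used throughout this chapter:

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
TopPercent( [Product].[SubCategory].Children, 80, [Measures].[Internet Sales Amount]) ON ROWS
FROM [Adventure Works]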
Rank Function and Combinations
Not only can we use TopCount to retrieve the top n members based on a measure, we can also use the Rank function to assign a sequential ranking number. Rank allows you to assign the specific order number of each result. Just as with TopCount, you’ll want to use a named set to fully optimize a Rank statement. For example, suppose we want to rank states by Internet Sales Amount, showing only those states with nonzero sales. We need to do two things. First, we create a named set that sorts the states by Internet Sales Amount, and then we create a calculated member that ranks each state against the ordered set. The full query, which we’ll break down later, is shown here. The query results are shown in Figure 10-22.

WITH SET [SalesRankSet] AS
Filter( Order( [Customer].[State-Province].Children ,
[Measures].[Internet Sales Amount], BDESC ),
[Measures].[Internet Sales Amount] <> 0)
MEMBER [Measures].[SalesRank] AS
Rank([Customer].[State-Province].CurrentMember, [SalesRankSet])
SELECT { [Measures].[SalesRank], [Measures].[Internet Sales Amount] } ON COLUMNS,
[SalesRankSet] ON ROWS
FROM [Adventure Works]
WHERE ([Product].[Bikes], [Date].[Fiscal].[FY 2003])
Figure 10-22 Results of using Rank
Now let’s drill into the query just shown. First, because all ranking should ideally be performed against an ordered set, you’ll want to create a predefined (named) set of states sorted by dollar amount in descending order, as shown here:

WITH SET [SalesRankSet] AS
Filter( Order( [Customer].[State-Province].Children ,
[Measures].[Internet Sales Amount], BDESC ),
[Measures].[Internet Sales Amount] <> 0)
Next, because the ranking result is nothing more than a calculated column, you can create a calculated member that uses the Rank function to rank each state against the ordered set, like this: MEMBER [Measures].[SalesRank] AS Rank([Customer].[State-Province].CurrentMember, [SalesRankSet])
This code uses the CurrentMember function. In SQL Server 2005 and 2008, the CurrentMember function is implied, so you could instead write the member calculation without it, like this:

MEMBER [Measures].[SalesRank] AS
Rank([Customer].[State-Province], [SalesRankSet])
To extend the current example, it’s likely that you’ll want to rank items across multiple dimensions. Suppose you want to rank sales of products, but within a state or province. You’d simply include (that is, join) Product SubCategory with State-Province in the query, as shown here: WITH SET [SalesRankSet] AS Filter( Order( ( [Customer].[State-Province].Children, [Product].[SubCategory].Children ) , [Measures].[Internet Sales Amount], BDESC ), [Measures].[Internet Sales Amount] <> 0)
MEMBER [Measures].[SalesRank] AS Rank( ( [Customer].[State-Province].CurrentMember, [Product].[SubCategory].CurrentMember), [SalesRankSet]) SELECT { [Measures].[SalesRank], [Measures].[Internet Sales Amount] } ON COLUMNS, [SalesRankSet] ON ROWS FROM [Adventure Works] WHERE ( [Date].[Fiscal].[FY 2004])
The query produces the result set shown in Figure 10-23. Note that because we used BDESC to break the dimension members for state apart, we have California for two product subcategories, and then England and New South Wales, and then California again. So we could have switched State and SubCategory and essentially produced the same results.
Figure 10-23 Results of using Rank with Filter, Order, and more
Suppose you want to show rankings for each quarter of a year. This presents an interesting challenge because a product might be ranked first for one quarter and fourth for a different quarter. You can use the MDX functions LastPeriods and LastChild to help you retrieve these values. LastPeriods is one of the many time-aware functions included in the MDX library. Other such functions include OpeningPeriod, ClosingPeriod, PeriodsToDate, and ParallelPeriod. LastPeriods takes two arguments (the index, or number of periods to go back, and the starting member name), and it returns a set of members prior to and including the specified member. LastChild is just one of the many family functions included in the MDX library. These functions allow you to retrieve one or more members from a dimensional hierarchy based on the position in the hierarchy of the starting member (or members) and the function type. Other functions in this group include Parent, Ancestor, Siblings, and so on. LastChild returns the dimension member that is the last child (the last member in the hierarchy level immediately below the specified member) of the specified member. An example of this query follows, and the results are shown in Figure 10-24.
WITH SET [OrderedSubCats] AS Order([Product].[Subcategory].Children, [Measures].[Internet Sales Amount],BDESC) MEMBER [Measures].[ProductRank] AS Rank( [Product].[Subcategory].CurrentMember , [OrderedSubCats], [Measures].[Internet Sales Amount]) SET [last4quarters] AS LastPeriods(4,[Date].[Fiscal Quarter].LastChild) SELECT {
[last4Quarters] * {[Measures].[Internet Sales Amount], [ProductRank]}} ON COLUMNS, Order([Product].[Subcategory].Children, ([Measures].[Internet Sales Amount],[Date].[Fiscal Quarter].LastChild),DESC) ON ROWS FROM [Adventure Works]
Figure 10-24 Results of using Rank with LastPeriods
Head and Tail Functions
Next we’ll look at the MDX Head and Tail functions. Suppose, within a set of top-10 results, you want to retrieve the final five in the list (that is, results six through ten, inclusive). You can use the Tail function, which returns a subset of members from the end of a set, depending on how many you specify. For example, if you wanted the bottom five results from a top-10 listing by products and cities, you could write the following query:

WITH SET [Top10Products] AS
Tail( TopCount( [Product].[Product].Children * [Customer].[City].Children ,
10, [Measures].[Internet Sales Amount]), 5)
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Top10Products] ON ROWS
FROM [Adventure Works]
The result set is shown in Figure 10-25.
Figure 10-25 Results of using the Tail function
As you might imagine, the Head function works similarly, but it returns the first n members from the start of a set. It is a common BI business requirement to retrieve the top or bottom members of a set, so we frequently use the Head or Tail function in custom member queries in OLAP cubes.
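For comparison, here is a minimal sketch that uses Head to return the first three members of the same top-10 named set (the set name is ours):

WITH SET [Top10Products] AS
TopCount([Product].[Product].Children, 10, [Measures].[Internet Sales Amount])
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
Head([Top10Products], 3) ON ROWS
FROM [Adventure Works]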
Hierarchical Functions in MDX
One useful feature of OLAP cubes is their ability to use hierarchies to retrieve result sets by selecting a specific dimension member (a specific market, product group, or time element such as a quarter or month) and drilling down (or up) to see all the child (or parent) data. After you’ve established hierarchies in your OLAP dimensions, you can use several MDX functions to navigate these hierarchies. Let’s take a look at some examples. For starters, you can use the MDX Children function to retrieve all records for the next level down in a hierarchy, based on a specific dimension member. For example, you can retrieve sales for all the subcategories under the category of Bikes with the following query. The results are shown in Figure 10-26.

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
[Product].[Product Categories].[Category].[Bikes].Children ON ROWS
FROM [Adventure Works]

Figure 10-26 Results of using the Children function
You can take one of the results of the preceding query—for example, Road Bikes—and find sales for the children of Road Bikes with the following query, the results of which are shown in Figure 10-27:

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
[Product].[Road Bikes].Children ON ROWS
FROM [Adventure Works]
Figure 10-27 Results of using the Children function at a lower level of the hierarchy
Sometimes you might want to take a specific item and find all the items with the same parent. In other words, you might want to see all of the siblings for a specific value. If you want to find the sales data for the Road-150 Red, 44, as well as all other products that share the same subcategory parent as the Road-150 Red, 44, you can use the Siblings function as shown here: SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS, [Product].[Road-150 Red, 44].Siblings ON ROWS FROM [Adventure Works]
The results are shown in Figure 10-28.
Figure 10-28 Results of using the Siblings function
So we’ve looked at the Children function to drill down into a hierarchy and the Siblings function to look across a level. Now we’ll look at the Parent function to look up the hierarchy. Suppose you want to know the sales for the parent member of any member value, such as the sales for the subcategory value that serves as the parent for the Road-150 Red, 44. Just use the following Parent function: SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS, [Product].[Road-150].Parent ON ROWS FROM [Adventure Works]
The results are shown in Figure 10-29.
Figure 10-29 Results of using the Parent function
Now that you understand some basic hierarchical functions, we can look at more advanced functions for reading dimension hierarchies. Although MDX language functions such as Parent, Children, and Siblings allow you to go up (or down) one level in the hierarchy, sometimes you need more powerful functions to access several levels at once. For instance, suppose you need to retrieve all data from the second level (for example, product brand) down to the fourth level (for example, product item). You could use a combination of Parent and Parent.Parent (and even Parent.Parent.Parent) to generate the result. Fortunately, though, MDX provides other functions to do the job more intuitively. One such MDX function is Descendants, which allows you to specify a starting point and an ending point, and option flags for the path to take. For instance, if we want to retrieve sales data for Hamburg (which is at the State-Province level of the Customer.Customer Geography hierarchy) and all children down to the postal code level (which would include cities in between), we can write a query like this: SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS, NON EMPTY ( Descendants( [Customer].[Customer Geography].[Hamburg], [Customer].[Customer Geography].[Postal Code], SELF_AND_BEFORE )) ON ROWS FROM [Adventure Works]
The results of this query are shown in Figure 10-30. Notice the SELF_AND_BEFORE flag in the preceding query. These flags allow you to scope the results for a specific level or the distance between the two levels specified in the first two parameters. Essentially, you can decide what specifically between Hamburg and Hamburg’s postal codes you want to return (or if you want to go even further down the hierarchy). Here is a list of the different flags you can use in MDX queries, along with a description for each one: ■■
SELF Provides all the postal code data (lowest level)
■■
AFTER Provides all the data below the postal code level for Hamburg (that is, Customers)
■■
BEFORE
■■
BEFORE_AND_AFTER Gives us Hamburg the state, plus all cities, plus the data below the postal code for Hamburg (Customers)
Gives Hamburg as the state, plus all cities for Hamburg, but not postal codes
Chapter 10
Introduction to MDX
■■
SELF_AND_AFTER (Customers)
■■
SELF_AND_BEFORE for Hamburg
■■
SELF_BEFORE_AFTER Gives everything from Hamburg the state all the way to the lowest level (essentially ignores the Postal Code parameter)
■■
LEAVES
319
Gives postal codes, plus data below the postal code for Hamburg Gives everything between Hamburg the state and all postal codes
Same as SELF (all postal codes for Hamburg)
Figure 10-30 Results of using Descendants
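If you want to experiment with these flags, only the third argument changes. For example, this sketch swaps in BEFORE, which should return Hamburg and its cities but stop above the postal codes:

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
NON EMPTY ( Descendants( [Customer].[Customer Geography].[Hamburg],
[Customer].[Customer Geography].[Postal Code], BEFORE )) ON ROWS
FROM [Adventure Works]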
At this point, you might be impressed by the versatility of the Descendants function. We’d like to point out that although this function is very powerful, it does have some limits. In particular, it does not allow you to navigate upward from a starting point. However, MDX does have a similar function called Ancestors that allows you to specify an existing member and decide how far up a particular defined hierarchy you would like to navigate. For example, if you want to retrieve the data for one level up from the State-Province of Hamburg, you can use the Ancestors function in the following manner:

SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS,
Ancestors( [Customer].[Customer Geography].[Hamburg],
[Customer].[Customer Geography].[Country]) ON ROWS
FROM [Adventure Works]
The results are shown in Figure 10-31.
Figure 10-31 Results of using Ancestors
Now let’s look at a query that gives us the proverbial “everything but the kitchen sink.” We want to take the city of Berlin and retrieve both the children sales data for Berlin (Postal Code and Customers) and the sales data for all parents of Berlin (the country and the All Customers total). We can achieve this by doing all of the following in the query. The results are shown in Figure 10-32.
■■ Use Descendants to drill down from Berlin to the individual customer level.
■■ Use Ascendants to drill up from Berlin to the top level of the Customer hierarchy.
■■ Use Union to filter out any duplicates when combining two sets into one new result set. (If you don’t use Union, the city of Berlin would appear twice in the result set.) You can use the optional ALL flag with the Union function if you want to combine two sets into one new set and preserve duplicates while combining the sets.
■■ Use Hierarchize to fit all the results back into the regular Customer hierarchy order. Hierarchize restores the original order of members in a newly created set by default. You can use the optional flag POST to sort the members of the newly created set into a “post natural” (or “child before parent members”) order.
SELECT { [Measures].[Internet Sales Amount]} ON COLUMNS, Hierarchize( Union( Ascendants( [Customer Geography].[Berlin] ), Descendants( [Customer Geography].[Berlin], [Customer].[Customer Geography].[Customer], SELF_AND_BEFORE ))) ON ROWS FROM [Adventure Works]
Figure 10-32 Results of combining MDX navigational functions
Date Functions
Regardless of what other dimensions are in your data warehouse, it’s almost impossible to imagine an OLAP cube without some form of date dimension. Most business users need to “slice and dice” key measures by month, quarter, year, and so on—and they also need to perform trend analysis by comparing measures over time. MDX provides several functions for business users to break out and analyze data by date dimensions. Let’s take a look at some of the most common date functions. Before we get started with MDX date functions, let’s look at a basic example of the Children function against a specific date dimension member expression. For example, if you want to retrieve all the child data for FY 2004, you can issue the following query. The query results are shown in Figure 10-33.

SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Date].[Fiscal].[FY 2004].Children ON ROWS
FROM [Adventure Works]
Figure 10-33 Results of using Children with date values
If you want to get everything between the Fiscal Year and Fiscal Month levels for 2004, you can take advantage of the Descendants function as shown in this query: SELECT [Measures].[Internet Sales Amount] ON COLUMNS, Descendants ( [Date].[Fiscal].[FY 2004], [Date].[Fiscal].[Month] , SELF_AND_BEFORE ) ON ROWS FROM [Adventure Works]
The query results are shown in Figure 10-34.
Figure 10-34 Results of using Descendants with date values
Next, let’s say you want to look at sales for Q4 2004 and sales for “the same time period a year ago.” You can use the ParallelPeriod function, which allows you to retrieve members from some other period (at any defined level) in the hierarchy, in our case one year (four quarters) prior, as shown in the following query. The results are shown in Figure 10-35.

WITH MEMBER [SalesFromLYQuarter] AS
( [Measures].[Internet Sales Amount],
ParallelPeriod( [Date].[Fiscal].[Fiscal Quarter], 4) )
SELECT
{ [Measures].[Internet Sales Amount], [SalesFromLYQuarter] } ON COLUMNS, [Product].[Bikes].Children ON ROWS FROM [Adventure Works] WHERE [Date].[Fiscal].[Q4 FY 2004]
Figure 10-35 Results of using the ParallelPeriod function
As mentioned previously, OpeningPeriod is one of the many time-based functions included in the MDX library. It is a common business requirement to get a baseline value from an opening time period. There is also a corresponding ClosingPeriod MDX function. OpeningPeriod takes two arguments: the level and the member from which you want to retrieve the values. You can also use the OpeningPeriod function to retrieve values from, in our case, the first month and first quarter in a particular period as shown in the following query:
WITH MEMBER [First Month] AS ([Measures].[Internet Sales Amount], OpeningPeriod ( [Date].[Fiscal].[Month], [Date].[Fiscal])) , FORMAT_STRING = 'CURRENCY' MEMBER [First Quarter] AS ([Measures].[Internet Sales Amount], OpeningPeriod ( [Date].[Fiscal].[Fiscal Quarter], [Date].[Fiscal])) , FORMAT_STRING = 'CURRENCY' SELECT {[First Month], [First Quarter], [Measures].[Internet Sales Amount]} ON COLUMNS, NON EMPTY [Product].[SubCategory].Children ON ROWS FROM [Adventure Works] WHERE [Date].[Fiscal].[FY 2004]
The query results are shown in Figure 10-36.
Figure 10-36 Results of using OpeningPeriod
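The corresponding ClosingPeriod function mentioned earlier takes the same two arguments. Here is a sketch that reuses the same hierarchy and measure names to return the last month of the period instead of the first:

WITH MEMBER [Last Month] AS
([Measures].[Internet Sales Amount],
ClosingPeriod ( [Date].[Fiscal].[Month], [Date].[Fiscal])) , FORMAT_STRING = 'CURRENCY'
SELECT {[Last Month], [Measures].[Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[SubCategory].Children ON ROWS
FROM [Adventure Works]
WHERE [Date].[Fiscal].[FY 2004]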
You might simply want to show sales for a certain date period (month, quarter, and so on) and also show sales for the prior period. You can use the PrevMember function in a calculated member, as in the following query. The results are shown in Figure 10-37.

WITH MEMBER [SalesPriorDate] AS
([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember), FORMAT_STRING = 'CURRENCY'
SELECT {[Measures].[Internet Sales Amount], [SalesPriorDate]} ON COLUMNS,
Order( [Customer].[State-Province].Children, [SalesPriorDate], BDESC)
HAVING [SalesPriorDate] > 300000 ON ROWS
FROM [Adventure Works]
WHERE [Date].[Fiscal].[FY 2004] -- will also show 2003
Figure 10-37 Results of using PrevMember to display specific values
Finally, you can use LastPeriods and LastChild to retrieve the last four quarters of available data. Note the use of the Order function to sort the fiscal quarters by quarter name in descending order by using the BDESC keyword, as shown in the query. The BDESC keyword is one of the optional keywords that you can add to affect the sort produced by the Order function. The options are ascending (ASC), descending (DESC), ascending breaking the natural hierarchy (BASC), or descending breaking the natural hierarchy (BDESC). By “breaking the natural hierarchy” we mean sorting the results by the specified value without maintaining the sort within the hierarchy. The default for the Order function is ASC. In our example, BDESC causes the results to be sorted by the fiscal quarter member key rather than by the sales amount. The results are shown in Figure 10-38.

WITH SET [Last4Quarters] AS
Order( LastPeriods(4, [Date].[Fiscal Quarter].LastChild),
[Date].[Fiscal Quarter].CurrentMember.Properties ('Key'), BDESC)
SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
[Last4Quarters] ON ROWS
FROM [Adventure Works]
Figure 10-38 Results of using LastChild with date values
Using Aggregation with Date Functions
The Sum function is not new to you, but we’ll start with it as a basis for showing examples of other statistical functions included in the MDX library. We also use one of the time-based functions in the query—PeriodsToDate. This function creates a set that is passed to the Sum function. In addition to PeriodsToDate, MDX also includes the shortcut functions Wtd, Mtd, Qtd, and Ytd. These are simply variants of the PeriodsToDate function that are created to work with time data from a specific level—such as weeks, months, quarters, or years.
This query shows a simple Sum aggregate in the calculated member section: WITH MEMBER [SalesYTD] AS Sum( PeriodsToDate ([Date].[Fiscal].[Fiscal Year], [Date].[Fiscal].CurrentMember) , [Measures].[Internet Sales Amount]) SELECT { [Measures].[Internet Sales Amount], [Measures].[SalesYTD] } ON COLUMNS, [Product].[Category].Children ON ROWS FROM [Adventure Works] WHERE [Date].[Q3 FY 2004]
The results are shown in Figure 10-39.
Figure 10-39 Results of using Sum with date values
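For comparison, here is a sketch of the Ytd shortcut. Ytd assumes the hierarchy's year level is typed as Years, which is true of the Calendar hierarchy in the sample (but generally not of fiscal hierarchies, which is why the query above uses PeriodsToDate); the month member name follows the naming pattern used elsewhere in this chapter and might differ in your copy of the sample:

WITH MEMBER [Measures].[CalendarSalesYTD] AS
Sum( Ytd([Date].[Calendar].CurrentMember), [Measures].[Internet Sales Amount])
SELECT { [Measures].[Internet Sales Amount], [Measures].[CalendarSalesYTD] } ON COLUMNS,
[Product].[Category].Children ON ROWS
FROM [Adventure Works]
WHERE [Date].[Calendar].[Month].[October 2003]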
Here’s an interesting challenge. Suppose you want to list sales for each month and also show the 12-month moving average of sales (in other words, for each month, the average monthly sales for the prior 12 months). You can aggregate using the Avg function, and then use the LastPeriods function to go back 12 months as shown in the following query. The results are shown in Figure 10-40. WITH MEMBER [12MonthAvg] AS Avg(LastPeriods(12,[Date].[Calendar].PrevMember), [Measures].[Internet Sales Amount]) SELECT {[Measures].[Internet Sales Amount], [12MonthAvg]} ON COLUMNS, [Date].[Calendar].[Month] ON ROWS FROM [Adventure Works] WHERE [Date].[FY 2003]
Figure 10-40 Results of using the Avg function with date values
Of course, the MDX library offers statistical functions beyond Sum and Avg. Others that we commonly use include Count, Max, Median, Rank, Var, and the standard deviation functions, Stdev and StdevP.
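As one hedged illustration of these aggregates (the calculated member name here is ours, not part of the sample database), the following query uses Max over the months of the selected fiscal year to show the best single month of Internet sales next to the yearly total for each product category:

WITH
MEMBER [MaxMonthlySales] AS
    Max( Descendants([Date].[Fiscal].CurrentMember, [Date].[Fiscal].[Month]),
         [Measures].[Internet Sales Amount] ),
    FORMAT_STRING = 'CURRENCY'
SELECT {[Measures].[Internet Sales Amount], [MaxMonthlySales]} ON COLUMNS,
[Product].[Category].Children ON ROWS
FROM [Adventure Works]
WHERE [Date].[Fiscal].[FY 2004]

Swapping Max for Median, Count, or Stdev in the same pattern should return the corresponding statistic over the same set of months.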
About Query Optimization

There are many different factors involved in query processing in SSAS. These include the physical configuration of OLAP data (such as partitions), server hardware and configuration (memory, CPU, and so on), and internal execution processes. We'll take a closer look at a key aspect of these internal execution processes next.

As with Transact-SQL query execution in the relational SQL Server engine, SSAS makes use of internal mechanisms to determine the optimal MDX query execution plan. These mechanisms include multiple internal caches to efficiently produce results. These caches are scoped quite differently, so you'll want to understand these scopes when evaluating and tuning MDX queries. In general, when you execute MDX queries, if data is in a cache, SSAS retrieves it from the cache rather than from disk. On-disk data includes both calculated aggregations and fact data. Cache scopes include query context, session context, and global context. We include this information because each query can make use of the cache of only a single context type in a particular execution.

A concept that you'll work with when evaluating MDX query performance is that of a subcube. Subcubes are subsets of cube data that are defined by MDX queries. It is important for you to understand that each MDX query is broken into one or more subcubes by the SSAS query optimizer. When you are evaluating query performance using diagnostic tools such as SQL Server Profiler, you'll examine these generated subcubes to understand the impact of the MDX query execution. The most efficient scope to access is the global scope because it has the broadest reuse possibilities.

Of course, other factors (such as physical partitioning and aggregation design) affect query performance. Those topics were covered in Chapter 9, "Processing Cubes and Dimensions," and this chapter has focused on writing efficient MDX query syntax. We can't cover every scenario in this chapter, so you might also want to review the SQL Server Books Online topic "Performance Improvements for MDX in SQL Server 2008 Analysis Services" at http://msdn.microsoft.com/en-us/library/bb934106.aspx. When you are evaluating the effectiveness of your MDX statement, there are advanced capture settings available in SQL Server Profiler—such as Query Processing/Query Subcube Verbose—that you can use to evaluate which cache (if any) was used when executing your query. For more information, see the white paper titled "SQL Server Analysis Services Performance Guide" at http://www.microsoft.com/downloads/details.aspx?FamilyID=3be0488de7aa4078a050ae39912d2e43&DisplayLang=en.
Summary

In this chapter, we reviewed MDX syntax and provided you with many examples. These code samples included many of the functions and keywords available in the MDX language. We kept the code examples as simple as possible so that we could demonstrate the functionality of the various MDX statements to you. In the real world, of course, business requirements often dictate that you'll work with queries of much greater complexity. To that end, in the next chapter we'll take a more in-depth look at how you can use MDX to solve common analytical problems in a data warehouse environment.
Chapter 11
Advanced MDX

Now that you've seen MDX examples in the previous chapter, we turn our attention to more advanced uses of MDX, including using MDX in real-world applications. In this chapter, we take a look at a number of examples related to advanced MDX query writing. These include querying dimension properties, creating calculated members, using the IIf function, working with named sets, and gaining an understanding of scripts and SOLVE_ORDER. We also look at creating KPIs programmatically. We close the chapter with an introduction to working with MDX in SQL Server Reporting Services (SSRS) and PerformancePoint Server.
Querying Dimension Properties

We spent considerable time in the last chapter talking about hierarchies and drilling down from countries and product groups to see sales data summarized at various (dimension) levels, such as state, product subcategory, and so on. Although that's obviously an important activity in an OLAP environment, there are other ways to "slice and dice" data. For instance, in the Adventure Works DW 2008 Customer dimension, there are many demographic attributes you can use to analyze sales data. In Figure 11-1, you see dimension members such as Education, Marital Status, and Number Of Cars Owned.
Figure 11-1 Various dimension hierarchy attributes
In previous versions of SQL Server Analysis Services (SSAS), these infrequently used attributes were often implemented as member properties rather than as dimension attributes. Starting with the redesigned SSAS implementation in 2005, most developers choose to include these values as dimension attributes because query performance is improved. Usually, there is no aggregate hierarchy defined on these attributes—that is, they are presented as single-line attributes in the cube with no rollup navigational hierarchies defined. It is a common business requirement—for example, for a marketing team—to be able to analyze past data based on such attributes so that they can more effectively target new sales campaigns. You can write MDX queries that include these dimensional attributes. For example, if you wanted to see sales revenue for bike sales in France, broken down by the number of cars owned by the customer, you can write the following MDX query:

SELECT [Internet Sales Amount] ON COLUMNS,
[Customer].[Number of Cars Owned].[Number of Cars Owned].Members ON ROWS
FROM [Adventure Works]
WHERE ([Product].[Bikes], [Customer].[France])
Suppose you want to produce a query that generates sales for the last 12 months of available data and lists the sales in reverse chronological sequence (most recent month at the top). You might be tempted to write the following MDX query:

WITH
SET [Last12Months] AS
    Order( LastPeriods(12, Tail([Date].[Fiscal].[Month].Members, 1).Item(0).Item(0)),
           [Date].[Fiscal], BDESC)
SELECT [Internet Sales Amount] ON COLUMNS,
[Last12Months] ON ROWS
FROM [Adventure Works]
Does this generate the desired output? The numbers don’t lie, and they show that the desired outcome was not achieved—take a look at Figure 11-2:
Figure 11-2 Incorrect results when trying to order by descending dates
There are two issues with the result set. First, you didn’t achieve the desired sort (of descending months). Second, you have the empty months of August 2004 and November 2006. The
second problem occurs because the MDX query used the MDX Tail function to retrieve the last member in the Date.Fiscal.Month hierarchy; however, there isn't any actual data posted for those two months. Now that you see the result set, you can see that what you really want is the last 12 months of available data, where the first of the 12 months is the most recent month where data has been posted (as opposed to simply what's in the month attribute hierarchy). Let's tackle the problems in reverse order. First, let's change the query to determine the last month of available data by filtering on months against the Internet Sales Amount and using the Tail function to retrieve the last member from the list:

WITH
SET [LastMonth] AS
    Tail( Filter([Date].[Calendar].[Month], [Internet Sales Amount]), 1)
SET [Last12Months] AS
    Order( LastPeriods(12, [LastMonth].Item(0).Item(0)),
           [Date].[Fiscal], BDESC)
SELECT [Internet Sales Amount] ON COLUMNS,
[Last12Months] ON ROWS
FROM [Adventure Works]
That takes care of the second issue (the most recent month in the result set is now July 2004 instead of November 2006), but you still have the issue of the sort. So why doesn't the code in the first two listings, which sorts on the [Date].[Fiscal] level, work correctly? Generally speaking, sorting on dimension attributes is different from sorting on measures. Each dimension attribute has a KeyColumns collection that you can use for ordering. In the case of the Month attribute, the KeyColumns collection contains two definitions (the year and month stored as integers), as shown in Figure 11-3. So any MDX query that sorts on the actual month values must reference both KeyColumns collection properties. You can reference the key properties with Properties("Key0") and Properties("Key1"), as shown in the following code listing. Note that because the second key is an integer key representing the month, you need to right-justify and zero-fill it, using a Microsoft Visual Basic for Applications function. This is so that the year and month combined will be represented consistently (that is, 200809 for September 2008, 200810 for October 2008) for sorting purposes.

WITH
SET [LastMonth] AS
    Tail( Filter([Date].[Calendar].[Month], [Internet Sales Amount]), 1)
SET [Last12Months] AS
    Order( LastPeriods(12, [LastMonth].Item(0).Item(0)),
           [Date].[Fiscal].CurrentMember.Properties("Key0") +
           VBA!Right("0" + [Date].[Fiscal].CurrentMember.Properties("Key1"), 2),
           BDESC)
SELECT [Internet Sales Amount] ON COLUMNS,
[Last12Months] ON ROWS
FROM [Adventure Works]
Figure 11-3 Reviewing dimension attribute properties in Business Intelligence Development Studio (BIDS)
The preceding code generates the results in Figure 11-4—mission accomplished! The great thing about this query is that you can use this approach any time you need to generate a report for a user that shows the last 12 months of available data.
Figure 11-4 The correct results when ordering by descending dates
Looking at Date Dimensions and MDX Seasonality

Many organizations have business requirements related to retrieving data that has seasonality, and they evaluate sales performance based on different time frames across years. One common scenario is a user wanting to know how sales are faring when compared to the fiscal year's goals. Figure 11-5 shows the Calculations tab in BIDS with an MDX expression that retrieves the sum of the periods to date in the year for Internet Sales Amount.
Figure 11-5 Performing a calculation of all PeriodsToDate in BIDS
As mentioned in Chapter 10, “Introduction to MDX,” MDX includes several shortcut functions for common time-based queries. These functions are variants of the PeriodsToDate function. Let’s take a closer look at how one of these functions works. For our example, we’ll use the Ytd function. This function returns a set of sibling members from the same level as a given member, starting with the first sibling and ending with the given member, as constrained by the Year level in the Time dimension. The syntax is Ytd([«Member»]). This function is a shortcut function to the PeriodsToDate function that defines that function’s «Level» argument to be Year. If no particular member is specified, a default value of Time.CurrentMember is used. Ytd(«Member») is equivalent to PeriodsToDate(Year, «Member»). Other examples of time-aware MDX functions are the week-to-date, month-to-date, and quarter-to-date (Wtd, Mtd, and Qtd) functions.
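As a hedged sketch (the calculated member name is ours, and this assumes the Calendar Year level in the sample cube is typed as a year level so that Ytd can locate it), the following query is roughly equivalent to calling PeriodsToDate at the year level for each month on the rows:

WITH
MEMBER [CalendarSalesYTD] AS
    Sum( Ytd([Date].[Calendar].CurrentMember), [Measures].[Internet Sales Amount] ),
    FORMAT_STRING = 'CURRENCY'
SELECT {[Measures].[Internet Sales Amount], [CalendarSalesYTD]} ON COLUMNS,
[Date].[Calendar].[Month] ON ROWS
FROM [Adventure Works]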
Creating Permanent Calculated Members

Up until now, you've been placing calculations and named sets in-line, as part of the MDX query. In an actual production environment, if you want to reuse these definitions, it's a good idea to store these calculated members and named sets inside the OLAP cube. As we discussed in Chapter 8, "Refining Cubes and Dimensions," you create permanent calculated members using the Calculations tab in BIDS. For example, you might want to create a calculation for sales for the previous fiscal period (which could be the previous month, previous quarter, or even previous year from the current date selection) using the following MDX code.
WITH
MEMBER [SalesPriorFiscalPeriod] AS
    ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember)
SELECT {[SalesPriorFiscalPeriod], [Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[Category].Members ON ROWS
FROM [Adventure Works]
WHERE [Date].[March 2004]
To store this calculated member (SalesPriorFiscalPeriod) permanently in an OLAP cube, you can use the BIDS interface to create a calculated member. If you do not want the member to be permanent but do want to use it over multiple queries, you can create a calculated member that will persist for the duration of the session. The syntax to do this is CREATE MEMBER rather than WITH MEMBER.
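For example, a session-scoped version of the same calculation run from SSMS might look like the following sketch, which simply mirrors the WITH MEMBER definition shown earlier:

CREATE MEMBER [Adventure Works].[Measures].[SalesPriorFiscalPeriod]
AS ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember),
VISIBLE = 1;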
Creating Permanent Calculated Members in BIDS

As we've shown previously, when you create a new calculated member using the BIDS cube designer Calculations tab, you first enter the name for the new calculated member and then the associated MDX expression. (See Figure 11-6.)
Figure 11-6 BIDS interface to create a new calculated member
After you save the changes, you can reference the new calculated member in any MDX query in the same way you reference any other member that is defined in the cube. (See the following code sample and its results, which are shown in Figure 11-7.) This query returns the prior month’s (February 2004) Internet Sales Amount in the SalesPriorFiscalPeriod member. Note that in the query, you could have specified a quarter or a year instead of a month, and
the calculated member would give you the prior corresponding period. For example, if you specified FY 2004 as the date, the SalesPriorFiscalPeriod member would return the Internet Sales Amount for 2003. This calculated member is computed when an MDX query containing this member is executed. As we mentioned in Chapter 8, calculated member values are not stored on disk; rather, results are calculated on the fly at query execution time. Of course, internal caching can reuse calculated member results very efficiently. You'll recall from our discussion at the end of Chapter 10 that SSAS uses three internal cache contexts (query, session, and global) to store results. You can investigate use of query caches using SQL Server Profiler traces as well.

SELECT {[SalesPriorFiscalPeriod], [Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[Category].Members ON ROWS
FROM [Adventure Works]
WHERE [Date].[March 2004]
--WHERE [Date].[Q2 FY 2004]
--WHERE [Date].[FY 2004]
Figure 11-7 The results of querying a permanent calculated member
Creating Calculated Members Using MDX Scripts

Now that you've created a permanent saved script that will create a calculated member using the BIDS interface, let's create a second calculated member using a CREATE MEMBER statement. (In fact, when you use the BIDS interface to create a calculated member, BIDS actually generates scripts behind the scenes. You can toggle between the form view in BIDS and the script listing by using the Form View and Script View icons on the Calculations toolbar.) This second calculated member will determine the percentage of sales growth from one period to the prior period (and will actually use the SalesPriorFiscalPeriod member from the previous section). You can execute this calculated member code from SQL Server Management Studio (SSMS) after connecting to the corresponding SSAS instance and sample database.

CREATE MEMBER [Adventure Works].[Measures].[FiscalSalesPctGrowth]
AS ([Measures].[Internet Sales Amount] - [SalesPriorFiscalPeriod]) / [SalesPriorFiscalPeriod],
FORMAT_STRING = "Percent",
VISIBLE = 1;
Finally, you can write an MDX query to use both calculations, which will produce a result set arranged by product category for March 2004, showing the dollar sales for the month, the dollar sales for the previous month, and the percent of change from one month to the next. (See Figure 11-8.) Once again, note that you could have selected other date dimension members in the WHERE clause and the calculated members would have behaved accordingly.

SELECT {[SalesPriorFiscalPeriod], [FiscalSalesPctGrowth], [Internet Sales Amount]} ON COLUMNS,
NON EMPTY [Product].[Category].Members ON ROWS
FROM [Adventure Works]
WHERE [Date].[March 2004]
--WHERE [Date].[Q2 FY 2004]
--WHERE [Date].[FY 2004]
Figure 11-8 Results of calculated member created through scripting
Keep in mind that calculated members do not aggregate, so they do not increase storage space needs for the cube. Also, for this reason, you do not need to reprocess the associated OLAP cube when you add a calculated member to it. Although it's easy to add calculated members, you must carefully consider the usage under production load. Because member values are calculated upon querying (and are not stored on disk), the query performance for these values is slower than when you are accessing stored members.

New to SQL Server 2008 is the ability to dynamically update a calculated member using the UPDATE MEMBER syntax. The only member types that you can update using this syntax are those that are defined in the same session (scope). In other words, UPDATE MEMBER cannot be used on the BIDS cube designer's Calculations (MDX script) tab; rather, it can be used only in queries from SSMS or in custom code solutions.

Tip Also new to SQL Server 2008, the CREATE MEMBER statement allows you to specify a display folder (property DISPLAY_FOLDER) and an associated measure group (property ASSOCIATED_MEASURE_GROUP). Using these new properties can make your calculated members more discoverable for end users.
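As a hedged sketch of these properties (the display folder name here is purely illustrative, and we're assuming the Internet Sales measure group name from the sample cube), the FiscalSalesPctGrowth member from earlier could be created like this:

CREATE MEMBER [Adventure Works].[Measures].[FiscalSalesPctGrowth]
AS ([Measures].[Internet Sales Amount] - [SalesPriorFiscalPeriod]) / [SalesPriorFiscalPeriod],
FORMAT_STRING = "Percent",
VISIBLE = 1,
DISPLAY_FOLDER = 'Growth Calculations',
ASSOCIATED_MEASURE_GROUP = 'Internet Sales';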
Using IIf

You might encounter a common problem when you use the calculated members that you created in the last section. Let's take a look at the results if you were to run the query for the first month (or quarter, year, and so on) of available data. If you're wondering what the problem is, think about how MDX would calculate a previous member (using the PrevMember function) for month, quarter, year, and so on when the base period is the first available period. If no previous member exists, you get nulls for the SalesPriorFiscalPeriod member, and division by null for the FiscalSalesPctGrowth member. (See Figure 11-9.)
Figure 11-9 The results when querying for the first period, when using a PrevMember statement
Fortunately, MDX provides an immediate if (IIf) function so that you can test for the presence of a previous member before actually using it. So the actual calculation for SalesPriorFiscalPeriod is as follows:

IIf( ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember),
     ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember),
     'N/A')
So, in this example, you get an N/A in the SalesPriorFiscalPeriod member any time data does not exist for the previous member in the Date dimension. You can perform a similar IIf check for the FiscalSalesPctGrowth member, using a similar code pattern as shown in the preceding example, and then generate a better result set. (See Figure 11-10.)
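The cube's exact expression isn't shown in the book, but a sketch of that similar check for FiscalSalesPctGrowth might look like this:

IIf( ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember),
     ([Measures].[Internet Sales Amount] - [SalesPriorFiscalPeriod]) / [SalesPriorFiscalPeriod],
     'N/A')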
Figure 11-10 The results when implementing an IIf function to check for the existence of a previous member of the Date dimension
There are alternatives to using an IIf statement. One such alternative is to rewrite the query using the MDX CASE keyword (a rough sketch appears at the end of this section). For more information and syntax examples of using CASE, go to http://msdn.microsoft.com/en-us/library/ms144841.aspx. Also note that conditional logic expressed with CASE rather than IIf (particularly when nested) often executes more efficiently. Another way to circumvent the potential performance issues associated with the IIf function is by creating expressions as calculated members using the SCOPE keyword to limit the scope
of the member definition to a subcube. This type of calculation is an advanced technique, and you should use it only when your IIf statement does not perform adequately under a production load. For more detail, see the following blog entry: http://blogs.msdn.com/azazr/archive/2008/05/01/ways-of-improving-mdx-performance-and-improvements-with-mdx-in-katmai-sql-2008.aspx.

Note In SQL Server 2008, Microsoft has improved the performance of several commonly used MDX functions, such as IIf and others. However, to realize these performance improvements, you must avoid several conditions. One example is the usage of defined cell security in an MDX expression with an optimized function. For more detail, see the SQL Server Books Online topic "Performance Improvements in MDX for SQL Server 2008 Analysis Services" at http://msdn.microsoft.com/en-us/library/bb934106.aspx.
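Returning to the CASE alternative mentioned above, a rough CASE equivalent of the IIf check (using the IsEmpty function) might look like this sketch:

CASE
    WHEN IsEmpty( ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember) )
        THEN 'N/A'
    ELSE ([Measures].[Internet Sales Amount], [Date].[Fiscal].PrevMember)
END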
About Named Sets

We introduced named sets back in Chapter 10. Named sets are a collection of tuples (often an ordered collection) from one or more dimensions. So, as a reminder example, you can build a list of the top 10 products by profit, use a Rank function to generate a ranking number for each product, and then display the Internet Gross Profit and rank number for each product, for sales in Canada (as shown in the next code sample and Figure 11-11).

WITH
SET [Top10ProductsByProfit] AS
    TopCount( [Product].[Product Categories].[Product].Members, 10,
              [Measures].[Internet Gross Profit])
MEMBER [ProductProfitRank] AS
    Rank( [Product].[Product Categories].CurrentMember, [Top10ProductsByProfit])
SELECT {[Measures].[Internet Gross Profit], [Measures].[ProductProfitRank]} ON COLUMNS,
[Top10ProductsByProfit] ON ROWS
FROM [Adventure Works]
WHERE [Customer].[Country].[Canada]
Figure 11-11 Basic top-10 list generated when using TopCount and Rank in-line
So, now that you’ve defined a useful named set in MDX, let’s store the script that creates it permanently inside the OLAP cube so that you can reuse it. Using steps similar to those in the previous section, you can use BIDS to permanently create the MDX script that will create the named set and calculated member. Alternatively you can execute an MDX script in SSMS to create a named set that will persist for the duration of the user session. After you create the defined named sets, you can test that script by writing a small MDX query (shown in the following code sample) that uses the (now persistent) calculated member ProductProfitRank and the (now persistent) named set Top10ProductsByProfit: SELECT {[measures].[Internet Gross Profit], [measures].[ProductProfitrank]} on CoLumnS, [Top10ProductsByProfit] on rowS from [Adventure works]
This code generates the result set shown in Figure 11-12.
Figure 11-12 Correct results when using a persistent named set, with no subsequent dimension slicing
The results in Figure 11-12 seem fine. Of course, you didn't do any dimension slicing, so the TopCount and Rank functions are running against the entire OLAP database (by Product). Let's run our persistent named set against sales for Canada (as shown in the following code sample), which produces the result set shown in Figure 11-13.

SELECT {[Measures].[Internet Gross Profit], [Measures].[ProductProfitRank]} ON COLUMNS,
[Top10ProductsByProfit] ON ROWS
FROM [Adventure Works]
WHERE [Customer].[Country].[Canada]
So, what’s wrong with this query? Although the numbers are certainly smaller, the profit ranking and order of items are not correct. Here is the reason why: In SQL Server 2005, persistent named sets were static in nature. Unlike persistent calculated members, which are always dynamically evaluated when dimension slicing occurs, persistent named sets were only evaluated once, when the set is created. So the actual TopCount ordered set (and the Rank function that used the named set) are working off the original order, before any dimension slicing occurs. This was a significant drawback of persistent named sets in SQL Server 2005.
Figure 11-13 Incorrect results when using a persistent named set, with subsequent dimension slicing
Fortunately, SQL Server 2008 introduces a new feature called dynamic named sets, which solves this issue. Named sets marked with the new keyword DYNAMIC will evaluate for each query run. In BIDS, you can choose to create named sets as either dynamic or static. (See Figure 11-14.)
Figure 11-14 Creating a dynamic named set in SQL 2008 to honor subsequent dimension slicing
The BIDS designer actually generates the following code for the named set:

CREATE DYNAMIC SET CURRENTCUBE.[Top10ProductsByProfit] AS
    TopCount( [Product].[Product Categories].[Product].Members, 10,
              [Measures].[Internet Gross Profit]);
After changing the named set and redeploying the cube, you can re-execute the query shown in the code sample that precedes Figure 11-13 and get the correct results (as shown in Figure 11-15).

Tip MDX expert Mosha Pasumansky has written a useful blog entry on MDX dynamic named sets at http://sqljunkies.com/WebLog/mosha/archive/2007/08/24/dynamic_named_sets.aspx.
Figure 11-15 Correct results when using a persistent dynamic named set with dimension slicing
About Scripts You’ll remember from previous chapters that in addition to using calculated members, you can also add MDX scripts to your SSAS cube via the Calculations tab in BIDS. Here you can use the guided interface to add calculated members and more, or you can simply type the MDX into the script code window in BIDS. It is important that you understand that the script you create uses at least one instance of the MDX SCOPE keyword. Using SCOPE in an MDX script allows you to control the scope in which other MDX statements are applied. A script with a SCOPE keyword allows you to define a subset of your cube (which is sometimes called a subcube). Unlike a named set, this subcube is usually created so that you can read it as well as make changes to it or write to it. Note You can also use many MDX keywords—such as CALCULATE, CASE, FREEZE, IF, and others—in an MDX script. For more information, see the SQL Server Books Online topics “The Basic MDX Script” and “MDX Scripting Statements.”
A common business scenario for using subcubes is the one shown in the following example—that is, budget allocations based on past history and other factors. Subcubes are convenient for these kinds of scenarios because it is typical for business budgeting to be based on a number of factors—some past known values (such as actual sales of a category of products over a period of time for a group of stores by geography) combined with some future predicted values (such as newly introduced product types, for which there is no sales history). These factors often need to be applied to some named subset (or subcube) of your enterprise data.

There are two parts to a SCOPE command. The SCOPE section defines the subcube that the subsequent statements will apply to. The This section applies whatever change you want to make to the subcube. We'll also look at the FREEZE statement as it is sometimes used in scripts with the SCOPE command.
The sample script that is part of the Adventure Works cube, called Sales Quota Allocation, is a good example of using script commands. Switch to the script view on the Calculations tab in BIDS and you'll see two complete scripts (using both the SCOPE statement and This function) as shown in the following code sample:

/*-------------------------------------------------------------
| Sales Quota Allocation |
--------------------------------------------------------------*/
/*-- Allocate equally to quarters in H2 FY 2005 --------------*/
SCOPE
(
    [Date].[Fiscal Year].&[2005],
    [Date].[Fiscal].[Fiscal Quarter].Members,
    [Measures].[Sales Amount Quota]
);
    This = ParallelPeriod
           (
               [Date].[Fiscal].[Fiscal Year], 1,
               [Date].[Fiscal].CurrentMember
           ) * 1.35;
/*--- Allocate equally to months in FY 2002 --------------------*/
SCOPE
(
    [Date].[Fiscal Year].&[2002],
    [Date].[Fiscal].[Month].Members
);
    This = [Date].[Fiscal].CurrentMember.Parent / 3;
End Scope;
Here is a bit more detail on the This function and FREEZE statement, which are often used in conjunction with the SCOPE keyword:
■■ This This function allows you to set the value of cells as defined in a subcube (usually by using the MDX keyword SCOPE to define the subcube). This is illustrated in a script in the preceding code sample.
■■ FREEZE This statement (not a function) locks the specified value of the current subcube to the specified values. It's used in MDX scripts to pin a subcube (that is, exempt it from being updated) during the execution of subsequent MDX statements using the SCOPE statement and the This function. An example is shown in the following code sample:

FREEZE (
    [Date].[Fiscal].[Fiscal Quarter].Members,
    [Measures].[Sales Amount Quota]
);
An important consideration when using the new Calculations tab in BIDS to design MDX script objects is the order in which you add the script objects. Scripts are executed in the order (top to bottom) listed in the Script Organizer window. You can change the order of execution by right-clicking any one script and then clicking Move Up or Move Down. You can also change the order of execution for calculated members (or cells) by using the MDX keyword SOLVE_ORDER (explained in the next section of this chapter) inside the affected scripts.
Understanding SOLVE_ORDER

Suppose you want to produce a result set that shows sales amount, freight, and freight per unit as columns, and for these columns, you want to show Q3 2004, Q4 2004, and the difference between the two quarters as rows. Based on what you've done up to this point, you might write the query as follows (which would produce the result set shown in Figure 11-16):

WITH
MEMBER [Measures].[FreightPerUnit] AS
    [Measures].[Internet Freight Cost] / [Measures].[Internet Order Quantity],
    FORMAT_STRING = '$0.00'
MEMBER [Date].[Fiscal].[Q3 to Q4Growth] AS
    [Date].[Fiscal].[Fiscal Quarter].[Q4 FY 2004] -
    [Date].[Fiscal].[Fiscal Quarter].[Q3 FY 2004]
SELECT {[Internet Sales Amount], [Internet Freight Cost], [FreightPerUnit]} ON COLUMNS,
{[Date].[Fiscal].[Fiscal Quarter].[Q3 FY 2004],
 [Date].[Fiscal].[Fiscal Quarter].[Q4 FY 2004],
 [Date].[Fiscal].[Q3 to Q4Growth]} ON ROWS
FROM [Adventure Works]
Figure 11-16 First result set, for an all customer total
Do the results for the query look correct? Specifically, take a look at the FreightPerUnit calculation for the third row (that shows the difference between the two quarters). The cell should contain a value of 72 cents ($8.42 minus $7.70). The cell, however, contains $12.87. Although that value represents “something” (the growth in freight cost divided by the growth in order quantity), the bottom row should contain only values that represent the change in each column. So, for the FreightPerUnit column, it should be the freight per unit for Q4 minus the freight per unit for Q3.
So why isn’t the correct calculation being generated? Before you answer that question, stop and think about the query. The requirements for this query ask you to do something you previously haven’t done—perform calculations on both the row and column axes. Prior to this, you’ve generally created only new calculated members in one dimension—namely, the measures dimension. In this case, however, calculated measures are created in non-measure dimensions, so you must consider the order of execution of these measures. Specifically, you need to tell MDX that you want to calculate the FreightPerUnit member first, and then the Growth member second. Stated another way, you need to set the calculation order, or solve order. MDX contains a keyword, SOLVE_ORDER, that allows you to set the solve order for each calculated member. So you can add the following code to the two calculated members as shown in the following code sample, with the results shown in Figure 11-17. wITh mEmBEr [measures].[freightPerunit] AS [measures].[Internet freight Cost] / [measures].[Internet order Quantity] , formAT_STrInG = '$0.00', SoLvE_orDEr = 0 mEmBEr [date].[fiscal].[Q3 to Q4Growth] AS [Date].[fiscal].[fiscal Quarter].[Q4 fY 2004] [Date].[fiscal].[fiscal Quarter].[Q3 fY 2004] , SoLvE_orDEr = 10 SELECT {[Internet Sales Amount],[Internet freight Cost], [freightPerunit] } on CoLumnS, {[Date].[fiscal].[fiscal Quarter].[Q3 fY 2004], [Date].[fiscal].[fiscal Quarter].[Q4 fY 2004], [Date].[fiscal].[Q3 to Q4Growth] } on rowS from [Adventure works]
Figure 11-17 First result set for an all customer total
When you create calculated members on both the row and column axes, and one depends on the other, you need to tell MDX in what order to perform the calculations. In our experience, it's quite easy to get this wrong, so we caution you to verify the results of your SOLVE_ORDER keyword.

Note For more on solve orders, see the MSDN topic "Understanding Pass Order and Solve Order (MDX)" at http://msdn.microsoft.com/en-us/library/ms145539.aspx.
Creating Key Performance Indicators

You can create key performance indicators (KPIs) in SSAS cubes by writing MDX on the KPIs tab, as you've seen in Chapter 8, and then you can use those KPIs in client tools such as Microsoft Office Excel 2007 or PerformancePoint Server 2007. Because MDX code is the basis for KPIs, let's take a look at a basic KPI. Here you'll use calculated members as part of your KPI definition. The Adventure Works database tracks a measure called Total Product Cost. Suppose you want to evaluate the trend in Total Product Cost. First you need a calculated member that determines Total Product Cost for the previous period based on the current period. Figure 11-18 shows a calculated member that slices Total Product Cost to the previous member of the Date.Fiscal dimension hierarchy based on the current Date member selection.
Figure 11-18 Calculated member to determine the product costs for the previous period
Next you have a second calculated member (shown in Figure 11-19) that determines the percent of Total Product Cost increase from the previous period (which could be last month, last quarter, and so on) to the current period. Note that you’re evaluating the denominator before performing any division to avoid any divide-by-zero exceptions. Finally, on the KPIs tab (shown in Figure 11-20), you can create the KPI that evaluates the percent of change in product cost. For this KPI, let’s start with the basics. If the product cost has increased by only 5 percent or less from the prior month, quarter, or year, you’ll display a green light, which means you consider a 5 percent or less increase to be good. If the cost has increased by anywhere above 5 percent but less than 10 percent, you’ll display a yellow light, which means you consider that marginal. If the cost has increased by more than 10 percent, you consider that bad and will show a red light.
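Although the figures aren't reproduced here, a sketch of what those two calculated members might look like in the MDX script follows. It uses the member names the chapter references later ([ProductCostPriorFiscalPeriod] and [ProductCostPctIncrease]) and assumes the Total Product Cost measure name as given above; the exact expressions in the sample cube may differ:

CREATE MEMBER CURRENTCUBE.[Measures].[ProductCostPriorFiscalPeriod]
AS ([Measures].[Total Product Cost], [Date].[Fiscal].PrevMember),
VISIBLE = 1;

CREATE MEMBER CURRENTCUBE.[Measures].[ProductCostPctIncrease]
AS IIf( [Measures].[ProductCostPriorFiscalPeriod] = 0,
        NULL,
        ([Measures].[Total Product Cost] - [Measures].[ProductCostPriorFiscalPeriod])
            / [Measures].[ProductCostPriorFiscalPeriod] ),
FORMAT_STRING = 'Percent',
VISIBLE = 1;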
Figure 11-19 Second calculated member to determine the percent of increase (uses the first calculated
member)
Note You’ll recall from our discussion of KPI creation in Chapter 8 that the actual values that the KPI returns are as follows: 1 for exceeded value, 0 for at value, or –1 for below value. The KPI designer in BIDS allows you to select an icon set—such as green, yellow, or red traffic lights—to represent these values visually.
Figure 11-20 The BIDS interface to create a KPI
So, as a rule, a KPI contains (at least) the following:
■■ The name of the KPI.
■■ The associated measure group.
■■ The value associated with the KPI that the KPI will evaluate. (This value should be an existing calculated member, not an in-line expression, which we'll cover later in our discussion of KPI tips.)
■■ The goal (which could be a static number, calculated member, dimension property, or measure).
■■ The status indicator (traffic light, gauge, and so on).
■■ The possible values for the status indicator. (If we've met our goal, return a value of 1 for green, and so on.)
After deploying the changes, you can test the KPI in SSMS with the following code (as shown in Figure 11-21):

SELECT {
    [Measures].[Internet Sales Amount],
    [Measures].[ProductCostPriorFiscalPeriod],
    [Measures].[ProductCostPctIncrease],
    KPIValue("KPIProductCostPctIncrease"),
    KPIStatus("KPIProductCostPctIncrease")
} ON COLUMNS,
Order( Filter([Product].[Product].Children, [Internet Sales Amount] > 0),
       [ProductCostPctIncrease], BDESC) ON ROWS
FROM [Adventure Works]
WHERE [Date].[Q3 FY 2004]
Figure 11-21 Testing the KPI results in SSMS with a test MDX query
Creating KPIs Programmatically

New in SQL Server 2008 is the ability to create KPIs programmatically. This is a welcome enhancement for BI/database developers who preferred to script out KPI statements instead of designing them visually. This is accomplished by the addition of the CREATE KPI statement. As with the CREATE MEMBER statement, running the CREATE KPI statement from a query tool, such as SSMS, creates these KPIs for the duration of that query session only. There is also a new DROP KPI statement, which allows you to programmatically delete KPIs. The KPI script capability in SQL Server 2008 allows you to write the same statements that you placed in the designer back in Figure 11-19. The following code sample shows an example of how you'd script out the same KPI definitions (for example, goal statements, status expressions, and so on) you saw in Figure 11-19:

CREATE KPI [Adventure Works].[KPIProductCostPctIncrease]
AS [Measures].[ProductCostPctIncrease],
GOAL = .05,
STATUS =
    CASE
        WHEN KPIValue("KPIProductCostPctIncrease") <= KPIGoal("KPIProductCostPctIncrease")
            THEN 1
        WHEN KPIValue("KPIProductCostPctIncrease") <= KPIGoal("KPIProductCostPctIncrease") * 2
            THEN 0
        ELSE -1
    END,
STATUS_GRAPHIC = 'Traffic Light',
CAPTION = 'Product Cost Pct Increase',
DISPLAY_FOLDER = 'KPIs';
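If you later need to remove a session-scoped KPI, the corresponding DROP KPI statement should look something like the following (mirroring the name used in the CREATE KPI example above):

DROP KPI [Adventure Works].[KPIProductCostPctIncrease];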
Tip For more on creating KPIs programmatically, you can check out the following link: http://msdn.microsoft.com/en-us/library/bb510608.aspx.
Additional Tips on KPIs

Here are a few tips for creating and testing KPIs:
■■ Some developers place calculation expressions in the KPI value expression. Although this works, it also couples the calculated member to the KPI. In some reports, you might want to use the calculated expression without actually displaying the KPI. So a more manageable approach is to build the calculated expression as a separate calculated member. From there, you can refer to the calculated member by name when you define your KPI value, and you can also use the calculated member independently of the KPI.
■■ Create KPIs in your OLAP cubes rather than in client environments such as Excel, Office SharePoint Server 2007, or PerformancePoint Server. Although you can create KPIs using client tools, we prefer to create KPIs centrally because creating them outside of the SSAS OLAP cube creates a potential maintenance issue.
■■ The most effective way to test KPIs is outside of the BIDS environment. One way is to write some MDX code in SSMS, as you saw back in Figure 11-20. Another way is to test the KPIs in Excel.
Note There is an interesting project on CodePlex that showcases the use of programmatic KPI creation. It is called the Analysis Services Personalization Extension. This CodePlex project allows you to add calculations to a cube without requiring you to update and deploy an Analysis Services project. In addition, you can customize calculations for specific users. Download the sample application at http://www.codeplex.com/MSFTASProdSamples.
Using MDX with SSRS and PerformancePoint Server

MDX is an important part of any serious data warehousing application. However, in the world of fancy dashboards and reports, MDX is only as valuable as the reporting tools that support it. Reporting Services and PerformancePoint Server are two popular tools for producing end-user output in a data warehousing environment, and both support the incorporation of MDX to produce a truly flexible reporting experience. At this point, let's take a quick look at how you can use MDX with the two tools.
Using MDX with SSRS 2008

SSRS 2008, like its predecessor (SSRS 2005), allows report authors to create reports against SSAS OLAP cubes. In some instances, you can use the built-in graphical query tool to design reports without writing MDX code. However, you might often have to override the query designer and write your own custom MDX code if you want to use features of the MDX language that the query designer doesn't support. One example of this, which you'll see in the next couple of paragraphs, is the use of named sets. There are, of course, other features of MDX that aren't supported directly in the SSRS visual MDX query designer. As we'll show in Chapter 21, "Building Reports for SQL Server 2008 Reporting Services," you can switch the SSRS query designer in BIDS from visual to manual mode by clicking the design mode button on the embedded toolbar. In manual mode, you can simply type any MDX code that you want to use. Let's take a look at an example where we'll leverage the dynamic named set and ranking function from the "About Named Sets" section of this chapter. Figure 11-22 shows our result, a basic but functional SSRS report that shows the top 10 products based on geography and date selection. To get started, you can write the code in Figure 11-23, which hard-codes the customer and date selection into the WHERE clause. (In the next step, you'll change those to use query parameters.)
Figure 11-22 Sample output of an MDX query in SSRS
Figure 11-23 The MDX query editor in SSRS 2008
Next, you click the MDX query parameters button on the toolbar (the fifth button from the right), which is shown in Figure 11-23. SSRS 2008 then displays the Query Parameters dialog box, where you can define two queries, for the Customer parameter and Date parameter, and their corresponding dimension and hierarchy attribute settings. (See Figure 11-24.)
Figure 11-24 Defining MDX query parameters in SSRS 2008
After you define the query parameters, you can modify the query shown in Figure 11-24 to reference the query parameters. Note that SSRS uses the StrToSet function to convert each parameter from a string value to an actual set. The following code sample shows the MDX that is created; note the WHERE clause, which uses StrToSet to convert the named parameter values to sets. The result is shown in Figure 11-25.

SELECT {[Measures].[Internet Gross Profit], [Measures].[ProductProfitRank]} ON COLUMNS,
[Top10ProductsByProfit] ON ROWS
FROM [Adventure Works]
WHERE ( StrToSet(@CustomerParm), StrToSet(@DateParm) )
Figure 11-25 Using StrToSet and MDX parameters in SSRS 2008
Using MDX with PerformancePoint Server 2007

Although PerformancePoint Server 2007 contains many powerful built-in features, you'll often need to incorporate bits and pieces of MDX to build flexible output. We want to take a few moments and provide a quick walkthrough of a PerformancePoint Server chart that uses MDX. Once again, we'll show the result first. Figure 11-26 shows a line chart that plots the last 12 months of available data for a user-defined geography and product combination. Although a user could normally build this chart in PerformancePoint Server 2007 without using MDX, the chart also has a requirement to plot monthly sales for all siblings of the product selection. In Figure 11-26, the chart plots sales for Mountain Bikes, Road Bikes, and Touring Bikes when the user selects Mountain Bikes (because all three belong to the same parent, Bikes). For this, you need to use the MDX Siblings function, which the PerformancePoint Server 2007 designer doesn't really support. So you need to write some custom MDX.
Figure 11-26 The desired output in PerformancePoint Server 2007—a line chart that shows monthly sales for a selected product and its siblings
The PerformancePoint Server 2007 designer allows you to override the graphical designer and write your own custom MDX. Your MDX code will reference an existing named set called [Last12Months] and also account for user-defined parameters for geography and product. The named set is simply a convenience to make your code more readable. It consists
of the last 12 month–level members of the time hierarchy and is defined using the syntax that we covered at the beginning of this chapter in "Querying Dimension Properties." In PerformancePoint Server 2007, you reference parameters with << and >> tokens, as you can see in the following code sample (the parameter names match the GeoFilter and ProdFilter parameters defined in the next step):

SELECT [Internet Sales Amount] * <<ProdFilter>>.Siblings ON ROWS,
[Last12Months] ON COLUMNS
FROM [Adventure Works]
WHERE ( <<GeoFilter>> )
This code is entered into the MDX query editor, shown in Figure 11-27.
Figure 11-27 The MDX editor in PerformancePoint Server 2007, with the ability to code parameters using <<parm>> tokens
At the bottom of the MDX code entry page, you can define GeoFilter and ProdFilter with default values, as shown in Figure 11-28.
Figure 11-28 Defining MDX parameters in PerformancePoint Server 2007
The next step is to build filter sections in the dashboard page. In Figure 11-29, you define two filters, for GeographyDownToState and ProductsDownToSubcategory, so that the user can select from subset lists that only go down as far as State and SubCategory. As with the named set called Last12Months that we used in the earlier examples, both GeographyDownToState and ProductsDownToSubcategory are named sets that we’ve created to improve the readability of the code that will use these values.
Figure 11-29 Building filter definitions in PerformancePoint Server 2007
The final major step is to add all three components (the two filters and the chart) onto a dashboard page. (See Figure 11-30.) In addition, you also need to add filter links between the two filters and the chart parameters (not shown here). You do this by creating filter definitions and the links between the filter and the chart in PerformancePoint Server using the Dashboard Designer.
Figure 11-30 Building a sample dashboard in PerformancePoint Server 2007, with two filter definitions and the chart
Summary

In this chapter, you took a deeper look at working with MDX in the SSAS environment. Here (and in the previous chapter), we not only covered advanced functions such as IIf and ParallelPeriod, but also explored concepts such as scripting KPIs and named sets. We also looked at SOLVE_ORDER, and closed the chapter with an introduction to MDX syntax in SSRS and PerformancePoint Server. This included the use of other features, such as the StrToSet function and MDX parameters.
Chapter 12
Understanding Data Mining Structures

We have completed our tour of Microsoft SQL Server Analysis Services OLAP cubes and dimension design, development, refinement, processing, building, and deploying, but we still have much more to do in Analysis Services and the Business Intelligence Development Studio (BIDS). In this chapter and in Chapter 13, "Implementing Data Mining Structures," we explore the rich world of Analysis Services data mining structures. In this chapter, we review the business situations that warrant the use of Analysis Services data mining models and explain how to determine which specific data mining algorithms work best for various business needs. We continue to use BIDS as our development environment—for design, development, tuning, and so on. We have a lot of information to cover, so let's get started!
Reviewing Business Scenarios As we’ve discussed, you can think of the data mining functionality included in SSAS as a set of tools to give your end users the ability to discover patterns and trends based on defined subsets of your data. The source data can be relational or multidimensional. You can simply review the results of applying data mining algorithms to your data and use those results as a basis for making business decisions. You can also use the results as a basis for processing new data. Microsoft often called the data mining functionality available in SSAS predictive analytics because this set of tools is seen as a way for a business to proactively understand its data. An example of this would be to design (or refine) a marketing strategy based on the results of data clustering. Alternatively, you can use the result of this analysis to help you to predict future values based on feeding new data into the validated model. Data mining is meant to be complementary to an SSAS cube. A cube is often used to verify results—in other words, to answer the question “We think this happened, does the data support our belief?” Mining structures are used to discover correlations, patterns, and other surprises in the data—in other words, “What will happen?” Another common use of mining is when businesses buy competitive data; mining can be used to help businesses answer questions like “What if we got into this type of new business?” and “What if we started doing business in these locations?” In SQL Server 2008, Microsoft continues to focus on making data mining models easier for you to implement and the results easier for your users to understand. Data mining can be one of the most challenging types of data analysis solutions to put into operation because of the need to deeply understand the various algorithms involved. Traditionally, data mining products were used only by companies that had substantial resources: The specialized data 355
mining products were expensive, and consultants had to be hired to implement the complex algorithms included in those products. It was not uncommon for those working in the data mining industry to have advanced degrees in mathematics, particularly in the area of statistics. The general product goal for SSAS—BI for everyone—is extended to data mining. In fact, Microsoft usually refers to data mining as predictive analysis because it believes that the title more properly describes the accessibility and usage of the data mining toolset in SQL Server 2008.

The tools provided in BIDS make creating mining structures easy. As with OLAP cubes, data mining structures are created via wizards in BIDS. Tools to help you verify the accuracy of your particular mining model and to select the most appropriate algorithms are also available. Your users also benefit, by having meaningful results presented in a variety of ways. Both BIDS and SSMS include many data mining model viewers to choose from, and you can tailor data mining results to the appropriate audience. The client integration in Microsoft Office Excel that was introduced in SQL Server 2005 has been significantly enhanced for SQL Server 2008. An API is also included so that you can do custom development and integration into any type of user application, such as Windows Forms, Web Forms, and so on.

Note The model viewers in BIDS or SSMS are not intended to be used by end users. They are provided for you so that you can better understand the results of the various mining models included in your mining structure. You may remember that these viewers are available in Excel 2007 with the SQL Server 2008 Data Mining Add-ins for Office installed. Of course, Excel 2007 is often used as an end-user client tool. So, although the viewers in BIDS or SSMS aren't meant to be accessed by end users from BIDS, these same viewers are often used from within Excel 2007 by end users. This allows you, the developer, to have a nearly identical UI (from within BIDS or SSMS) as that of your end users. These viewers are also available as embeddable controls for developers to include in custom end-user applications.
This version of SSAS has tremendously enhanced the available methods for using data mining. These methods are expressed as algorithms—nine algorithms are included in SSAS 2008, and we discuss them in detail later in this chapter. Although some enhancements have been made to tuning capabilities and performance for SQL Server 2008, these algorithms provide nearly the same functionality as they did in SQL Server 2005. One of the most challenging aspects of data mining in SSAS is understanding what the various algorithms actually do and then creating a mining structure that includes the appropriate algorithm or algorithms to best support your particular business requirements. Another important consideration is how you will present this information to the end users. We believe that these two concerns have seriously reduced implementation of data mining solutions as part of business intelligence solutions. We find that neither developers nor end users can
visualize the potential benefits of SQL Server 2008 data mining technologies if developers can't provide both groups with reference samples. For you to be able to build such samples, you'll have to first think about business challenges that data mining technologies can impact. The following list is a sample of considerations we've encountered in our work designing business intelligence solutions for customers:
■■ What characteristics do our customers share? How could we group them, or put the types of customers into buckets? This type of information could be used, for example, to improve effectiveness of marketing campaigns by targeting different campaign types more appropriately, such as using magazine ads for customers who read magazines, TV ads for customers who watch TV, and so on.
■■ What situations are abnormal for various groups? This type of analysis is sometimes used for fraud detection. For example, purchasing behavior outside of normal locations, stores, or total amounts might be indicative of fraud for particular customer groups.
■■ What products or services should be marketed or displayed next to what other products or services? This is sometimes called market-basket analysis and can be used in scenarios such as deciding which products should be next to each other on brick-and-mortar store shelves, or for Web marketing, deciding which ads should be placed on which product pages.
■■ What will a certain value be (such as rate of sales per week) for an item or set of items at some point in the future, based on some values (such as the price of the item) that the item had in the past? An example of this would be a retailer that adjusts the price of a key item upward or downward based on sell-through rate for that price point for that type of item for particular groups of stores, thereby controlling the amount of inventory in each store of that particular item over time.
As we dive deeper into the world of data mining, we'll again use the sample Adventure Works DW 2008 data, which is available for download from CodePlex at http://www.codeplex.com/MSFTDBProdSamples/Release/ProjectReleases.aspx?ReleaseId=16040. To do this, open the same Adventure Works solution file that we've been using throughout this book in BIDS. The sample contains both OLAP cubes and data mining structures. When working with this sample, you can work in interactive or disconnected mode in BIDS when designing data mining models, just as we saw when working with OLAP cubes. We'll start by working in interactive (connected) mode. You'll note that the sample includes five data mining structures. We'll use these for the basis of our data mining discussion in this chapter.

Figure 12-1 (shown in disconnected mode) shows the sample data mining containers, called mining structures, in Solution Explorer in BIDS. Each mining structure contains one or more data mining models. Each mining model is based on a particular algorithm. As we drill in, we'll understand which business situations the selected algorithms are designed to impact.
Figure 12-1 The sample Adventure Works solution contains five different mining structures.
Categories of Data Mining Algorithms
You create a data mining structure in BIDS by using the Data Mining Wizard. As with OLAP cubes, when you create a new data mining structure, you must first define a data source and a data source view to be used as a basis for the creation of the new data mining structure. Data mining structures contain one or more data mining models. Each data mining model uses one of the nine included data mining algorithms. It is important that you understand what capabilities are included in these algorithms. Before we start exploring the individual algorithms, we'll first discuss general categories of data mining algorithms: classification, clustering, association, forecasting and regression, sequence analysis and prediction, and deviation analysis. This discussion will focus on the types of business problems data mining algorithms are designed to impact. Next we'll discuss which SSAS data mining algorithms are available in which category or categories.
Classification
With classification, the value of one or more fixed variables is predicted based on multiple input variables (or attributes). These types of algorithms are often used when a business has a large volume of high-quality historical data. The included algorithm most often used to implement this technique is Microsoft Decision Trees. The Microsoft Naïve Bayes and Neural Network algorithms can also be used. The Naïve Bayes algorithm is so named because it assumes all input columns are completely independent (or equally weighted). The Neural Network algorithm is often used with very large volumes of data that have very complex relationships. With this type of source data, Neural Network will often produce the most meaningful results of all of the possible algorithms.
Clustering
In clustering, source data is grouped into categories (sometimes called segments or buckets) based on a set of supplied values (or attributes). All attributes are given equal weight when determining the buckets. These types of algorithms are often used as a starting point to help end users better understand the relationships between attributes in a large volume of data. Businesses also use algorithms that create groupings of attributes, such as clustering-type algorithms, to make more intelligent, like-for-like predictions: If this store is like that store in these categories, it should perform similarly in this category. The included algorithm most often used to implement this technique is the Microsoft Clustering algorithm.
Association
Finding correlations between variables in a set of data is called association, or market-basket analysis. The goal of the algorithm is to find sets of items that show correlations (usually based on rates of sale). Association is used to help businesses improve results related to cross-selling. In brick-and-mortar locations, the results can be used to determine shelf placement of products. For virtual businesses, the results can be used to improve click-through rates for advertising. The included algorithm most often used to implement this technique is the Microsoft Association algorithm.
Forecasting and Regression
Like classification, forecasting and regression predict a value based on multiple input variables. The difference is that the predictable value is a continuous number. In forecasting, the input values usually contain data that is ordered by time. This is called a time series. Businesses use regression algorithms to predict the rate of sale of an item based on retail price, position in store, and so on, or to predict the amount of rainfall based on humidity, air pressure, and temperature. The included algorithm most often used to implement this technique is Microsoft Time Series. The Microsoft Linear Regression and Logistic Regression algorithms can also be used.
Sequence Analysis and Prediction
Sequence analysis and prediction find patterns in a particular subset of data. Businesses can use this type of algorithm to analyze the click-path of users through a commercial Web site. These paths or sequences are often analyzed over time—for example, what items did the customer buy on the first visit? What did the customer buy on the second visit? Sequence and association algorithms both work with instances (called cases in the language of data mining) that contain a set of items or states. The difference is that only sequence algorithms analyze the state transitions (the order or time series in which cases occurred). Association algorithms consider all cases to be equal. The included algorithms most often used to implement this technique are Microsoft Sequence Clustering and Microsoft Time Series.
Deviation Analysis
Deviation analysis involves finding exceptional cases in the data. In data mining (and in other areas, such as statistics), such cases are often called outliers. Businesses use this type of algorithm to detect potential fraud. One example is credit card companies that use this technique to initiate alerts (which usually result in a phone call to the end user, asking him or her to verify a particularly unusual purchase based on location, amount, and so on). The most common approach for this type of analysis is to use Microsoft Decision Trees in combination with one or more other algorithms (often Microsoft Clustering).
Working in the BIDS Data Mining Interface
We'll use the AdventureWorks BI sample as a basis for understanding the BIDS interface for data mining. The sample includes five data mining structures. Each structure includes one or more data mining models. Each model is based on one of the included data mining algorithms. As with the OLAP designer, you'll right-click folders in the Solution Explorer window to open wizards to create new data mining objects. One difference between working with OLAP objects and data mining objects in BIDS is that for the latter you'll use the Properties window more frequently to perform tuning after you've initially created data mining objects. Figure 12-2 shows the BIDS data mining structure design interface. Note the five tabs in the designer: Mining Structure, Mining Models, Mining Model Viewer, Mining Accuracy Chart, and Mining Model Prediction. The Properties window in the figure is highlighted.
Figure 12-2 The BIDS designer for data mining structures
Tip If you’re completely new to data mining, you might want to skip to Chapter 23, “Using Microsoft Excel 2007 as an OLAP Cube Client,” and read the explanation of the SQL Server 2008 Data Mining Add-ins for Excel 2007. Specifically, you’ll be interested to know that in addition to being a client interface for Analysis Services data mining, the Data Mining tab on the Excel 2007 Ribbon is also designed to be a simpler, alternative administrative tool to create, edit, and query data mining models. We have found that using Excel first provides a more accessible entry into the capabilities of SSAS data mining, even for developers and administrators. In Figure 12-2 the Properties window shows some values in the Data Type section that are probably new to you, such as Content, DiscretizationBucketCount, and so on. We’ll be exploring these values in greater detail in the next section.
Understanding Data Types and Content Types
Analysis Services data mining structures use data and content types specific to the Microsoft implementation of data mining. You need to understand these types when you build your mining structures. Also, certain algorithms support only certain content types. We'll start by explaining the general concepts of content and data type assignments and then, as we work our way through more detailed explanations of each algorithm, we'll discuss requirements for using specific types with specific algorithms. A data type is a data mining type assignment. Possible values are Text, Long, Boolean, Double, or Date. Data types are detected and assigned automatically when you create a data mining structure. A content type is an additional attribute that a mining model algorithm uses to understand the behavior of the data. For example, marking a source column as a Cyclical content type tells the mining algorithm that the order of the data is important and repetitive, or has a cycle to it. One example is month numbers spanning more than one year in a time table. The rule of thumb is to determine the data type first, then to verify (and sometimes adjust) the appropriate content type in your model. Remember that certain algorithms support certain content types only. For example, Naïve Bayes does not support the Continuous content type. The Data Mining Wizard detects content types when it creates the mining structure. The following list describes the content types and the data types that you can use with the particular content types. ■■
Discrete The column contains distinct values—for example, a specific number of children. It does not contain fractional values. Also, marking a column as Discrete does not indicate that the order (or sequence) of the information is important. You can use any data type with this content type.
■■
Continuous The column has values that are a set of numbers representing some unit of measurement. These values can be fractional. An example of this would be an outstanding loan amount. You can use the date, double, or long data type with this content type.
■■
Discretized The column has continuous values that are grouped into buckets. Each bucket is considered to have a specific order and to contain discrete values. You saw an example of this in Figure 12-2 using the Age column in the Targeted Mining sample. Note that you’ll also set the DiscretizationMethod and (optionally) the DiscretizationBucketCount properties if you mark your column as Discretized. In our sample, we’ve set the bucket size to 10 and DiscretizationMethod to Automatic. Possible values for discretization method are automatic, equal areas, or clusters. Automatic means that SSAS determines which method to use. Equal areas results in the input data being divided into partitions of equal size. This method works best with data with regularly distributed values. Clusters means that SSAS samples the data to produce a result that accounts for “clumps” of data values. Because of this sampling, Clusters can be used only with numeric input columns. You can use the date, double, long, or text data type with the Discretized content type.
■■
Table The column contains a nested table that has one or more columns and one or more rows. These columns can contain multiple values, but of these values at least one value must be related to the parent case record. An example would be individual customer information in a case table, with related customer purchase item information in a nested transaction table.
■■
Key The column is used as a unique identifier for a row. You can use the date, double, long, or text data type for this.
■■
Key Sequence The column is a type of a key—the sequence of key values is important to your model. You can use the double, long, text, or date data type with this content type.
■■
Key Time The Key Time column, similar to Key Sequence, is a type of key where the sequence of values is important. Additionally, by marking your column with this content type, you are indicating to your mining model that the key values run on a time scale. You can use the double, long, or date data type with this content type.
■■
Ordered The column contains data in a specific order that is important for your mining model. Also, when you mark a column with the Ordered content type, SSAS considers that all data contained is discrete. You can use any data type with this content type.
■■
Cyclical The column has data that is ordered and represents a set that cycles (or repeats). This is often used with time values (months of the year, for example). Data marked as Cyclical is considered both ordered and discrete. You can use any data type with this content type.
Note There is also a designator named Classified for columns. This refers to the ability to include information in a column that describes a different column in that same model. We rarely use this feature because the standard SSAS data mining algorithms don't support it.
Table 12-1 lists the data types and the content types they support. Keep in mind that understanding the concept of assigning appropriate content and data types is critical to successful model building.
Table 12-1 Data Types and Content Types
Data Type   Content Types Supported
Text        Discrete, Discretized, Sequence
Long        Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered (by sequence or by time)
Boolean     Discrete
Double      Continuous, Cyclical, Discrete, Discretized, Key Sequence, Key Time, Ordered (by sequence or by time)
Date        Continuous, Discrete, Discretized, Key Time
You can specify both the data type and the content type by using the Data Mining Wizard or by configuring the Properties windows in BIDS. Note that the Key Time and Key Sequence content types are used only by specific algorithms—for example, the Microsoft Time Series algorithm requires a column marked as Key Time, and the Microsoft Sequence Clustering algorithm works with a Key Sequence column.
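If you prefer working in code, data types and content types can also be declared directly in DMX when you create a structure. The following is only a rough sketch under our own assumptions; the structure name and columns are hypothetical and are not part of the Adventure Works sample:

-- Hypothetical structure illustrating data type and content type pairs.
CREATE MINING STRUCTURE [Customer Profile]
(
    [Customer Key]     LONG   KEY,
    [Age]              LONG   DISCRETIZED(AUTOMATIC, 10),  -- bucketed, like the Age column shown in Figure 12-2
    [Yearly Income]    DOUBLE CONTINUOUS,
    [Commute Distance] TEXT   DISCRETE,
    [Bike Buyer]       LONG   DISCRETE
    -- A nested table column could also be declared, for example:
    -- [Purchases] TABLE ( [Product Name] TEXT KEY )
)

The same data type and content type pairs that the wizard detects are simply written out as column definitions. Later sketches in this chapter reuse this hypothetical [Customer Profile] structure.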
Setting Advanced Data Properties
In addition to data and content types, you may wish to specify a few other properties so that the algorithm you select understands the source data better. This understanding improves the results produced by the algorithm. These properties are as follows: ■■
Modeling Flags These vary by algorithm but usually include the NOT NULL flag at a minimum.
■■
Relationship (between attributes) This is available only by using the DMX clause Related To between two attribute columns. It is used to indicate natural hierarchies and can be used with nested tables (defined later in this chapter).
■■
Distribution Normal indicates that the source data distribution resembles a bell-shaped histogram. Uniform indicates that the source data distribution resembles a flat curve where all values are equally likely. Log Normal indicates that the source data distribution is elongated at the upper end only. This attribute configuration is optional; you generally use this only when you are using source data that is dirty or does not represent what you expect, such as feeding data that really should match a (normal) bell curve, but doesn't.
In BIDS, the first tab in the mining structure designer is named Mining Structure. Here you can see the source data included in your mining structure (which is, in fact, a DSV). As with the OLAP cube designer in BIDS, in this work area you can only view the source data from the DSV—you cannot change it in any way, such as by renaming columns, adding calculated columns, and so on. You have to use the original DSV to make any structural changes to the DSV. In this view you can only view the source data, add or remove mining structure columns, or add nested tables to the mining structure. You can use either relational tables or multidimensional cubes as source data for Analysis Services data mining structures. If you choose relational data, that source data can be retrieved from one or more relational tables, each with a primary key. Optionally the source data can include a nested table. An example of this would be customers and their orders: The Customers table would be the case table and Orders would be a nested table. This situation would require a primary key/foreign key relationship between the rows in the two tables as well. One way to add a nested table to a data mining structure is to right-click the Object Browser tree on the Mining Structure tab, and then click Add A Nested Table. You are then presented with the dialog box shown in Figure 12-3. Select the table you wish to nest from the DSV. Note that you can filter the columns in the nested table by data type.
Tip It is very important that you understand and model nested tables correctly (if you are using them). For more information read the SQL Server Books Online topic "Nested Tables (Analysis Services—Data Mining)."
Figure 12-3 Nested tables in a mining model structure
You can configure several properties for the data mining structure in this view. An example is the CacheMode property. Your choices are KeepTrainingCases or ClearAfterProcessing. The latter option is often used during the early development phase of a mining project. You may process an individual mining model only to find that the data used needs further cleaning. In this case, you’d perform the subsequent cleaning, and then reprocess that model.
Alternatively, you can perform a full process on the entire mining structure. If you do this, all mining models that you have defined inside of the selected mining structure are processed. As with OLAP cube processing, data mining processing includes both processing of metadata and data itself. Metadata is expressed as XMLA; data is retrieved from source systems and loaded into destination data mining structures using DMX queries. Chapter 13 includes more detail on this process.
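To make the training step concrete, here is a hedged sketch of such a DMX query, continuing the hypothetical [Customer Profile] structure from the earlier sketch. The data source name and the relational view it queries are assumptions, not prescriptions:

-- Train (process) the structure by binding relational rows to structure columns.
-- [Adventure Works DW 2008] and dbo.vTargetMail are placeholders for your own source.
INSERT INTO MINING STRUCTURE [Customer Profile]
(
    [Customer Key], [Age], [Yearly Income], [Commute Distance], [Bike Buyer]
)
OPENQUERY([Adventure Works DW 2008],
    'SELECT CustomerKey, Age, YearlyIncome, CommuteDistance, BikeBuyer
     FROM dbo.vTargetMail')

Processing the structure this way trains every model it contains, which mirrors the full-process behavior described above.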
Choosing a Data Mining Model
The next tab in the mining structure designer in BIDS is the Mining Models tab. Here you view the mining model(s) that you've included in the selected mining structure. You can easily add new models to your structure by right-clicking the designer surface and then clicking New Mining Model. You can also change the source data associated with a type of mining model by creating more than one instance of that model and "ignoring" one or more columns from the mining structure DSV. Ignoring a column of data in a particular mining model is shown (for the Yearly Income column) using the Microsoft Naïve Bayes algorithm in Figure 12-4.
Figure 12-4 The Mining Models tab in BIDS allows you to specify the usage of each source column for each mining model.
You can change the use of the associated (non-key) source columns in the following ways (a DMX sketch of these settings follows the list):
■■ Ignore This setting causes the model to remove the column from the model.
■■ Input This setting causes the model to use that column as source data for the model.
■■ Predict This setting causes the model to use that column as both input and output.
■■ PredictOnly This setting causes the model to use that column as output only.
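As a rough DMX equivalent of these settings, again using the hypothetical [Customer Profile] structure rather than the Adventure Works sample: omitting a structure column corresponds to Ignore, listing it plainly makes it an input, and the PREDICT and PREDICT_ONLY flags correspond to the last two settings.

ALTER MINING STRUCTURE [Customer Profile]
ADD MINING MODEL [Customer Profile NB]
(
    [Customer Key],
    [Age],                 -- Input
    [Commute Distance],    -- Input
    [Bike Buyer] PREDICT   -- Predict (input and output); PREDICT_ONLY would make it output only
    -- [Yearly Income] is omitted, the equivalent of the Ignore setting
)
USING Microsoft_Naive_Bayes

Leaving out [Yearly Income] also sidesteps the Naïve Bayes restriction on continuous columns discussed later in this chapter.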
Note The specific requirements for input and predictable columns for each type of algorithm are discussed in Chapter 13.
Nested tables are another point to consider when you are deciding how to mark columns for use by data mining algorithms. If your source model includes a nested table and you've marked that table as Predict (or PredictOnly), all of its nested attributes are automatically marked as predictable. For this reason, you should include only a small number of attributes in a nested table marked as predictable.
As we continue our tour of Analysis Services data mining, it is interesting to note that the Mining Models tab is designed to support a key development activity—that is, building multiple models using the same source data. You may wonder why you'd want to do that. Because we haven't explored the algorithms in detail yet, this capability will probably be a bit mysterious to you at this point. Suffice it to say that this ability to tinker and adjust by adding, tuning, or removing mining models is a key part of using SSAS data mining successfully. Predictive analytics is not an exact science; it is more of an art. You apply what you think will be the most useful algorithms to get the best results. Particularly at the beginning of your project, you can expect to perform a high number of iterations to get this tuning just right. You will inevitably try adjusting the number of input data columns, the algorithm used, the algorithm parameters, and so on so that you can produce useful results. You'll also test and score each model as you proceed; in fact, SSAS includes tools to do this so that you can understand the usefulness of the various mining model results. We will return to this topic later in this chapter after we review the capabilities of the included algorithms.
Filtering is a new capability in SQL Server 2008 data mining models. You can build mining models on filtered subsets of the source data without having to create multiple mining structures. You can create complex filters on both cases and nested tables using BIDS. To implement filtering, right-click the model name on the BIDS Mining Models tab and then click Set Model Filter. You are then presented with a blank model filter dialog box, where you can configure the desired filter values. We show a sample in Figure 12-5.
Another enhancement to model building included in SQL Server 2008 is the ability to alias model column names. This capability allows for shorter object names. You can implement it using BIDS or with the ALTER MINING STRUCTURE (DMX) syntax. To use this syntax, you must have first created a mining structure; you then add a mining model (which can include aliased column names) to that structure. For detailed syntax, see the SQL Server Books Online topic "ALTER MINING STRUCTURE (DMX)" at http://msdn.microsoft.com/en-us/library/ms132066.aspx.
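Returning to the filtering capability described above, in DMX a filtered model is added to the structure with a WITH FILTER clause. A brief sketch, again on the hypothetical [Customer Profile] structure and with an assumed filter expression:

ALTER MINING STRUCTURE [Customer Profile]
ADD MINING MODEL [Customer Profile NB Over 40]
(
    [Customer Key],
    [Age],
    [Commute Distance],
    [Bike Buyer] PREDICT
)
USING Microsoft_Naive_Bayes
WITH FILTER([Age] > 40)   -- the filter expression here is illustrative only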
Figure 12-5 Mining models based on filtered subsets of source data
You can configure the algorithm parameters for each mining model in the mining structure by right-clicking the mining model on the designer surface and then clicking Set Algorithm Parameters. The available parameters vary depending on which mining algorithm you are working with and the edition of SQL Server 2008. Several advanced configuration properties are available in the Enterprise edition of SQL Server 2008 only. Figure 12-6 shows the configurable parameters for the Microsoft Decision Trees model. Note that when you select one of the properties, the configuration dialog box shows you a brief definition of the configurable property value.
Figure 12-6 The Algorithm Parameters dialog box in BIDS, in which you can manually configure advanced algorithm properties
As you become a more advanced user of data mining, for select algorithms, you may also add your own custom parameters (and configure their values) via this dialog box. Of course the available parameters vary for each included algorithm. You should document any changes you make to default values and you should have a business justification for making such changes. In many cases you will find that you don’t need to make any changes to default values—in fact, making changes without a full understanding can result in decreased performance and overall effectiveness of the selected algorithm.
Picking the Best Mining Model Viewer
The next tab in the mining structure designer in BIDS is the Mining Model Viewer. An interesting aspect of this section is that each mining model algorithm includes one or more types of mining model viewers. The purpose of the broad variety of viewers is to help you to determine which mining model algorithms are most useful for your particular business scenario. The viewers include both graphical and text (rows and columns) representations of data. Some of the viewers include multiple types of graphical views of the output of the mining model data. Additionally, some of the viewers include a mining legend shown in the Properties window of the designer surface. Each algorithm has a collection of viewers that is specific to that algorithm. These viewers usually present the information via charts or graphs. In addition, the Microsoft Generic Content Tree Viewer is always available, providing very detailed information about each node in the mining model.
Tip The mining model viewers are also available in SSMS. To access them, connect to SSAS in SSMS, right-click the particular mining structure in the Object Explorer, and then click Browse. The viewers are also available in Excel 2007 (after you've installed the SQL Server 2008 Data Mining Add-Ins for Office 2007) via the Data Mining tab on the Ribbon. Excel is intended as an end-user interface. Application developers can also download embeddable versions of these viewers and incorporate them into custom applications. You can download the embeddable controls at http://www.sqlserverdatamining.com/ssdm/Home/Downloads/tabid/60/Default.aspx.
The following list shows the nine data mining algorithms available in SQL Server 2008. Although some of the algorithms have been enhanced, no new algorithms are introduced in this product release. This list is just a preview; we cover each algorithm in detail later in the chapter. ■■
Microsoft Association
■■
Microsoft Clustering
■■
Microsoft Decision Trees
■■
Microsoft Linear Regression
■■
Microsoft Logistic Regression
■■
Microsoft Naïve Bayes
■■
Microsoft Neural Network
■■
Microsoft Sequence Clustering
■■
Microsoft Time Series
Again using the Adventure Works DW 2008 sample, we now look at some of the viewers included for each algorithm. Using the sample mining structure named Targeted Mailing, we can take a look at four different viewers, because this structure includes four mining models, each based on a different mining model algorithm. After you open this structure in the designer in BIDS and click the Mining Model Viewer tab, the first listed mining model, which is based on the Microsoft Decision Trees algorithm, opens in its default viewer.
Note Each algorithm includes one or more viewer types. Each viewer type contains one or more views. An example is the Microsoft Decision Trees algorithm, which ships with two viewer types: Microsoft Tree Viewer and Microsoft Generic Content Tree Viewer. The Microsoft Tree Viewer contains two views: Decision Tree and Dependency Network. The Microsoft Generic Content Tree Viewer contains a single view of the same mining model, but in a different visual format. Are you confused yet? This is why we prefer to start with the visuals!
In addition to the two types of views shown in the default viewer (Figure 12-7) of the Microsoft Decision Trees algorithm, you can further customize the view by adjusting view parameters. The figure shows a portion of the Decision Tree view with its associated mining legend. It shows the most closely correlated information at the first level, in this case, number of cars owned. The depth of color of each node is a visual cue to the amount of association—darker colors indicate more association. Note that the mining legend reflects the exact number of cases (or rows) for the particular node of the model that is selected. It also shows the information via a probability column (percentage) and a histogram (graphical representation). In the diagram, the selected node is Number Of Cars Owned = 2. We've also set the Background filter to 1, indicating that we wish to see data for those who actually purchased a bicycle, rather than for all cases. Note also that the default setting for levels is 3. This particular model contains six levels; however, viewing them all on this surface is difficult.
Note If you set the levels or default expansion settings to the maximum included in the model (six), you can observe one of the challenges of implementing data mining as part of a BI solution—effective visualization. We'll talk more about this topic as we continue on through the world of SSAS data mining; suffice it to say at this point that we've found the ability to provide appropriate visualizations of results to be a key driver of success in data mining projects. The included viewers are a start in the right direction; however, we've found that in the real world it is rare for our clients or other developers to spend enough time thoroughly understanding what is included before they try to buy or build other visualizers.
Figure 12-7 The Microsoft Tree Viewer for the Microsoft Decision Trees algorithm
The other type of built-in view for the Microsoft Tree Viewer in BIDS for this algorithm is the Dependency Network. This view allows you to quickly see which data has the strongest correlation to a particular node. You can adjust the strength of association being shown by dragging the slider on the left of the diagram up or down. Figure 12-8 shows the Dependency Network for the same mining structure that we’ve been working with. You’ll note that the three most correlated factors for bike purchasing are yearly income, number of cars owned, and region. Tip We’ve found that the Dependency Network view is one of the most effective and universally understood. We use it early and often in our data mining projects. Remember that all of the viewers found in BIDS are available in SSMS and, most important, as part of Excel 2007 after you install the SQL Server 2008 Data Mining Add-ins for Office 2007. Depending on the sophistication and familiarity of the client (business decision maker, developer, or analyst), we’ve sometimes kept our initial discussion of viewers and algorithms to Excel rather than BIDS. We do this to reduce the complexity of what we are demonstrating. We find this approach works particularly well in proof-of-concept discussions.
Figure 12-8 The Dependency Network view for the Microsoft Decision Trees algorithm
As with the Decision Tree view for the Microsoft Tree Viewer, the Dependency Network view includes some configurable settings. Of course, the slider on the left is the most powerful. Note that this view, like most others, also contains an embedded toolbar that allows you to zoom/pan and to otherwise tinker with the viewable information. At the bottom of this viewer you’ll also note the color-coded legend, which assists with understanding of the results. In contrast to the simplicity of the Dependency Network view, the Generic Content Tree Viewer shown in Figure 12-9 presents a large amount of detail. It shows the processed results in rows and columns of data. For certain mining models, this viewer will include nested tables in the results as well. This viewer includes numeric data for probability and variance rates. We’ve found that this level of information is best consumed by end users who have a formal background in statistics. In addition to the information presented by this default viewer, you can also query the models themselves using the DMX and XMLA languages to retrieve whatever level of detail you desire for your particular solution.
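For example, the node-level detail that the Generic Content Tree Viewer displays can also be retrieved with a DMX content query. A minimal sketch follows; the model name assumes the sample's naming, so adjust it to your own model:

-- Returns one row per node of the model, with support and probability figures
-- similar to what the Generic Content Tree Viewer displays.
SELECT NODE_CAPTION, NODE_TYPE, NODE_SUPPORT, NODE_PROBABILITY
FROM [TM Decision Tree].CONTENT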
Figure 12-9 The Microsoft Generic Content Tree Viewer for the Microsoft Decision Trees algorithm is complex and detailed.
To continue our journey through the available viewers, select TM Clustering from the Mining Model drop-down list on the Mining Model Viewer tab. You can see that you have new types of viewers to select from. The Microsoft Cluster Viewer includes the following four different views, which are reflected in nested tabs in the viewer: Cluster Diagram, Cluster Profiles, Cluster Characteristics, and Cluster Discrimination. We suggest that you now take the time to look at each sample data mining structure that is included in the AdventureWorksDW 2008 sample, its associated models, and their associated viewers. Looking at these visualizers will prepare you for our discussion of the guts of the algorithms, which starts shortly. One capability available in the viewers might not be obvious. Some of the included algorithms allow the viewing of drillthrough data columns. An improvement to SQL Server 2008 is that for some algorithms drillthrough is available to all source columns in the data mining structure, rather than just those included in the individual model. This allows you to build more compact models, which is important for model query and processing performance. If you choose to use drillthrough, you must enable it when you create the model. The following algorithms do not support drillthrough: Naïve Bayes, Neural Network, and Logistic Regression. The Time Series algorithm supports drillthrough only via a DMX query, not in BIDS. Figure 12-10 shows the results of right-clicking, clicking Drill Through, and then clicking Model And Structure Columns on the Cluster 10 object in the Cluster Diagram view
of the TM Clustering model. Of course, if you choose to use drillthrough in your BI solution, you must verify that any selected end-user tools also support this capability.
Figure 12-10 Drillthrough results window from the Mining Model Viewer in BIDS for the TM Clustering sample
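For models that have drillthrough enabled, the same drillthrough can also be issued as a DMX query against the model's cases; as noted above, this is the only route available for the Time Series algorithm. The following is a hedged sketch only: the node ID and the structure column pulled in with StructureColumn are illustrative, not taken from the sample.

-- Drill through to the training cases that fall in one cluster node.
-- '010' is a hypothetical node ID; StructureColumn() reaches a column that exists
-- in the mining structure even if it is not part of this particular model.
SELECT [Age], [Bike Buyer],
       StructureColumn('Yearly Income') AS [Yearly Income]
FROM [TM Clustering].CASES
WHERE IsInNode('010')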
Before we begin our investigation of the data mining algorithms, let’s briefly look at the other two tabs that are part of the BIDS mining structure designer. At this point we just want to get a high-level overview of what type of activity is performed here. Because a great deal of power (and complexity) is associated with these included capabilities, we take a deeper look at the last two tabs (Mining Accuracy Chart and Mining Model Prediction) in the next chapter.
Mining Accuracy Charts and Prediction
The next tab in the BIDS mining structure designer is the Mining Accuracy Chart tab. Here you can validate (or compare) your model against some actual data to understand how accurate your model is and how well it will work to predict future target values. This tool is actually quite complex, containing four nested tabs named Input Selection, Lift Chart, Classification Matrix, and Cross Validation. You might be surprised that we are looking at these sections of the mining tools in BIDS before we review the detailed process for creating mining structures and models. We have found in numerous presentations that our audiences tend to understand the whys of model structure creation details more thoroughly if we first present the information in this section. So bear with us as we continue our journey toward understanding how to best design, develop, validate, and use mining models.
Note The interface for this tab has changed in a couple of ways in SQL Server 2008. One way reflects a change to the capabilities of model building. That is, now the Model Creation Wizard includes a page that allows you to automatically create a training (sub)set of your source data. We talk more about this in Chapter 13 when we go through that wizard.
Figure 12-11 shows the Input Selection nested tab of the Mining Accuracy Chart tab. Note that you can configure a number of options as you prepare to validate the mining models in a particular mining structure. The options are as follows: ■■
Synchronize Prediction Columns And Values (selected by default)
■■
Select one or more included Mining Models (all are selected by default)
■■
Configure the Predictable Column Name and value for each model. In this case all models have only one predictable column. We’ve set the value to 1 (bike buyers).
■■
Select the testing set. Either use the set automatically created when the model was created, or manually specify a source table or view. If you manually specify, then you should validate the automatically generated column mappings by clicking the Build (…) button.
■■
(Optional) Specify a filter for the manually associated testing set by creating a filter expression.
Figure 12-11 The Mining Accuracy Chart tab allows you to validate the usefulness of one or more mining models.
The results of the Mining Accuracy Chart tab are expressed in multiple ways, including a lift chart, a profit chart, a classification matrix, or a cross validation. Generally they allow you to assess the value of the results your particular model predicts. These results can be complex to interpret, so we’ll return to this topic in the next chapter (after we’ve learned the mechanics of building and processing data mining structures). Also note that the cross validation capability is a new feature introduced in SQL Server 2008.
The next tab in the BIDS designer is the Mining Model Prediction tab. Here you can create predictions based on associating mining models with new external data. When you work with this interface, what you are actually doing is visually writing a particular type of DMX— specifically a prediction query. The DMX language contains several prediction query types and keywords to implement these query types. We look at DMX in more detail in the next chapter. When you first open this interface, the first data mining model in the particular structure will be populated in the Mining Model window. You can select an alternative model from the particular structure by clicking the Select Model button at the bottom of the Mining Model window. The next step is to specify the source of the new data. You do this by clicking the Select Case Table button in the Select Input Table(s) window. After you select a table, the designer will automatically match columns from source and destination with the same names. To validate the automatically generated column mappings, you simply right-click in the designer and then click Modify Connections. This opens the Modify Mapping window. After you’ve selected the source table(s) and validated the column mappings, you’ll use the lower section of the designer to create the DMX query visually. By using the first button on the embedded toolbar you can see the DMX that the visual query designer has generated, or you can execute the query. Remember that you can write and execute DMX queries in SSMS as well. Figure 12-12 shows the interface in BIDS for creating a prediction query.
Figure 12-12 The Mining Model Prediction tab allows you to generate DMX prediction queries.
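The DMX that this designer generates typically has the shape sketched below. The data source name, the relational table, and the column mappings are placeholders for whatever new data you map in the designer, so treat this only as an illustration of the prediction query pattern:

-- Predict bike buying (and its probability) for new prospects.
SELECT
    t.[ProspectiveBuyerKey],
    [TM Decision Tree].[Bike Buyer] AS [Predicted Bike Buyer],
    PredictProbability([Bike Buyer], 1) AS [Probability Of Buying]
FROM [TM Decision Tree]
PREDICTION JOIN
    OPENQUERY([Adventure Works DW 2008],
        'SELECT ProspectiveBuyerKey, YearlyIncome, NumberCarsOwned
         FROM dbo.ProspectiveBuyer') AS t
ON  [TM Decision Tree].[Yearly Income] = t.[YearlyIncome]
AND [TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]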
We’ve now covered our initial tour through the BIDS interface for data mining. We certainly haven’t seen everything yet. As we move to the next level, it’s time to dive deeper into the particular capabilities of the included algorithms. Understanding what each does is, of course, key to implementing SSAS data mining successfully in your BI project.
Data Mining algorithms Now we’ll begin to work through the capabilities of all included algorithms. We’ll take a look at them in order of complexity, from the simplest to the most complex. For each algorithm, we discuss capabilities, common configuration, and some of the advanced configurable properties. Before we start reviewing the individual algorithms, we cover a concept that will help you understand how to select the mining algorithm that best matches your business needs. This idea is called supervision—algorithms are either supervised or unsupervised. Supervised mining models require you to select both input and predictable columns. Unsupervised mining models require you to select only input columns. When you are building your models, SSAS presents you with an error dialog box if you do not configure your model per the supervision requirements. The unsupervised algorithms are Clustering, Linear Regression, Logistic Regression, Sequence Clustering, and Time Series. The supervised algorithms are Association, Decision Trees, Naïve Bayes, and Neural Network.
Microsoft Naïve Bayes
Microsoft Naïve Bayes is one of the most straightforward algorithms available to you in SSAS. It is often used as a starting point to understanding basic groupings in your data. This type of processing is generally characterized as classification. The algorithm is called naïve because no one attribute has any higher significance than another. It is named after Thomas Bayes, who envisioned a way of applying mathematical (probability) principles to understanding data. Another way to understand this is that all attributes are treated as independent, or not related to one another. Literally, the algorithm simply counts correlations between all attributes. Although it can be used for both predicting and grouping, Naïve Bayes is most often used during the early phases of model building. It's more commonly used to group rather than to predict a specific value. Typically you'll mark all attributes as either simple input or both input and predictable, because this asks the algorithm to consider all attributes in its execution. You may find yourself experimenting a bit when marking attributes. It is quite typical to include a large number of attributes as input, then to process the model and evaluate the results. If the results don't seem meaningful, we often reduce the number of included attributes to help us to better understand the most closely correlated relationships. You might use Naïve Bayes when you are working with a large amount of data about which you know little. For example, your company may have acquired sales data after purchasing a competitor. We use Naïve Bayes as a starting point when we work with this type of data. You should understand that this algorithm contains a significant restriction: Only discrete (or discretized) content types can be evaluated. If you select a data structure that includes data columns marked with content types other than Discrete (such as Continuous), those columns will be ignored in mining models that you created based on the Naïve Bayes algorithm. This algorithm includes only a small number of configurable properties. To view
the parameters, we’ll use the Targeted Mailing sample. Open it in BIDS, and then click the Mining Models tab. Right-click the model that uses Naïve Bayes and then click Set Algorithm Parameters. You’ll see the Algorithm Parameters dialog box, shown in Figure 12-13.
Figure 12-13 The Algorithm Parameters dialog box allows you to view and possibly change parameter values.
Four configurable parameters are available for the Naïve Bayes algorithm: MAXIMUM_INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES, MAXIMUM_STATES, and MINIMUM_DEPENDENCY_PROBABILITY. You can change the configured (default) values by typing the new value in the Value column. As mentioned previously, configurability of parameters is dependent on the edition of SQL Server you are using. This information is noted in the Description section of the Algorithm Parameters dialog box. You might be wondering how often you'll be making adjustments to the default values for the algorithm parameters. We find that as we become familiar with the capabilities of particular algorithms, we tend to begin using manual tuning. Because Naïve Bayes is frequently used in data mining projects, particularly early in the project, we do find ourselves tinkering with its associated parameters. The first three are fairly obvious: Adjust the configured value to reduce the maximum number of input values, output values, or possible grouping states. The last parameter is less obvious: it acts as a threshold that limits how many attribute relationships the processed model reports.
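If you prefer to script this rather than use the dialog box, algorithm parameters are passed in the USING clause when the model is created. A sketch, once more against the hypothetical [Customer Profile] structure and with assumed parameter values:

ALTER MINING STRUCTURE [Customer Profile]
ADD MINING MODEL [Customer Profile NB Tuned]
(
    [Customer Key],
    [Age],
    [Commute Distance],
    [Bike Buyer] PREDICT
)
USING Microsoft_Naive_Bayes
    (MINIMUM_DEPENDENCY_PROBABILITY = 0.25, MAXIMUM_STATES = 50)  -- example values only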
Feature Selection
SSAS data mining includes a capability called feature selection. This setting is applied before the model is trained (loaded with source data). Feature selection automatically chooses the attributes in a dataset that are most likely to be used in the model. If feature selection is used during mining model processing, you will see that reflected in the detailed execution statements listed in the mining model processing window. Feature selection works on both input and predictable attributes. It can also work on the number of states in a column, depending on the algorithm being used in the mining model. Only the input attributes and states that the algorithm selects are included in the model-building process and can be used for prediction. Predictable columns that are ignored by feature selection are used for prediction, but the predictions are based only on the global statistics that exist in the model. To implement feature selection, SSAS uses various methods
(documented in the SQL Server Books Online topic "Feature Selection in Data Mining") to determine what is called the "interestingness score" of attributes. These methods depend on the algorithm used. It is important to understand that you can affect the invocation and execution of these various methods of determining which attributes are most interesting to the model. You do this by changing the configured values for the following mining model parameters: MAXIMUM_INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES, and MAXIMUM_STATES.
Tip The Algorithm Parameters dialog box shows only a partial list of the configurable parameters for each algorithm. For several algorithms, it shows none at all. If you wish to add a configurable parameter and a value, click Add at the bottom of the dialog box. If you search on the name of the particular algorithm in SQL Server Books Online, you can review the complete list of configurable parameters for each algorithm. If you have experience with SQL Server 2005 data mining, you'll notice that for many algorithms using SQL Server 2008 (BIDS), the Algorithm Parameters dialog box will show you more configurable parameters than were shown in the 2005 edition.
In a nutshell, you can think of feature selection as a kind of built-in improvement algorithm—that is, it uses an internal algorithm to try to improve the quality of your mining model results. If you are new to data mining, you'll probably just want to let it run as is. As you become a more advanced user, you may want to guide or override feature selection's execution using the method described in the previous paragraph. We find feature selection to be helpful because of the many unknowns you can encounter when you work with data mining models: quality of data, uncertainty about which data to include in a model, choice of algorithm, and so on. Feature selection attempts to intelligently narrow the results of data mining model processing to create a more targeted and more meaningful result. We particularly find this to be useful during the early stages of the data mining life cycle—for example, when we are asked to mine new data, perhaps purchased from a competitor. We often use the Naïve Bayes algorithm in such situations, and we particularly find feature selection useful in combination with less precise algorithms such as Naïve Bayes.
The Microsoft Naïve Bayes Viewer includes four types of output views: Dependency Network, Attribute Profiles, Attribute Characteristics, and Attribute Discrimination. We often use the Dependency Network view because its output is easy to understand. It simply shows the related attributes and the strength of their relationship to the selected node. (You adjust the view by using the slider at the left of the view.) This view (also included with the Microsoft Tree Viewer) was shown in Figure 12-8. The Attribute Profiles view, a portion of which is shown in Figure 12-14, shows you how each input attribute relates to each output attribute. You can rearrange the order of the attributes shown in this view by clicking and dragging column headers in the viewer.
Figure 12-14 The Attribute Profiles view for Naïve Bayes provides a detailed, attribute-by-attribute look at the algorithm results.
You can change the view by adding or removing the States legend, by changing the number of histogram bars, or by changing the viewed predictable attribute (though only if you’ve built a model that contains more than one predictable value, of course). You can also hide columns by right-clicking the desired column and then clicking Hide. Note that the options below the Drillthrough option on the shortcut menu are not available. This is for two reasons: First, drillthrough is not enabled by default for any mining model. Second, the algorithm on which this particular sample model is built, Naïve Bayes, does not support drillthrough. On the next tab, the Attribute Characteristics view, you can see the probability of all attributes as related to the predicted value output. In this example, the default value is set to a state of 0, which means “does not buy a bicycle.” The default sort is by strongest to weakest correlation for all attributes. If you wish to adjust this view—for example, to sort within attribute values—simply click the column header to re-sort the results. Figure 12-15 shows a portion of the Attribute Characteristics view set to sort by Attributes. By sorting in this view, you can easily see that a short commute distance correlates most strongly (of all attributes in view) to the state of not purchasing a bicycle.
Figure 12-15 The Attribute Characteristics view for Naïve Bayes allows you to see attribute correlation in several different sort views.
On the last tab, the Attribute Discrimination view, you can compare the correlations between attributes that have two different states. Continuing our example, we see in Figure 12-16 that the attribute value of owning 0 cars correlates much more significantly to the predicted state value of buying a bicycle (Value 2 drop-down list set to 1) than to the state of not buying a bicycle. You can further see that the next most correlated factor is the Age attribute, with a value of 35-40. As with the Attribute Characteristics view, you can re-sort the results of the Attribute Discrimination view by clicking any of the column headers in the view. It is important to make sound decisions based on the strength of the correlations. To further facilitate that, in this view you can right-click any of the data bars and then click Show Legend. This opens a new window that shows you the exact count of cases that support the view produced. For example, opening the legend for the attribute value Number Of Cars Owned shows the exact case (row) count to support all of the various attribute states: cars owned = 0 or != 0, and bicycles purchased = 0 or 1. These results are shown in a grid. As previously mentioned, Naïve Bayes is a simple algorithm that we often use to get started with data mining. The included views are easy to understand and we often show such results directly to customers early in the data mining project life cycle so that they can better
understand their data and the possibilities of data mining in general. We turn next to a very popular algorithm, Microsoft Decision Trees.
Figure 12-16 The Attribute Discrimination view for Naïve Bayes allows you to compare two states and their associated attribute values.
Microsoft Decision Trees Algorithm
Microsoft Decision Trees is probably the most commonly used algorithm, in part because of its flexibility—decision trees work with both discrete and continuous attributes—and also because of the richness of its included viewers. It's quite easy to understand the output via these viewers. This algorithm is used both to view and to predict. It is also used (usually in conjunction with the Microsoft Clustering algorithm) to find deviant values. The Microsoft Decision Trees algorithm processes input data by splitting it into recursive (related) subsets. In the default viewer, the output is shown as a recursive tree structure. If you are using discrete data, the algorithm identifies the particular inputs that are most closely correlated with particular predictable values, producing a result that shows which columns are most strongly predictive of a selected attribute. If you are using continuous data, the algorithm uses standard linear regression to determine where the splits in the decision tree occur. Figure 12-17 shows the Decision Tree view. Note that each node has a label to indicate the value. Clicking a node displays detailed information in the Mining Legend window. You can configure the view using the various drop-down lists at the top of the viewer, such as Tree, Default Expansion, and so on. Finally, if you've enabled drillthrough on your model, you can display the drillthrough information—either columns from the model or (new to SQL Server 2008) columns from the mining structure, whether or not they are included in this model. The drillthrough result for Number Cars Owned = 0 is shown in the Drill Through window in Figure 12-17.
Figure 12-17 By adjusting the COMPLEXITY_PENALTY value, you can prune your decision tree, making the tree easier to work with.
Microsoft Decision Trees is one of the algorithms that we have used most frequently when implementing real-world data mining projects. Specifically, we've used it in business marketing scenarios to determine which attributes are more closely grouped to which results. We have also used it in several law-enforcement scenarios to determine which traits or attributes are most closely associated to offender behaviors. When you are working with the Microsoft Decision Trees algorithm, the value of the results can be improved if you pre-group data. You can do this using ETL processes or by using standard queries against the source data before you create a data mining structure. Another important consideration when using this algorithm is to avoid overtraining your model. You can impact this by adjusting the value of the COMPLEXITY_PENALTY parameter in the Algorithm Parameters dialog box. By adjusting this number you can change the complexity of your model, usually reducing the number of inputs to be considered and thereby reducing the size of your decision tree. For example, the default value is 0.5 when the model has from 1 to 9 input attributes and 0.9 when it has from 10 to 99 input attributes; raising the value applies a stronger penalty and produces a smaller tree. Another capability of the Microsoft Decision Trees algorithm is to create multiple result trees. If you set more than one source column as predictable (or if the input data contains a nested table that is set to predictable), the algorithm builds a separate decision tree for each predictable source column. You can select which tree to display in the Decision Tree view from the Tree drop-down list.
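A DMX sketch of how such a model might be declared, with an assumed COMPLEXITY_PENALTY value and drillthrough enabled; the structure is still our hypothetical [Customer Profile], not the sample:

ALTER MINING STRUCTURE [Customer Profile]
ADD MINING MODEL [Customer Profile DT]
(
    [Customer Key],
    [Age],
    [Yearly Income],       -- continuous inputs are fine for decision trees
    [Commute Distance],
    [Bike Buyer] PREDICT
)
USING Microsoft_Decision_Trees (COMPLEXITY_PENALTY = 0.9)  -- 0.9 is an example value
WITH DRILLTHROUGH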
The Dependency Network view is also available for the Microsoft Decision Trees algorithm. It looks and functions similarly to the way it does with the Naïve Bayes algorithm. As with Naïve Bayes, you can remove nodes that are less closely related to the selected node by adjusting the slider to the left of the view. This view was shown in Figure 12-8.
Tip If the result of using this algorithm hides some nodes you wish to include in your mining model output, consider creating another mining model using the same source data, but with a more flexible algorithm, such as Naïve Bayes. You'll be able to see all nodes in your output.
Microsoft Linear Regression Algorithm
Microsoft Linear Regression is a variation of the Microsoft Decision Trees algorithm, and works like classic linear regression—it fits the best possible straight line through a series of points (the sources being at least two columns of continuous data). This algorithm calculates all possible relationships between the attribute values and produces more complete results than other (non–data mining) methods of applying linear regression, which generally use progressive splitting techniques between the source inputs. In addition to a key column, you can use only columns of the continuous numeric data type. Another way to understand this is that it disables splits. You use this algorithm to visualize the relationship between two continuous attributes. For example, in a retail scenario, you might want to create a trend line between physical placement locations in a retail store and rate of sale for items. The algorithm result is similar to that produced by any other linear regression method in that it produces a trend line. The configurable parameters are MAXIMUM_INPUT_ATTRIBUTES, MAXIMUM_OUTPUT_ATTRIBUTES, and FORCE_REGRESSOR. This algorithm is used to predict continuous attributes. When using this algorithm, you mark one attribute as a regressor. The regressor attribute must be marked as a Continuous content type. This attribute will be used as a key value in the regression formula. You can manually set a source column as a regressor by using the FORCE_REGRESSOR parameter. Alternatively, you can set the DMX REGRESSOR flag on your selected column.
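As a rough sketch of what this looks like in DMX, the following defines a standalone regression model with the REGRESSOR flag on the input column; the model and column names are our own illustration, not part of the sample:

CREATE MINING MODEL [Income By Age LR]
(
    [Customer Key]  LONG   KEY,
    [Age]           LONG   CONTINUOUS REGRESSOR,   -- the regressor input
    [Yearly Income] DOUBLE CONTINUOUS PREDICT      -- the continuous value being predicted
)
USING Microsoft_Linear_Regression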
Microsoft Time Series Algorithm

Microsoft Time Series is used to address a common business problem: accurate forecasting. This algorithm is often used to predict future values, such as rates of sale for a particular product. Most often the inputs are continuous values. To use this algorithm, your source data must contain at least one column marked as Key Time. Any predictable columns must be of type
Continuous. You can select one or more inputs as predictable columns when using this algorithm. Time series source data can also contain an optional Key Sequence column.

New in SQL Server 2008 for the Microsoft Time Series algorithm is an additional algorithm inside the time series—the Auto Regressive Integrated Moving Average (ARIMA) algorithm, which is used for long-term prediction. In SQL Server 2005 the Microsoft Time Series algorithm used only Auto Regression Trees with Cross Predict (ARTxp), which is more effective at predicting the next step in a series, up to a maximum of 5 to 10 future steps. Past that point, ARIMA performs better. Also new for 2008 is the ability to configure custom blending of the two types of time series algorithms (this requires the Enterprise edition). You do this by configuring a custom value for the PREDICTION_SMOOTHING parameter. In the Standard edition of SQL Server 2008, both time series algorithms are used and the results are automatically blended by default. In the Standard edition you can choose to use one of the two included algorithms rather than the default blended result; however, you cannot tune the blending variables (as you can with the Enterprise edition). Note that the FORECAST_METHOD parameter shows you which algorithms are being used.

When you are working with the Microsoft Time Series algorithm, another important consideration is the appropriate detection of seasonal patterns. You should understand the following configurable parameters when considering seasonality (a hedged DMX sketch showing these parameters follows Figure 12-18):

■■ AUTO_DETECT_PERIODICITY Lowering the default value of 0.6 (the range is 0 to 1.0) reduces model training (processing) time because periodicity is then detected only for strongly periodic data.

■■ PERIODICITY_HINT Here you can provide one or more values to give the algorithm a hint about the natural periodicity of the data. You can see in Figure 12-18 that the value 12 has been supplied. For example, if the periodicity of the data is yearly and quarterly (in months), the setting should be {3, 12}.
Figure 12-18 The FORECAST_METHOD parameter allows you to configure the type of time series algorithm (requires Enterprise edition).
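If you prefer to script the model rather than use the Algorithm Parameters dialog box, a hedged DMX sketch might set these parameters as follows. The model, column, and parameter values are illustrative assumptions, not the shipped sample.

-- Hypothetical time series model with a periodicity hint of 12 (monthly data, yearly cycle)
CREATE MINING MODEL [Bike Sales Forecast]
(
    [Reporting Date] DATE KEY TIME,
    [Model Region]   TEXT KEY,
    [Quantity]       LONG CONTINUOUS PREDICT,
    [Amount]         DOUBLE CONTINUOUS PREDICT
)
USING Microsoft_Time_Series
(
    PERIODICITY_HINT = '{12}',        -- natural periodicity of the series
    AUTO_DETECT_PERIODICITY = 0.6,    -- default sensitivity
    FORECAST_METHOD = 'MIXED'         -- blend ARTxp and ARIMA (the default)
)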
The Microsoft Time Series Viewer helps you understand the output of your model. It shows the predicted values over the configured time series. You can also configure the number of prediction steps and whether you'd like to show deviations, as shown in Figure 12-19. The default view using the sample Forecasting data mining structure (and Forecasting mining model) shows only a subset of products in the Charts view. If you want to view the forecast values for other bicycle models, simply select those models from the drop-down list on the right side of the view. Note also that this sample model includes two predictable columns, Amount and Quantity, so each output series (bicycle model) has two predicted values.
Figure 12-19 The Charts view lets you see the selected, predicted values over the time series.
In our view we've selected the two predicted output values, Amount and Quantity, for a single product (M200 Europe). We've also selected Show Deviations to add those values to our chart view. (Deviations are the same thing as outliers.) When you pause your mouse on any point on the output lines on the chart, a tooltip displays more detailed information about that value.

The other view included for mining models built using the Microsoft Time Series algorithm is the Model view. It looks a bit like the Decision Tree view used to view mining models built with the Microsoft Decision Trees algorithm. However, although time series models in the Model view (Figure 12-20) have nodes (as in the Decision Tree view), the Mining Legend window shows information specific to this algorithm—namely coefficients,
histograms, and tree node equations. You have access to this advanced information so that you can better understand the method by which the predictions have been made.
Figure 12-20 The Model view allows you to see the coefficient, histogram, and equation values for each node in the result set.
Closer examination of the values in the Mining Legend window shows that in addition to the actual equations, a label lists which of the two possible time-based algorithms (ARIMA or ARTxp) was used to perform this particular calculation.
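Outside the viewers, you can also retrieve forecasts directly with a DMX query. The sketch below assumes the sample Forecasting model described above and asks for the next six predicted values per series; treat the exact model and column names as assumptions about your environment.

-- Forecast the next 6 time slices for each series in the (assumed) Forecasting model
SELECT
    [Model Region],
    PredictTimeSeries([Quantity], 6) AS [Forecast Quantity]
FROM [Forecasting]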
Microsoft Clustering Algorithm

As its name indicates, the Microsoft Clustering algorithm focuses on showing you meaningful groupings in your source data. Unlike Naïve Bayes, which requires discrete content inputs and considers all input attributes to be of equal weight, Microsoft Clustering allows for more flexibility in input types and grouping methodologies. You can use more content types as input, and you can configure the method used to create the groups, or clusters. We'll dive into more detail in the next section.

Microsoft Clustering separates your data into intelligent groupings. As we mentioned in the previous paragraph, you can use Continuous, Discrete, and most other content types. You can optionally supply a predictable value by marking it as Predict Only. Be aware that
Microsoft Clustering is generally not used for prediction—you use it to find natural groupings in your data.

When using the Microsoft Clustering algorithm, it is important to understand the types of clustering available to you, which are called hard and soft clustering. You can configure the CLUSTERING_METHOD parameter using the properties available for this algorithm, as shown in Figure 12-21. The choices are Scalable EM (Expectation Maximization), Non-Scalable EM, Scalable K-Means, and Non-Scalable K-Means. The default is Scalable EM, chosen mostly for its low performance overhead. K-Means clustering is considered hard clustering because it creates buckets (groupings) and then assigns each case to exactly one bucket, with no overlap. EM clustering takes the opposite approach—overlaps are allowed—and is therefore sometimes called soft clustering.

The scalable portion of the selected clustering method refers to whether a subset of the source data or the entire set of source data is used to process the algorithm. For example, a maximum of 50,000 source rows is used to initially process Scalable EM. If that sample is sufficient for the algorithm to produce meaningful results, the remaining rows are ignored. In contrast, Non-Scalable EM loads the entire dataset at initial processing. Performance can be up to three times faster for Scalable EM than for Non-Scalable EM. Scalability works identically for K-Means: the Scalable version loads the first 50,000 rows and loads subsequent rows only if meaningful results are not produced by the first pass.

You use the CLUSTERING_METHOD parameter to adjust the clustering method. In Figure 12-21, notice that the default value is 1. The possible configuration values are 1 through 4: 1, Scalable EM; 2, Non-Scalable EM; 3, Scalable K-Means; 4, Non-Scalable K-Means.
Figure 12-21 You can choose from four different methods of implementing the clustering algorithm. Configure your choice by using the CLUSTERING_METHOD parameter.
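For reference, here is a hedged DMX sketch that adds a K-Means clustering model to an existing structure; the structure and column names are hypothetical.

-- Hypothetical clustering model using Scalable K-Means (CLUSTERING_METHOD = 3)
ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM Clustering KMeans]
(
    [Customer Key],
    [Age],
    [Commute Distance],
    [Bike Buyer] PREDICT_ONLY   -- optional; clustering is rarely used for prediction
)
USING Microsoft_Clustering (CLUSTERING_METHOD = 3, CLUSTER_COUNT = 8)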
After you’ve created and tuned your model, you can use the Microsoft Cluster Viewer to better understand the clusters that have been created. The four types of Microsoft Cluster views are: Cluster Diagram, Cluster Profiles, Cluster Characteristics, and Cluster Discrimination. Looking first at the Cluster Diagram view, you’ll notice that the default shading variable is set to the entire population. Because of this, the State value is not available. Usually we adjust the Shading Variable value to something other than the default, such as Bike Buyer, which is
388
Part II
Microsoft SQL Server 2008 Analysis Services for Developers
shown in Figure 12-22. We’ve also set the State value to 1. This way we can use the results of the Microsoft Clustering algorithm to understand the characteristics associated with bicycle purchasers. We also show the result of Drill Through on Cluster 1. This time we limited our results to model columns only.
Figure 12-22 The Cluster Diagram view shows information about variables and clusters, including Drill Through.
As you take a closer look at this view, you'll see that you can get more information about each cluster's characteristics by hovering over it. You can rename clusters by right-clicking them; it is common to rename cluster nodes to improve the usability of the view. For example, you might use a cluster node name such as Favors Water Bottles. You can view drillthrough information for a cluster by right-clicking it and then clicking Drill Through. Figure 12-22 shows the Cluster Diagram view with the Drill Through results for Cluster 1. Notice the (now familiar) dependency strength slider to the left of this view.

The other three views—Cluster Profiles, Cluster Characteristics, and Cluster Discrimination—are quite similar to the corresponding views for the Naïve Bayes algorithm (Attribute Profiles, for example). Do you understand the key differences between Naïve Bayes and Microsoft Clustering? To reiterate: with Microsoft Clustering you have more flexibility in source data types and in configuring the method of grouping (EM or K-Means, Scalable or Non-Scalable). For these reasons Microsoft Clustering is used for similar types of tasks, such as finding groupings in source data, but it is more often used later in data mining project life cycles than Naïve Bayes.
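Beyond the viewers, you can ask a processed clustering model which cluster a new case falls into. The following DMX sketch uses a singleton prediction join; the model name, column names, and input values are assumptions for illustration only.

-- Which cluster would this hypothetical customer belong to, and how strongly?
SELECT
    Cluster()            AS [Assigned Cluster],
    ClusterProbability() AS [Probability]
FROM [TM Clustering KMeans]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age], '0-1 Miles' AS [Commute Distance]) AS t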
Tip Microsoft Clustering is sometimes used to prepare or investigate source data before implementing Microsoft Decision Trees. The results of clustering (clusters) are also sometimes used as input to separate Microsoft Decision Trees models. This is because this type of pre-separated input reduces the size of the resulting decision tree, making it more readable and useful.
Microsoft Sequence Clustering

Microsoft Sequence Clustering produces output similar to that of (regular) Microsoft Clustering, with one important addition: it analyzes the transitions between states. In other words, it detects clusters, but clusters of a particular type—clusters of sequenced data. To implement this algorithm you must mark at least one source column with the Key Sequence content type, and this key sequence column must be in a nested table. If your source data structure does not include appropriately typed source data, this algorithm will not be available in the drop-down list when you create or add new mining models. One example use is click-stream analysis of navigation through the pages of a Web site. (Click-stream analysis refers to which pages were accessed, and in what order, by visitors to one or more Web sites.)

The Microsoft Sequence Clustering algorithm uses the EM (Expectation Maximization) method of clustering. Rather than just counting to find associations, this algorithm determines the distance between all possible sequences in the source data and uses this information to create clusters that rank the discovered sequences. An interesting configurable parameter for this algorithm is CLUSTER_COUNT, which allows you to set the number of clusters that the algorithm builds; the default is 10. Figure 12-23 shows this property. Another parameter is MAXIMUM_SEQUENCE_STATES. If you are using the Enterprise edition of SQL Server 2008, you can adjust the default of 64 to any number between 2 and 65,535 (shown in the Range column in Figure 12-23) to configure the maximum number of sequence states that the algorithm will output.
Figure 12-23 You can adjust the maximum number of sequence states if using the Enterprise edition of SQL Server 2008.
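Because the Key Sequence column must live in a nested table, the required source shape is easiest to see in DMX. The sketch below is hypothetical (illustrative table and column names), not one of the shipped samples.

-- Hypothetical sequence clustering model over Web click-stream data
CREATE MINING MODEL [Site Navigation Clusters]
(
    [Visitor Key] LONG KEY,
    [Region]      TEXT DISCRETE,
    [Page Visits] TABLE
    (
        [Visit Order] LONG KEY SEQUENCE,   -- position of the page within the visit
        [Page]        TEXT DISCRETE PREDICT
    )
)
USING Microsoft_Sequence_Clustering (CLUSTER_COUNT = 10, MAXIMUM_SEQUENCE_STATES = 64)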
Five different views are included as part of the Microsoft Sequence Cluster Viewer for you to use to see the output of the Microsoft Sequence Clustering algorithm: Cluster Diagram, Cluster Profiles, Cluster Characteristics, Cluster Discrimination, and State Transitions. The first four views function quite similarly to those found in the Microsoft Cluster Viewer. Using
the fifth view, State Transitions, you can look at the state transitions for any selected cluster. Each square (or node) represents a state of the model, such as Water Bottle. Lines represent the transitions between states, and each line is weighted by the probability of that transition. The background color represents the frequency of the node in the cluster. As with the other cluster views, the default display is for the entire population. You can adjust this view, as we have in Figure 12-24, by selecting a particular cluster value (in our case, Cluster 1) from the drop-down list. The number displayed next to a node represents the probability of reaching the associated node.
Figure 12-24 The State Transitions view helps you visualize the transitions between states for each cluster in your model.
The next algorithm we’ll examine is second in popularity only to the Microsoft Time Series algorithm. This is the Microsoft Association algorithm. It is used to analyze groups of items, called itemsets, which show associative properties. In our real-world experience with implementing data mining solutions, we’ve actually used this algorithm more than any of the other eight available.
Microsoft Association Algorithm

As we just mentioned, Microsoft Association produces itemsets, or groups of related items from the source attribute columns. It creates and assigns rules to these itemsets; these rules rank the probability that the items in an itemset occur together. This is often called market-basket (or shopping-basket) analysis. Source data for Microsoft Association takes the format of a case table and a nested table. The source data can have only one predictable value, typically the key column of the nested table. All input columns must be of type Discrete.

If you are impressed with the power of this algorithm, you are not alone. As mentioned, we've used it with almost every customer who has implemented data mining as part of their BI solution. From brick-and-mortar retailers who want to discover which items, when physically placed together, would result in better sales, to online merchants who want to suggest that you might like product x because you bought product y, we use this algorithm very frequently with customers who want to improve rates of sale for their products. Of course this isn't the only use for this algorithm, but it is the business scenario we use it for most frequently.

One consideration in using Microsoft Association is that its power doesn't come cheaply. Processing source data and discovering itemsets and the rules between them is computationally expensive, so change default parameter values with care when using this algorithm. Microsoft Association has several configurable parameters. As with some of the other algorithms, the ability to adjust some of these parameters depends on the edition of SQL Server that you are using; several values require the Enterprise edition. The configurable parameters include the ability to adjust the maximum size of the discovered itemsets, which is set to 3 by default. These parameters are shown in Figure 12-25.
Figure 12-25 You can adjust the size of the itemsets via the MAXIMUM_ITEMSET_SIZE parameter when using the Microsoft Association algorithm.
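In DMX, the case-plus-nested-table shape and the parameters just described come together roughly as follows. The structure, table, and column names are hypothetical and the parameter values are only examples.

-- Hypothetical market-basket model; the nested Products table is the predictable column
ALTER MINING STRUCTURE [Market Basket]
ADD MINING MODEL [MB Association]
(
    [Order Number],
    [Products] PREDICT
    (
        [Model]
    )
)
USING Microsoft_Association_Rules (MINIMUM_SUPPORT = 0.03, MAXIMUM_ITEMSET_SIZE = 3)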
Three types of views are available in the Microsoft Association Rules Viewer for you to review the results of the Microsoft Association algorithm: the Rules, Itemsets, and Dependency Network views. The Rules view shows a list of the rules that have been generated by the algorithm. These rules are displayed by default in order of probability of occurrence. You can,
of course, change the sort order by clicking any of the column headers in this view. You can also adjust the view to show rules with different minimum probability or minimum importance values than the configured defaults by changing the values of those controls.

The Itemsets view allows you to take a closer look at the itemsets that the algorithm has discovered. We've changed a couple of the default view settings in Figure 12-26 so that you can more easily understand the data. We've selected Show Attribute Name Only in the Show drop-down list, and then clicked the Size column header to sort by the number of items in each itemset. You can see that the top single-item itemset is the Sport-100, with 6,171 cases. As with the Rules view, you can adjust the minimum support in this view; you can also adjust the minimum itemset size and the maximum number of rows displayed.
Figure 12-26 The Itemsets view allows you to see the itemsets that were discovered by the Microsoft Association algorithm.
You can also manually apply a filter to the Rules or Itemsets view. To do this, type the filter value in the Filter Itemset box; an example for this particular sample would be Mountain-200 = Existing. After you type a filter, press Enter to refresh the view and see the filter applied.

Although we find value in both the Rules and Itemsets views, we most often use the Dependency Network view to better understand the results of the Microsoft Association algorithm. This view shows relationships between items and allows you to adjust the view by
dragging the slider to the left of the view up or down. After you click a particular node, color coding indicates whether the node is predicted (by another node) or predictive (of another node). This view is shown in Figure 12-27.
Figure 12-27 The Dependency Network view for the Microsoft Association algorithm allows you to review itemsets from the results of your mining model.
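The same rules that the viewer displays can also drive recommendations from a DMX query. The sketch below asks the hypothetical model defined earlier for the top three items to suggest alongside a water bottle; the model, nested table, and item names are illustrative assumptions.

-- Recommend 3 related items for a basket that already contains a Water Bottle
SELECT PredictAssociation([MB Association].[Products], 3)
FROM [MB Association]
NATURAL PREDICTION JOIN
(SELECT (SELECT 'Water Bottle' AS [Model]) AS [Products]) AS t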
Tip The SQL Server 2008 Data Mining Add-ins for Excel 2007 include a new capability for working with the Microsoft Association algorithm using the Table Analysis Tools on the Excel 2007 Ribbon. This algorithm is used by the new Shopping Basket Analysis functionality available with these tools. The Shopping Basket Analysis button generates two new types of reports (views) for the output produced: the Shopping Basket Bundled Item and Shopping Basket Recommendations reports. We will review this in more detail in Chapter 13.

The next algorithm we look at is the most powerful and complex. We haven't often used it in production situations because of these factors. However, we are familiar with enterprise customers who do need the power of the Microsoft Neural Network algorithm, so we'll take a closer look at it next. This algorithm also contains the Microsoft Logistic Regression functionality, so we'll actually be taking a look at both algorithms in the next section.
Microsoft Neural Network Algorithm

Microsoft Neural Network is by far the most powerful and complex algorithm. To glimpse the complexity, you can simply take a look at the SQL Server Books Online description of the algorithm: "This algorithm creates classification and regression mining models by constructing a Multilayer Perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm, the Microsoft Neural Network algorithm calculates probabilities for each possible state of the input attribute when given each state of the predictable attribute. You can later use these probabilities to predict an outcome of the predicted attribute, based on the input attributes." If you're thinking "Huh?," you're not alone.

When do you use this algorithm? It is recommended when other algorithms fail to produce meaningful results, such as those measured by lift chart output. We often use Microsoft Neural Network as a kind of last resort, when dealing with large and complex datasets that fail to produce meaningful results when processed using other algorithms. This algorithm can accept the Discrete or Continuous content types as input. For a fuller understanding of how these types are processed, see the SQL Server Books Online topic "Microsoft Neural Network Algorithm Technical Reference." Using Microsoft Neural Network against large data sources should always be well tested using near-production-level loads because of the amount of overhead needed to process these types of models.

As with the other algorithms, several parameters are configurable using the Algorithm Parameters dialog box in SQL Server 2008. As with the other algorithms that require extensive processing overhead, you should change the default values only if you have a business reason to do so. These parameters are shown in Figure 12-28.
Figure 12-28 You can change the HIDDEN_NODE_RATIO default value of 4.0 if you are working with the Enterprise edition of SQL Server 2008.
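A hedged DMX sketch of adding a neural network model that sets HIDDEN_NODE_RATIO explicitly follows; the structure and column names are hypothetical.

-- Hypothetical neural network model; HIDDEN_NODE_RATIO controls hidden-layer size
ALTER MINING STRUCTURE [Targeted Mailing]
ADD MINING MODEL [TM Neural Net]
(
    [Customer Key],
    [Commute Distance],
    [English Education],
    [English Occupation],
    [Bike Buyer] PREDICT
)
USING Microsoft_Neural_Network (HIDDEN_NODE_RATIO = 4.0)  -- the default; changing it requires Enterprise edition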
The Microsoft Neural Network Viewer contains only one view for this algorithm. The view, shown in Figure 12-29, does allow you to add input attribute filters as well as adjust the output attribute that is shown. For this figure, we've added three input attribute filters related to Commute Distance, Education, and Occupation. When you pause your mouse on the bars, you'll see a tooltip with more detailed information about the score, probability, and lift related to the selected value.
Figure 12-29 Only one viewer is specific to the Microsoft Neural Network algorithm.
A variant of the Microsoft Neural Network algorithm is the Microsoft Logistic Regression algorithm. We’ll take a look at how it works next.
Microsoft Logistic Regression

Microsoft Logistic Regression is a variant of the Microsoft Neural Network algorithm in which the HIDDEN_NODE_RATIO parameter is set to 0. Logistic regression itself is a variation of linear regression used when the dependent variable is a dichotomy, such as success/failure. It is interesting to note that the configurable parameters are identical to those of the Microsoft Neural Network algorithm, with the exception of HIDDEN_NODE_RATIO, which is removed because it is always 0 in this case. Figure 12-30 shows the configurable parameters for this algorithm.
Figure 12-30 The Microsoft Logistic Regression algorithm contains six configurable parameters.
Because this algorithm is a variant of Microsoft Neural Network, the viewer for it is the same as the one for that algorithm.
Tip The SQL Server 2008 Data Mining Add-ins for Excel 2007 include a new capability for working with the Microsoft Logistic Regression algorithm using the Table Analysis Tools on the Excel 2007 Ribbon. This algorithm is used by the new Prediction Calculator functionality available with these tools, which generates output that can be used offline to calculate predictions using information from the processed data mining model. This activity is described in more detail in Chapter 25, "SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007."
The Art of Data Mining

Feeling overwhelmed? As you come to understand the amount of power included out of the box in SSAS data mining, that is not an uncommon reaction. If you are completely new to data mining, remember our earlier tip: rather than starting with BIDS, use the Table Analysis Tools and Data Mining tab on the Excel 2007 Ribbon to familiarize yourself with the functionality of the algorithms. We devote an entire chapter to this very topic (Chapter 23) because of the importance of these Ribbon interfaces not only for end users, but also for BI developers. We've often found that the best way to get a feel for exactly what each of these algorithms does is to use each one on a small set of sample data, which can simply originate in Excel.

Selecting the appropriate algorithm is a key part of successfully implementing SSAS data mining. One key difference between designing OLAP cubes and designing data mining structures and models is that the latter are much more pliable. It is common to tinker when implementing data mining models by trying out different algorithms, adjusting the dataset used, and tuning the algorithm by adjusting the parameter values. Part of this development cycle includes using the mining model validation visualizers and then continuing to adjust the models based on the results of the validators. This is not to imply a lack of precision in the algorithms, but rather to remind you of the original purpose of using data mining—to gain understanding about relationships and patterns in data that is unfamiliar to you.
Summary

We've completed a deep dive into the world of data mining algorithm capabilities. In this chapter we discussed preparatory steps for implementing SQL Server 2008 data mining. These steps included understanding content types and usage attributes for source data, as well as data mining concepts such as case and nested tables. We then took a closer look at the BIDS interface for data mining, exploring each of the tabs. Next we discussed how each of the nine included algorithms works—what they do, what they are used for, and how to adjust some default values.
Because we've covered so much ground already, we are going to break implementation information into a separate chapter. So, take a break (you've earned it!), and after you are refreshed, join us in the next chapter, where we implement the CRISP-DM life cycle (a proven software development life cycle for implementing BI projects using data mining technologies) in BIDS. This includes creating, validating, tuning, and deploying data mining models. We'll also examine DMX and take a look at data mining capabilities in SSIS.
Chapter 13
Implementing Data Mining Structures

In this chapter, we implement the Cross Industry Standard Process for Data Mining (CRISP-DM) life cycle to design, develop, build, and deploy data mining structures and models using the tools provided in Microsoft SQL Server 2008 Analysis Services (SSAS). We focus on using the Business Intelligence Development Studio (BIDS) to create mining structures and models. After we create a model or two, we explore the functionality built into BIDS to validate and query those models, take a look at the Data Mining Extensions (DMX) language, and review data mining integration with Microsoft SQL Server 2008 Integration Services (SSIS). To complete our look at data mining, we discuss concerns related to model maintenance and security.
Implementing the CRISP-DM Life Cycle Model

In Chapter 12, "Understanding Data Mining Structures," we looked at the data mining algorithms included in SSAS. This chapter focuses on the implementation of these algorithms in business intelligence (BI) projects by using specific data mining models. We also introduce a software development life cycle model that we use for data mining projects: the CRISP-DM life cycle model. As we move beyond understanding SSAS algorithms and data mining capabilities to building and validating actual data mining models, we refer to the phases of the CRISP-DM model, which are shown in Figure 13-1.

Note Some Microsoft data mining products, notably the Data Mining tab of the Ribbon in Microsoft Office Excel 2007, have been built specifically to support the phases that make up the CRISP-DM model. We talk more about the integration of Excel with Analysis Services data mining in Part IV. In this chapter, the focus is on working with mining models within BIDS. We also cover integration with Microsoft SQL Server 2008 Reporting Services, Excel, and other client tools in Part IV.
As previously mentioned, we've done the work already (in Chapter 12) to understand and implement the first three phases of the CRISP model. There we covered business understanding by discussing the particular capabilities of the nine included data mining algorithms as they relate to business problems. To review briefly, the nine included data mining algorithms are as follows:

■■ Microsoft Naïve Bayes algorithm A very general algorithm, often used first to make sense of data groupings
■■ Microsoft Association algorithm Used for market basket analysis—that is, a "what goes with what" analysis—or for producing itemsets

■■ Microsoft Sequence Clustering algorithm Used for sequence analysis, such as Web site navigation click-stream analysis

■■ Microsoft Time Series algorithm Used to forecast future values over a certain time period

■■ Microsoft Neural Network algorithm A very computationally intense algorithm; used last, when other algorithms fail to produce meaningful results

■■ Microsoft Logistic Regression algorithm A more flexible algorithm than spreadsheet logistic regression; used as an alternative to that

■■ Microsoft Decision Trees algorithm Used in decision support; the most frequently used algorithm

■■ Microsoft Linear Regression algorithm A variation of decision trees (See Chapter 12 for details.)

■■ Microsoft Clustering algorithm A general grouping; more specific than Naïve Bayes; used to find groups of related attributes

Figure 13-1 Phases of the CRISP-DM process model, the standard life cycle model for data mining: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment
Data understanding and data preparation go hand in hand, and they can be performed only after a particular algorithm has been selected. You’ll recall that this is because algorithms have particular requirements related to source data, such as use of Naïve Bayes requiring that
all source data be of the Discrete content type. Data preparation also includes extract, transform, and load (ETL) processes to clean, validate, and sometimes also pre-aggregate source data.

We're frequently asked, "Should I create OLAP cubes first, and then implement data mining using data from those cubes? Or is data mining best implemented using relational source data?" Although there's no single right answer to these questions, we generally prefer to implement data mining after OLAP cubes have been created. The simple and practical reason for this is that, in our experience, most relational source data needs cleansing and validation before it is used for any type of BI task. Also, OLAP cubes provide aggregated data, and loading smaller amounts of cleaner data generally produces more meaningful results.

Because we already reviewed the nine mining algorithms in detail in Chapter 12 and you understand the preparatory steps, you're ready to create mining models. You can do this inside a previously created mining structure, or you can create a new, empty mining structure in BIDS. Keep in mind that to create a mining structure, at a minimum, you first have to define a data source and a data source view. You can also choose to build an OLAP cube, but this last step is not required. After you've completed the preparatory steps just listed, right-click on the Data Mining folder in Solution Explorer in BIDS, and choose New Mining Structure. Doing this opens the Data Mining Wizard, which allows you to create a data mining structure with no mining models (if you don't have an existing structure you want to work with). In the following sections, we start by creating an empty mining structure and then add a couple of mining models to that structure.
Building Data Mining Structures Using BIDS

Creating an empty data mining structure might seem like a waste of time. However, we find this method of working to be efficient. As you're aware by now, source data for data mining models requires particular structures (that is, data, content, and usage types), and these structures vary by algorithm. Creating the structure first helps us to focus on including the appropriate source data and content types. Also, you might remember that drillthrough can be implemented against all columns in the structure, whether or not they are included in the model. Creating a structure first and then defining subsets of attributes (columns) included in the structure results in more compact and effective data mining models.

To create a data mining structure, open BIDS and create a new Analysis Services project. Create a data source and a data source view. Then right-click on the Data Mining Structures folder in the Solution Explorer window in BIDS. As you work through the pages in the Data Mining Wizard to create an empty data mining structure, consider these questions:

■■ What type of source data will you include: relational (from a data source view) or multidimensional (from a cube in your project)? Using this wizard, you can't combine the two types of source data; you must choose one or the other. You can, of course, include
more than one table or dimension from the particular selected source. In this chapter, as in the others, we'll use sample data from the SQL Server 2008 sample databases or OLAP cubes.

■■ If you're creating an empty data mining structure, indicate that on the second page of the wizard. Otherwise, you must select one algorithm that will be used to create a data mining model in your new structure. As mentioned, at this point we'll just create an empty structure, and later we'll add a couple of mining models to it.

■■ If you choose relational source data, you must decide which table will be the case table. Will you include a nested table? Which columns will you include? Or, if you choose multidimensional data, which dimensions, attributes, and facts will you include? During this part of the wizard, the key column for each source table is detected. You can adjust the automatically detected value if needed.

■■ What are the content and data types of the included columns or attributes? BIDS attempts to automatically detect this information as well. When you're working in the pages of the wizard, you can make changes to the values that BIDS supplies. You can also rename source columns or attributes. Figure 13-2 shows this page. You'll see that the values available for content types on this page are Continuous, Discrete, or Discretized for nonkey columns. For key columns, the content type choices are Key, Key Sequence, or Key Time. As we discussed in Chapter 12, this is a subset of all available content types. To mark a source column (or attribute) with a content type not available in the wizard, you change the value in the property sheet for that attribute (after you complete the wizard).
Figure 13-2 The BIDS Data Mining Wizard detects content and data types in your source data.
The data mining data types—Text, Long, Boolean, Double, and Date—are also automatically detected by the wizard in BIDS.

■■ How much test (holdout) data would you like to use? New for 2008 is the ability to create holdout test sets during structure or model creation via the wizard. This capability allows a validation function to determine the predictive accuracy of a model by using the training set to train the model and then making predictions against the test set. Usually, around 30 percent (the default value) of the source data is held out (meaning not included in the training set). You can choose to have this configured automatically, to configure it via the Data Mining Wizard, or to configure it programmatically using DMX, Analysis Management Objects (AMO), or XML DDL. Figure 13-3 shows the page in the Data Mining Wizard where you can configure a portion of your source data to be created as a testing set.
Figure 13-3 The BIDS Data Mining Wizard allows you to easily create testing datasets.
There are two ways you can specify a testing set using this wizard: you can accept (or change) the default percentage value, or you can fill in a maximum number of cases for the testing dataset. If you fill in both values, the value that produces the smaller result is used to create the actual test set. You can alternatively configure the values used to produce test sets in the properties of the data mining structure. These properties are HoldoutMaxCases, HoldoutMaxPercent, and HoldoutSeed. The last property is used to produce comparably sized test sets when multiple models are created using the same data source. Note that the Microsoft Time Series algorithm does not support automatic creation of test sets.
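As noted above, the holdout settings can also be configured programmatically. The following hedged DMX sketch (hypothetical structure and column names) illustrates the WITH HOLDOUT clause.

-- Hypothetical structure with 30 percent (or at most 10,000 cases) held out for testing
CREATE MINING STRUCTURE [Targeted Mailing Structure]
(
    [Customer Key]     LONG KEY,
    [Age]              LONG CONTINUOUS,
    [Commute Distance] TEXT DISCRETE,
    [Bike Buyer]       LONG DISCRETE
)
WITH HOLDOUT (30 PERCENT OR 10000 CASES) REPEATABLE(42)  -- seed makes the split repeatable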
Automatic creation of test sets is a welcome new feature, particularly because it works not only with simple (single table or dimension) sources, but also with structures that contain nested tables. You can just click through and name your new data mining structure. If you make a selection in the wizard that results in an error during mining structure processing, the wizard prompts you with an informational dialog box. The error messages are specific and thoughtfully written. This is yet another way that Microsoft has made it easier for you to successfully build mining structures, even if you are a novice. The flexibility of the Data Mining Wizard and the resulting data mining structure is quite convenient. As we've mentioned repeatedly, it's quite common to adjust content types, add or remove attributes from a structure, and otherwise tune and tinker as the data mining process continues. Of course, if you attempt to create either an invalid structure or model, you can simply click the Back button in the wizard, fix the error, and then attempt to save the new object again. The next step in the process after creating a data mining structure is to add one or more mining models to it. We'll do that next.
Adding Data Mining Models Using BIDS

After you've created a data mining structure, you'll want to add one or more mining models to it. The steps to do this are quite trivial, as long as you thoroughly understand which data mining algorithm to select and what the required structure of the source data is for that algorithm. You can right-click inside the designer surface in BIDS when the Mining Structure or Mining Models tab is selected and then click New Mining Model. You simply provide a model name and select the algorithm. BIDS then adds this model to your mining structure.

After you've added a mining model based on a particular algorithm, you might want to modify the usage of the attributes. The values are auto-generated by BIDS during data mining model creation, but you can easily change them by selecting a new value in the drop-down list of the Mining Model designer. The values you can select from are Ignore, Input, Key, Predict, or PredictOnly. You might also want to configure some of the properties that are common to all types of data mining models, or some of the algorithm parameters, to further customize your mining model. We covered the algorithm-specific parameters, algorithm by algorithm, in the previous chapter. Common properties for mining models include the object name, whether or not drillthrough is allowed, and the collation settings. You access these properties by right-clicking the mining model on the Mining Models tab in BIDS and then clicking Properties. This opens the Properties window.

In addition to configuring properties and parameters, you might want to reduce the size of your model by specifying a filter. This is a new feature introduced in SQL Server 2008. You can do this on the Mining Models tab by right-clicking the model you want to filter and then
clicking Set Model Filter. This opens the dialog box shown in Figure 13-4. There you can specify a query to filter (or partition) your mining source data. An example of a business use of this feature is to segment source data based on age groupings.
Figure 13-4 New in SQL Server 2008 is the ability to filter source data for data mining models using BIDS.
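The same kind of filter can also be expressed in DMX when you add a model to an existing structure. This is a sketch under assumed names, not the syntax that the dialog box generates.

-- Hypothetical filtered model: train only on customers older than 40
ALTER MINING STRUCTURE [Targeted Mailing Structure]
ADD MINING MODEL [TM Decision Tree Over 40]
(
    [Customer Key],
    [Age],
    [Commute Distance],
    [Bike Buyer] PREDICT
)
USING Microsoft_Decision_Trees
WITH FILTER([Age] > 40)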
Using filters allows you to create smaller mining models. These smaller models can be processed more quickly and can produce understandable, meaningful results for your business situation sooner. You can also configure filters by using the Properties window for your mining model.

Although creating most types of mining models is straightforward, some specific types of source data require particular considerations. One example of this situation occurs when you use multidimensional data rather than relational tables as source data. If you select From Existing Cube as the data source for your mining structure, you are presented with one additional page before you complete the Data Mining Wizard. This page allows you to define a particular slice (or subset) of the cube as the basis for your mining structure. Figure 13-5 shows an example of defining a slice in this way after selecting the Show Only Customers Who Are Home Owners option in the Data Mining Wizard.
Figure 13-5 The Data Mining Wizard allows you to define slices of the source cube.
Another example of a requirement specific to a particular source type occurs when you use the Data Mining Wizard to create a model based on an existing cube that contains a time series. In this situation, you need to slice the time dimension to remove all members that fall outside a certain range of time (assuming those members have already been loaded into your cube).

Still another difference in the wizard when you base your mining model on an existing cube (rather than relational source data) is that on the final page, shown in Figure 13-6, you're presented with two additional options: Create Mining Model Dimension and Create Cube Using Mining Model Dimension. If you select either of these options, you also need to confirm the default name for the new dimension or cube, or you can update that value to name the new cube or dimension as your business requirements dictate. The default naming conventions are _DMDim and _DM.
Figure 13-6 The Data Mining Wizard allows you to create a new dimension in your existing cube using the information in your mining model, or you can create an entirely new cube.
One interesting aspect of creating a new dimension, whether in the existing cube or in a new cube, is that a new data source view (DSV) is created representing the information in the model. Unlike most typical DSVs, this one cannot be edited, nor can the source data (the model output) be viewed via the DSV interface.
Processing Mining Models

After you've created a data mining structure that includes one or more data mining models, you'll want to use the included data mining viewers to look at the results of your work. Before you can do this, however, you have to process the data mining objects. As with multidimensional structure (OLAP cube) processing, your options for processing depend first on whether you're working in connected or disconnected mode. If you're building your initial structures and models and are working in disconnected mode, you'll be presented with the following options for creating your objects: Process, Build, Rebuild, and Deploy. These options function nearly the same way as they do when you're working with OLAP cubes. That is, Build (or Rebuild) simply validates the XMLA that you've created using the visual design tools. There is one difference between data mining models and OLAP cubes when using Build, though: there are no AMO design warnings built in for the former. If you've created a structure that violates rules, such as a case table with no key column, you receive an error when you attempt to build it. The violation is shown using a red squiggly line and the build will be unsuccessful.

After you've successfully built your data mining objects, you can process them. This copies to the SSAS server the XMLA that will create the data mining structures. After this step is complete, the data mining objects are populated with source data from the data source defined in the project. Process progress is shown in a dialog box of the same name. An example is shown in Figure 13-7.
Figure 13-7 The Process Progress dialog box shows you step-by-step process activities for data mining structures or models.
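Processing can also be expressed in DMX: an INSERT INTO statement trains (processes) a structure and the models it contains. The sketch below assumes a data source and relational view that may not match your environment; treat all names as placeholders.

-- Train the (hypothetical) structure and all of its models from a relational view
INSERT INTO MINING STRUCTURE [Targeted Mailing Structure]
([Customer Key], [Age], [Commute Distance], [Bike Buyer])
OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Age, CommuteDistance, BikeBuyer FROM dbo.vTargetMail')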
Again, as when you're building OLAP cubes, during the early development phases you'll tend to build, process, and deploy data mining objects using the default processing setting. This setting, Process Full, completely erases all existing data mining objects with the same name, then rebuilds them and repopulates them with source data. Although this technique is quick and easy during development, as you move your mining models into production you'll want to use more granular methods to process models, particularly when you're simply adding new data to an existing mining model. Processing methods for the structure are full, default, structure, clear structure, and unprocess. Processing methods for the model are full, default, and unprocess. We'll cover the methods available for such granular processing later in this chapter.

After you've processed and deployed your mining models to the server, you can use the Mining Model Viewer tab to look at your mining model results. We spent quite a bit of time reviewing the various mining model viewers in the previous chapter, and we're not going to repeat that information here—other than to remind you that we use the mining model viewers as a first step in validating our data mining models.

Another consideration regarding data mining object processing is error handling. You can configure several error-handling properties for each data mining structure by using the available settings in the Properties dialog box. To make these properties visible, you must change the ErrorConfiguration value from (default) to (custom). Figure 13-8 shows these values. They include many configurable properties relating to the handling of key and null errors. You'll recognize most of these from the dimension and fact custom error-handling configuration possibilities because, in fact, these properties are nearly the same. Also, we offer the same advice here as we did when working with data destined for OLAP cubes—that is, avoid loading nulls and key errors by performing thorough ETL prior to attempting to load your data mining models.
Figure 13-8 Custom error handling can be defined for each mining structure.
After you move to production, you might want to automate the processing of your mining models or structures. As with OLAP object processing, you can easily accomplish this by creating an SSIS package and choosing the Analysis Services Processing Task option. We’ll take a closer look at the integration between SSIS and SSAS data mining later in this chapter.
Now, however, we're going to return to the logical next step in the data mining system development life cycle—that is, model validation. To accomplish this task, you can start by using the tools included in BIDS. We'll start by exploring the Mining Accuracy Chart tab in BIDS.

Note In addition to using XMLA as a definition language for data mining objects, there is another data definition language available for creating some types of data mining objects. This language is called Predictive Model Markup Language (PMML). It's a standard developed by a working group of data mining vendors, including Microsoft. See SQL Server Books Online to understand which types of algorithms are supported.
Validating Mining Models

Although some of our clients have validated the usefulness of their data mining models simply by using the viewers available on the Mining Model Viewer tab to help them understand the results of the mining model algorithm, most consumers of data mining projects also want an accuracy check of the models that are produced. This is done to validate the output, to guide further customization of a particular model, or even to help determine which algorithms shed the most light on the particular business situations being addressed. You should be aware that the CRISP-DM software development life cycle includes a distinct phase dedicated to model evaluation because of the very large number of variables involved in successful data mining. These variables include the choice of data for inclusion, content types, inputs, predictability, choice of mining algorithm, and so on. We've mentioned this idea several times previously, but it bears repeating here. We find this inexactitude challenging for developers new to data mining to understand, accept, and work with. Because of the wide variety of choices involved in data mining, model validation is a key phase in all but the simplest data mining projects.

Let's take a look at the types of tools included in BIDS to assist with model validation. Each is represented by an embedded tab on the Mining Accuracy Chart tab: the lift (or profit) chart, the classification matrix, and the cross-validation tool. Also on the Mining Accuracy Chart tab is the preparatory Input Selection tab. This interface is shown in Figure 13-9. When you configure the test dataset and click any of the accuracy chart tabs, a DMX query is generated and the results are shown in graphical or tabular form. For example, in the case of a lift chart, the DMX query begins with CALL System.Microsoft.AnalysisServices.System.DataMining.AllOther.GenerateLiftTableUsingDataSource.

To use any of these validation tools, you must first configure the input. You do this by selecting the mining model or models from the selected structure that you want to analyze, and then selecting a predictable column name and (optionally) a predict value setting. The next portion of the input configuration involves selecting the source of the test data. Test data contains the right answer (or correct value for the predictions) and is used
to score the model during accuracy analysis. The default setting is to use the test dataset that you (optionally, automatically) created at the time of data mining structure creation. At the bottom of the Input Selection tab, you can change that default to use any dataset that is accessible through a data source view in your BIDS project as testing data. You can also optionally define a filter for manually selected testing datasets. After you’ve configured input values, you click on the Lift Chart tab to create a lift chart validation of your selected model or models.
Figure 13-9 The Input Selection tab is used to select the test data to use when verifying model accuracy.
To further explain the outputs of the Mining Accuracy Chart tab, you need to understand a bit more about lift charts and profit charts.
Lift Charts

A lift chart compares the accuracy of either all predictions or a specific prediction value for each model included in your structure to that of an average guess and also to that of a perfect prediction. Lift chart functionality is included in BIDS. The output of a lift chart is a line chart. This chart includes a line for average guessing (50 percent correct values) and a line for ideal results (100 percent correct values). The average guess line bisects the chart's center; the ideal line runs across the top of the chart.

Lift charts compare holdout data with mining model data. Holdout data contains all the attributes included in the mining model data, but, importantly, it also contains the actual value of the predictable attribute. For example, if a time series algorithm was predicting the rates of sale of particular bicycle models over a range of time, the holdout data would contain actual rates of sale for a subset of that same data. Another way to think of a lift chart is that it performs mining model accuracy validation. Useful data mining models score somewhere in the range above random guessing and below perfection; the closer a model scores to the ideal line, the more effective it is.

Lift charts can be used to validate a single mining model. They are also useful for validating multiple models at the same time. These models can be based on different algorithms, the
same algorithm with varied inputs (such as content types, attributes, filters, and so on), or a combination of these techniques.

To show the functionality of a lift chart, we'll work with the Targeted Mailing sample data mining structure included in the Adventure Works sample. To start, open the mining structure in BIDS and then select all included mining models as input to the lift chart. Also, accept the default, which is to use the holdout data that was captured when the model was first trained. You'll have to select a value for the predicted attribute; this is called PredictValue on the Input Selection tab. For our sample, set the value to 1, which means customers who purchased a bicycle.

Note If your sample model, Targeted Mailing, does not include holdout data, called mining model test cases or mining structure test cases on the Input Selection tab, select the Specify A Different Data Set option on that tab and then use the sample vTargetMail view included in Adventure Works DW 2008 as the holdout data.
After you click on the second tab, Lift Chart, in BIDS, you'll be able to review the output of the lift chart showing which customers will be most likely to purchase bicycles. You might use lift chart output to help you determine which mining models are most accurate in predicting this value, and you might then base business decisions on the most accurate mining model. A common scenario is targeting mailings to potential future customers—for example, determining which model or models most accurately associate attributes with bike-buying behavior. The lift chart output will help you to select those models and to take action based on the output.

A lift chart can be used to validate the accuracy of multiple models, each of which predicts discrete values. A random (or average) guess value is shown by a blue line in the middle of the chart; the perfect (or ideal) value is shown as a line at the top of the chart. The random guess is always equal to a 50 percent probability of correct results. The perfect value is always equal to 100 percent correct values. As shown in Figure 13-10, the random line always measures at 50 percent of the perfect value line. You might wonder where the percentage originates. It is the result of a DMX query that SSAS automatically generates after you configure the Input Selection tab of the Mining Accuracy Chart tab in BIDS. The (holdout) testing data is compared to the trained mining model data. If you want to see the exact query that SSAS generates, you can just run SQL Server Profiler to capture the DMX. This data contains the correct predicted value for the dataset—using our example, customers who actually purchased a bicycle.

It's important that you understand that a lift chart shows the probability of a prediction state and that it does not necessarily show the correctness of the prediction state or value. For example, it could validate that a model predicts either that someone would not buy a bicycle or that they would buy a bicycle. The chart shows the validity of only the attribute state or states that you configure it to evaluate.
Figure 13-10 A lift chart allows you to validate multiple mining models.
The output format of the lift chart you receive depends on the content types in the model or models used for input. If your model is based on continuous predictable attributes, the lift chart output is a scatter plot rather than a straight-line chart. Also, if your mining model is based on the Time Series algorithm, you must use an alternative method (namely, a manual DMX prediction query) rather than a lift chart to validate it. Another possibility is that you want to use a lift chart to validate a mining model that does not contain a predictive attribute. In this case, you get a straight-line chart. It simply shows the probability of correct selections for all possible values of the attribute—that is, both positive and negative values in the case of a discrete attribute.

The lift chart output includes a mining legend as well. There is one row for each mining model evaluated, as well as one row for the random guess and the best score (called an Ideal Model). There are three columns in this legend: Score, Target Population, and Predict Probability. The Score column shows the comparative merit for all the included models—higher numbers are preferred. For this example, the lift chart is showing that the mining model using the Microsoft Decision Trees algorithm is producing the highest score and is, therefore, the best predictor of the targeted value—that is, customers who will actually purchase a bicycle—of all of the mining models being compared.

The Target Population column shows how much of the population would be correctly predicted at the value selected on the chart—that is, "purchased bicycle" or "did not purchase bicycle." That value is indicated by the gray vertical line in Figure 13-10, which is set at the random guess value, or 50 percent of the population. In the same figure, you can see that
for the TM Decision Tree model, the target population captured (at around 50 percent of the total population) is around 72 percent, which would, of course, be around 36 percent of the entire possible population. The Predict Probability column shows the probability score needed for each prediction to capture the shown target population. Another way to think about this last column is in terms of selectivity. Using our example, the predict probability values are quite similar: 50 percent and 43 percent, respectively. But what if they were 40 percent and 70 percent? If that were the case, you'd have to decide whether you'd like to send promotional mail to potential future bicycle buyers even though over one-half of the recipients (that is, 60 percent) would not respond (in order to capture 72 percent of the likely buyers), or whether you'd rather waste fewer mailings (roughly one-third in the second model) in exchange for a lower return (around 60 percent). Suffice it to say that the three values in the legend are used in conjunction to make the best business decisions.
Profit Charts A profit chart displays the hypothetical increase in profit that is associated with using each model that you’ve chosen to validate. As mentioned, lift charts and profit charts can be used with all included algorithms except the Microsoft Time Series algorithm. You can view a profit chart by selecting it from the Chart Type drop-down list above the charts. You configure a profit chart by clicking the Profit Chart Settings button on the Lift Chart tab. This opens the dialog box shown in Figure 13-11.
FIGURE 13-11 The Profit Chart Settings dialog box allows you to configure values to determine profit thresholds.
The configurable values are as follows:
■■ Population  Total number of cases used to create the profit chart
■■ Fixed Cost  General costs associated with this particular business problem (for example, cost of machinery and supplies)
■■ Individual Cost  Individual item cost (for example, cost of printing and mailing each catalog)
■■ Revenue Per Individual  Estimated individual revenue from each new customer
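Although Books Online does not spell out the arithmetic, you can think of the projected profit at any point on the chart roughly as follows (a conceptual sketch, not the product's exact internal calculation):

    profit ≈ (responders reached × Revenue Per Individual)
             − Fixed Cost
             − (customers contacted × Individual Cost)

Here, responders reached is the number of actual buyers that fall within the contacted portion of the population (the quantity the lift measures), so revenue accrues only from correctly targeted customers, while the Individual Cost applies to everyone you contact and the Fixed Cost applies once.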
After you’ve completed entering these configuration values, BIDS produces a profit chart using a DMX query, just as is done when you create a lift chart using BIDS. The profit chart output contains both graphical and legend information. The graph shows the projected profit for the population based on the models you’ve included. You can also use the mining legend to help you evaluate the effectiveness of your models. The legend includes a row for each model. For the selected value on the chart, the profit chart legend shows what the projected profit amount is. It also shows the predict probability for that profit value. Higher numbers are more desirable here. You can see in Figure 13-12 that the data mining model that is forecasted to produce the largest profit from the Target Mailing data mining structure is the decision tree model. Specifically, at the value we’re examining (around 52 percent of the population), the projected profit for the decision tree model is greater than for any other mining model included in this assessment. The value projected is around $190,000.
FIGURE 13-12 Profit charts show you profit thresholds for one or more mining models.
As with the lift chart output, you should also consider the value of the predict probability. You'll note that in our example this value is around 50 percent for decision trees. It varies from a low of 42 percent for clustering to a high of 50 percent for decision trees. You'll recall that this tells you how likely it is that the first value—that is, the projected profit—will occur.
Classification Matrix
For many types of model validation, you'll be satisfied with the results of lift and profit charts. In some situations, however, you'll want even more detailed validation output. For example, when the cost of making an incorrect decision is high—say you're selling timeshare properties and offering potential buyers an expensive preview trip to a destination property—you'll need the detailed validation found in the classification matrix (sometimes called the confusion matrix in data mining literature). This matrix is designed to work with discrete predictable attributes only. It displays tabular results that show the predicted values and the actual values for one or more predictable attributes. These results are sometimes grouped into the following four categories: false positive, true positive, false negative, and true negative. So it's much more specific in functionality than either the lift chart or profit chart. It reports exact numbers for the four situations. As with lift charts, the testing data you configured on the Input Selection tab is used as a basis for producing results for the classification matrix.
The output of this validator is a matrix or table for each mining model validated. Each matrix shows the predicted values for the model (on rows) and the actual values (on columns). Looking at this table, you can see exactly how many times the model made an accurate prediction. The classification matrix analyzes the cases included in the model according to the value that was predicted, and it shows whether that value matches the actual value. Figure 13-13 shows the results for the first mining model in the group we're analyzing: the decision tree model. The first result cell, which contains the value 6434, indicates the number of true positives for the value 0 (zero). Because 0 indicates the customer did not purchase a bike, this statistic tells you that the model predicted the correct value for members of the data population who did not buy bikes in 6434 cases. The cell directly underneath that one, which contains the value 2918, tells you the number of false positives, or how many times the model predicted that someone would buy a bike when actually she did not. The cell that contains the value 2199 indicates the number of false negatives for the value 1. Because 1 means that the customer did purchase a bike, this statistic tells you that in 2199 cases, the model predicted someone would not buy a bike when in fact he did. Finally, the cell that contains the value 6933 indicates the number of true positives for the target value of 1. In other words, in 6933 cases the model correctly predicted that someone would buy a bike.
By summing the values in cells that are diagonally adjacent, you can determine the overall accuracy of the model. One diagonal tells you the total number of accurate predictions, and the other diagonal tells you the total number of erroneous predictions. So to continue our example, for the decision tree model the correct predictions are as follows: 6434 + 6933 = 13367. And the incorrect predictions are expressed in this way: 2918 + 2199 = 5117. You can repeat this process to determine which model is predicting most accurately.
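To reduce these counts to a single overall accuracy figure, divide the correct predictions by the total number of test cases. For the decision tree model in this example, that is 13367 / (13367 + 5117) = 13367 / 18484, or roughly 72 percent.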
FIGURE 13-13 A classification matrix shows your actual and false positive (and negative) prediction values.
So, to evaluate across the models included in the Targeted Mailing sample structure, you might want to produce a summary table that looks like the one in Table 13-1.

TABLE 13-1  Sample Results for Targeted Mailing Structure

Type of Model       Correct    Incorrect
TM Decision Tree    13367      5117
TM Clustering       11151      7333
TM Naïve Bayes      11671      6813
TM Neural Net       12237      6247
You can also use the classification matrix to view the results of predicted values that have more than two possible states. To facilitate this activity, SQL Server Books Online recommends copying the values to the Clipboard, using the button available on the Classification Matrix tab in BIDS, and then pasting that data into Excel (as we did to produce Table 13-1).
Cross Validation In addition to the three validation tools we’ve covered, SQL Server 2008 introduces a new type of validation tool. It works a bit differently than the existing tools. The cross validation tool was added specifically to address requests from enterprise customers. Keep in mind that cross validation does not require separate training and testing datasets. You can use testing data, but you won’t always need to. This elimination of the need for holdout (testing) data can make cross validation more convenient to use for data mining model validation. Cross validation works by automatically separating the source data into partitions of equal size. It then performs iterative testing against each of the partitions and shows the results in a detailed output grid. An output sample is shown in Figure 13-14. Cross validation works according to the value specified in the Fold Count parameter on the Cross Validation tab of the Mining Accuracy Chart tab in BIDS. The default value for this parameter is 10, which equates to 10 sets. If you’re using temporary mining models to cross validate in Excel 2007, 10 is the maximum number of allowable folds. If you’re using BIDS, the maximum number is 256. Of course, a greater number of folds equates to more processing overhead. For our sample, we used the following values: Fold Count of 4, Max Cases of 100, Target Attribute of Bike Buyer, and Target State of 1. Cross validation is quite computationally intensive, and we’ve found that we often need to adjust the input parameters to achieve an appropriate balance between performance overhead and results.
FIGURE 13-14 The new cross validation capability provides sophisticated model validation.
You can also implement cross validation using newly introduced stored procedures. See the SQL Server Books Online topic “Cross Validation (Analysis Services – Data Mining)” for more detail. Note that the output shown contains information similar to that displayed by the classification matrix—that is, true positive, false positive, and so on. A reason to use the new cross-validation capability is that it’s a quick way to perform validation using multiple mining models as source inputs. Note Cross validation cannot be used to validate models built using the Time Series or Sequence Clustering algorithms. This is logical if you think about it because both of these algorithms depend on sequences and if the data was partitioned for testing, the validity of the sequence would be violated.
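Returning to the stored procedure approach mentioned above, the following is a rough sketch of what such a call might look like from a DMX query window, using the fold count, max cases, target attribute, and target state values from our sample. The procedure name is genuine, but treat the parameter order shown here as an approximation and confirm it against the Books Online topic before relying on it:

CALL SystemGetCrossValidationResults(
    [Targeted Mailing],
    [TM Decision Tree],
    4,
    100,
    'Bike Buyer',
    '1',
    NULL
)

The rowset returned contains rows per partition, per model, and per test measure, much like the grid shown in Figure 13-14.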
After you've validated your model or models, you might want to go back, make some changes to improve validity, and then revalidate. As mentioned, such changes can include building new models (using different algorithms), changing algorithm parameter values, adding or removing source columns, reconfiguring attribute values, and more. We find model building to be an iterative process. A few best practices are important to remember here, however:
■■ The cleaner the source data is, the better. In the world of data mining, there are already many variables. One variable that can diminish the value of the results is messy or dirty data.
■■ Algorithms are built for specific purposes; use the correct tool for the job. Some algorithms are more easily understood than others. For example, it's obvious what you'd use the Time Series algorithm for. It might be less obvious, at least at first, when to use Naïve Bayes and Clustering.
■■ Stick with parameter default values when you're first starting. Again, there are many variables, and until you have more experience with both data mining capabilities and the source data, reduce the number of variables.
Again, at this point, you might be done and ready to deploy your mining models (as reports) for end users to work with and to use as a basis for decision support. There is an additional capability that some of you will want to use in your data mining projects: the ability to query your mining models. Specifically, you can use your model as a basis for making predictions against new data. This is accomplished using DMX prediction-type queries. As with the validation process, BIDS provides you with a graphical user interface, called Mining Model Prediction, that allows you to visually develop DMX prediction queries. We’ll explore that in the next section.
Data Mining Prediction Queries
Data mining prediction queries can be understood as conceptually similar to relational database queries. That is, this type of query can be performed ad hoc (or on demand) using the GUI query-generation tools included in BIDS (or SSMS), or by writing the DMX query statement by hand. A more interesting use of DMX prediction queries is, however, in application integration. Just as database queries, whether relational or multidimensional, can be integrated into client applications, so too can mining model prediction queries. An example of a business use of such a capability is assessing an applicant's viability for a bank loan. The SQL Server 2008 Data Mining Add-ins for Office 2007 introduce a Prediction Calculator to the Table Tools Analyze tab on the Excel 2007 Ribbon. We'll cover this capability in more detail in Chapter 25, "SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007." The idea is that new input data can be used as a basis for the existing mining model to predict a value. We believe that custom application integration provides a significant opportunity for application developers to harness the power of SSAS data mining in their own applications.
First, of course, we'll need to review the mechanics of how to perform DMX prediction queries. Two types of queries are supported in the BIDS interface: batch and singleton. Singleton queries take a single row of data as input. Batch queries take a dataset as input. There are three viewers on the Mining Model Prediction tab. They are the visual DMX query builder, the SQL (native DMX statement) query builder, and the Results view. The first step in creating a query is to select an input table. (Batch query is the default.) If you want to use a singleton query, click the third button from the left on the toolbar. After doing so, the Input Table dialog box changes to a single input dialog box.
For the purposes of our discussion, we're using the sample Target Mailing data mining structure and the TM Decision Tree data mining model. We've configured vTargetMail as the input table. Of course, this input view is the same one we used to build the mining model. For a real-world query, you use new source data. Figure 13-15 shows that the column name mappings are detected automatically in the query interface. If you want to view, alter, or remove any of these mappings, just right-click on the designer surface and then click Modify Connections. You'll be presented with a window where you can view, alter, or remove connections between source and destination query attributes (columns) for batch queries.
FIGURE 13-15 The Mining Model Prediction tab (top portion) helps you build DMX predict queries visually by allowing you to configure input values.
After you’ve configured the source and new input values, the next step in building your DMX prediction query is to configure the query itself. Using the BIDS designer, you can either type the DMX into the query window or you can use the guided designer. If using the guided designer, you’ll first configure the Source column. There you can select the mining model, the input, a prediction function, or a custom expression. This is shown in Figure 13-16. You’ll most commonly select a prediction function as a source. We’ll talk a bit more about the prediction functions that are available in DMX a bit later in this chapter.
FIGURE 13-16 The Mining Model Prediction tab (bottom portion) helps you build DMX predict queries visually by allowing you to select functions or to write criteria.
Next you configure the Field value. If you choose a prediction function as a source, you’ll select from the available prediction functions to configure the Field text box. After you select a particular prediction function, the Criteria/Argument text box is filled in with a template showing you the required and optional arguments for the function you’ve selected. For example, if you select the Predict function, the Criteria text box contains <Scalar column reference>[, EXCLUDE_NULL|INCLUDE_NULL][, INCLUDE_NODE_ID]. This indicates that the scalar column reference is a required argument (denoted by the angle brackets) and that the other arguments are optional (denoted by the square brackets). For our example, we’ll simply replace the column reference with the [Bike Buyer] column. You can see the DMX that has
been generated by clicking the Query (SQL) button at the left end of the toolbar. For our example, the generated DMX is as follows:

SELECT
  Predict([Bike Buyer])
From
  [TM Decision Tree]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT
       [MaritalStatus], [Gender], [YearlyIncome], [TotalChildren],
       [NumberChildrenAtHome], [HouseOwnerFlag], [NumberCarsOwned],
       [CommuteDistance], [Region], [Age], [BikeBuyer]
     FROM
       [dbo].[vTargetMail]') AS t
ON
  [TM Decision Tree].[Marital Status] = t.[MaritalStatus] AND
  [TM Decision Tree].[Gender] = t.[Gender] AND
  [TM Decision Tree].[Yearly Income] = t.[YearlyIncome] AND
  [TM Decision Tree].[Total Children] = t.[TotalChildren] AND
  [TM Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome] AND
  [TM Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag] AND
  [TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned] AND
  [TM Decision Tree].[Commute Distance] = t.[CommuteDistance] AND
  [TM Decision Tree].[Region] = t.[Region] AND
  [TM Decision Tree].[Age] = t.[Age] AND
  [TM Decision Tree].[Bike Buyer] = t.[BikeBuyer]
To execute the query, just click the Result button at the left end of the toolbar. To help you understand the capabilities available when using prediction queries, we’ll take a closer look at the prediction functions. See the Books Online topic “Data Mining Extensions (DMX) Function Reference” for more information. To explore DMX, we’ll open SSMS to look at the DMX prediction template queries that are included as part of all SSAS templates. Let’s take a look at both of those next.
DMX Prediction Queries The Data Mining Extensions language is modeled after the SQL query language. You probably recognized the SQL-like syntax of SELECT…FROM…JOIN…ON from the code generated in the preceding example. Note the use of the DMX Predict function and the PREDICTION JOIN keyword. We’ll start by looking at both of these commonly used elements.
The syntax for PREDICTION JOIN is much like the syntax for the Transact-SQL JOIN keyword. The difference is that, for the former, the source table is a data mining model and the joined table is the new input data. The connection to the new input data is made either using the SQL keyword OPENROWSET or OPENQUERY. OPENROWSET requires the connection credentials to be included in the query string, so OPENQUERY is the preferred way to connect. OPENQUERY uses a data source that is defined in SSAS. There is one variation for the PREDICTION JOIN syntax. It’s to use NATURAL PREDICTION JOIN. You can use this when the source columns are identical to the attribute (column) names in the mining model. NATURAL PREDICTION JOIN is also commonly used when you’re performing a singleton (that is, single value as input) prediction query. By now, you might be wondering what common query scenarios occur in the world of data mining. In the world of application integration, there are generally three types of queries: content, prediction, and prediction join. Content queries simply return particular data or metadata from a processed mining model. Prediction queries can be used to execute predictions, such as forward steps in a time series. Prediction join queries are a special type of prediction query in that they use data from a processed model to predict values in a new dataset. Note There are other types of DMX queries related to model administration. These include structure, model content, and model management. You might also remember that both PMML and XMLA are involved in data mining object management. These topics aren’t really in the scope of our discussion on DMX for developers.
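To illustrate the first of these query types, a content query needs no input data at all; it simply reads the patterns that were learned when the model was processed. A minimal sketch against the sample model used in this chapter (assuming the model is named TM Decision Tree) might look like this:

SELECT NODE_CAPTION, NODE_SUPPORT
FROM [TM Decision Tree].CONTENT

This returns one row per node in the trained model, which is a handy way to inspect a model programmatically without opening the graphical viewers.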
To take a look at the included DMX templates in SSMS, connect to SSAS in SSMS, display the Template Explorer, and then click on the DMX node to view and work with the included templates. Figure 13-17 shows the included DMX templates. Note that the templates are categorized by language (DMX, MDX, and XMLA) and then by usage—that is, by Model Content, and so on. We’ve dragged a couple of the prediction queries to the query design window so that we can discuss the query structure for each of them. In the first example, Base Prediction, you can clearly see the familiar SQL-like keyword syntax of SELECT…FROM…(PREDICTION) JOIN…ON… WHERE. Note that all DMX keywords are colored blue, just as keywords in Transact-SQL are. As mentioned previously, values between angle brackets represent replaceable parameters for required arguments and values between square brackets represent optional arguments. The second example, Nested Prediction, shows the syntax for this type of query. Nested predictions use more than one relational table as source data. They require that the data in the source tables has a relationship and is modeled as a CASE table and as a NESTED table. For more information, see the Books Online topic “Nested Tables (Analysis Services—Data Mining)” at http://msdn.microsoft.com/en-us/library/ms175659.aspx. Note the use of the following keywords to create the nested hierarchy: SHAPE, APPEND, and RELATE.
FIGURE 13-17 The Template Explorer in SSMS provides you with three different types of DMX queries.
The last code example shown in Figure 13-17 shows the syntax for a singleton prediction query. Here you’ll note the use of the syntax NATURAL PREDICTION JOIN. This can be used because the input column names are identical to those of the mining model.
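To make the syntax concrete, here is a minimal hand-written singleton query of the kind the designer generates, using a few column names from the sample TM Decision Tree model; the input values are purely illustrative. Because NATURAL PREDICTION JOIN matches on names, each alias in the inner SELECT must exactly match a mining model column:

SELECT
    Predict([Bike Buyer]) AS [Predicted Bike Buyer],
    PredictProbability([Bike Buyer]) AS [Probability]
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT
    35 AS [Age],
    'M' AS [Gender],
    60000 AS [Yearly Income],
    2 AS [Number Cars Owned]) AS t

Columns you leave out of the inner SELECT are treated as missing inputs, which is often acceptable for a quick what-if prediction.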
DMX Prediction Functions DMX contains many prediction functions. SQL Server Books Online provides a good description of each of them. See the topic “Data Mining Extensions (DMX) Function Reference.” The core function in the DMX prediction library is the Predict function. It can return either a single, specific value (scalar) or a set of values (table). Although there are many specific prediction functions that are designed to work with specific algorithms, the query syntax is somewhat simplified by the fact that the Predict function itself supports polymorphism. What this means is that you can often simply use the Predict function rather than a more specific type of prediction, such as PredictAssociation, and SSAS automatically uses the appropriate type of prediction function for the particular algorithm (model) being queried.
The Predict function includes several options. The availability of these options is dependent on the return type specified—that is, scalar or table. For tabular results, you can specify these options:
■■ Inclusive  Includes everything in the results
■■ Exclusive  Excludes source data in the results
■■ Input_Only  Includes only the input cases
■■ Include_Statistics  Includes statistical details; returns as two additional columns, $Probability and $Support
■■ Top count  Returns the top n rows, with n specified in the argument
For scalar return types, you can specify either the Exclude_Null or Include_Null option. You can write and test your prediction queries in BIDS, SSMS, or any other tool that supports DMX queries. Excel 2007 also executes DMX queries after it has been configured with the Data Mining Add-ins. If you want to see the exact syntax of a DMX query that some end-user application runs, you can use SQL Server Profiler to capture the DMX query activity. We described the steps to use SQL Server Profiler to do this in an earlier chapter.
Now that you understand the basic prediction query syntax and options, let's take a closer look at the prediction functions that are available as part of DMX. Table 13-2 lists each function, its return type, and a brief description.
With a better understanding of the available predictive functions in the DMX language, we'll return to the Mining Model Prediction tab in BIDS. We've switched the query view from the default of batch input to singleton input by right-clicking on the designer surface and then selecting Singleton Query from the shortcut menu. Next, we configured the values for the singleton input by selecting values from the drop-down lists next to each Mining Model Column entry in the Singleton Query Input pane. Following that, we configured the query portion (bottom section) by setting the Source to a Prediction Function, then selecting the Predict function in the Field area, and finally by dragging the Bike Buyer field from the Mining Model pane to the Criteria/Argument area. We then clicked the Query Mode button (first button on the toolbar) and then clicked the Query View button in the drop-down button list. This results in the DMX query shown in Figure 13-18. If you'd like to execute the query you've built, click the Query Mode button and then click the Result button.
TABLE 13-2  DMX Prediction Functions

Function                   Returns    Notes
Predict                    Scalar     Core function
PredictSupport             Scalar     Count of cases that support predicted value
PredictVariance            Scalar     Variance distribution for which Predict is the mean (for continuous attributes)
PredictStdev               Scalar     Square root of PredictVariance
PredictProbability         Scalar     Likelihood that the Predict value is correct
PredictProbabilityVar      Scalar     Certainty that the value of PredictVariance is accurate
PredictProbabilityStdev    Scalar     Square root of PredictProbabilityVar
PredictTimeSeries          Table      Predicted value of next n in a time series
Cluster                    Scalar     ClusterID that the input case belongs to with highest probability
ClusterDistance            Scalar     Distance from predicted ClusterID
ClusterProbability         Scalar     Probability value of belonging to predicted Cluster
RangeMid                   Scalar     Midpoint of predicted bucket (discretized columns)
RangeMin                   Scalar     Low point of predicted bucket (discretized columns)
RangeMax                   Scalar     High point of predicted bucket (discretized columns)
PredictHistogram           Rowset     Histogram of each possible value, with predictable column and probability, expressed as $Support, $Variance, $Stdev, $Probability, $Adjusted Probability, $Probability Variance, and $Probability Stdev
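As a quick illustration of one of these functions, PredictHistogram returns a nested rowset, so it is usually paired with the FLATTENED keyword to produce an ordinary tabular result. A minimal sketch against the sample model (column names assumed from the earlier examples) follows:

SELECT FLATTENED PredictHistogram([Bike Buyer])
FROM [TM Decision Tree]
NATURAL PREDICTION JOIN
(SELECT 45 AS [Age], 130000 AS [Yearly Income]) AS t

The result includes one row for each possible Bike Buyer state, along with the supporting statistics ($Support, $Probability, and so on) listed in the table.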
FIGURE 13-18 Singleton queries can be built on the Mining Model Prediction tab in BIDS.
In addition to viewing the query in the Query mode, you can also edit the query directly by typing the DMX code into the window. The BIDS code window color-codes the DMX—for example, by coloring DMX keywords blue—but unfortunately, no IntelliSense is built in to the native BIDS DMX code window. The SSMS DMX query window functions similarly—that is, it has color-coded keywords but no IntelliSense. Note that in the query the type of prediction join configured for the singleton input is automatically set to NATURAL PREDICTION JOIN. This is because the input column names are identical to those of the data mining model to which the join will be performed. As we complete our brief look at DMX, we remind you that we’ve found a number of ways to implement DMX queries. These include ad hoc querying using BIDS or SSMS. We use this method during the early phases of the development cycle. We also sometimes use manual query execution to diagnose and troubleshoot key queries, such as those that might have been generated by end-user client tools. We use SQL Server Profiler to capture the syntax of the DMX query and then tinker with the query using the ad hoc query environments in either BIDS or SSMS. Another use of custom DMX queries is to embed them into custom applications. We think this is an exciting opportunity area for application developers. The potential for integrating data mining predictive analytics in custom applications is a new and wide-open area for most companies. Just one interesting example of custom application integration is to use data mining as a type of input validation in a form-based application. Rather than writing custom application-verification code, you can use a trained data mining model as a basis for building input validation logic. In this example, you send the input values from one or more text boxes on the form as (singleton) inputs to a DMX query. The results can be coded against thresholds to serve as indicators of valid or invalid text box entries. What is so interesting about this use case is that as the source data changes (is improved), the input validation results also improve. You can think of this as a kind of dynamic input validation. Before leaving the subject of DMX, we’d be remiss if we didn’t talk a bit about the integration between SSAS data mining and SSIS. SSIS includes several types of integration with both OLAP cubes and data mining models. In the next section, we’ll look specifically at the integration between data mining and SSIS.
Data Mining and Integration Services We realize that if you’re reading this book in the order that it’s written, this section will be the first detailed look you’ve taken at Integration Services. If you’re completely new to SSIS, you might want to skip this section for now, read at least the introductory chapter on SSIS (which is the next chapter in the book), and then return to this section.
Data mining by DMX query is supported as a built-in item in SSIS as long as you’re using the Enterprise edition of SQL Server 2008. You have two different item types to select from. The first type, Data Mining Query Task, is available in the control flow Toolbox. The editor for this task is shown in Figure 13-19.
FIGURE 13-19 The Data Mining Query Task Editor dialog box allows you to associate a data mining query with the control flow of an SSIS package.
The second type is available in the data flow Toolbox in the transformation section: it’s the Data Mining Query data flow component. As with the control flow task, the data flow component allows you to associate a data mining query with an SSIS package. The difference between the task and the component is, of course, where they are used in your package— one is for control flow, and the other is for data flow. Note that there are three tabs in the dialog box: Mining Model, Query, and Output. On the Mining Model tab, you configure the connection to the SSAS instance using a connection manager pointing to your Analysis Services instance and then select the mining structure. You’ll see a list of mining models that are included in the mining structure you’ve selected. On the Query tab, you configure your DMX query either by directly typing the DMX into the query window or by using the Build New Query button on the bottom right of the Query tab of the Task Editor dialog box. Clicking this button opens the (now familiar) dialog box you saw in BIDS when you were using the Mining Model Prediction tab. This tab allows you to select a data mining model as a source for your query and then to select the source of the query input, just as you did using BIDS. In our case, we selected the TM Decision Tree
model as the source and the vTargetMail view as the input. Then, as we did in the earlier query example using the BIDS query interface, we continued our configuration by adding the Predict function using the [Bike Buyer] column as our argument. Note that the DMX query produced uses PREDICTION JOIN to perform the join between the source and new input data. A sample DMX query is shown in the Data Mining Query Task Editor in Figure 13-20.
FIGURE 13-20 The data mining query task allows you to write, edit, and associate variables with your DMX query.
The tabs within the Query section allow you to perform parameter mapping or to map the result set using variables defined in the SSIS package. The last main tab in the Data Mining Query Task Editor is the Output tab. This allows you to map the output of this task to a table using a connection you define in the SSIS package. If you use the Data Mining Query component in the data flow section, the DMX querybuilding is accomplished in a similar way to using the Data Mining Query task in the control flow section. The difference is that, for the component, you configure the input from the incoming data flow rather than in the task itself. Also, you must connect the data flow task to a data flow destination (rather than using an Output tab on the task itself) to configure the output location for the results of your DMX query. There are additional tasks available in SSIS that relate to SSAS data mining. However, they do not relate to DMX querying. Rather, they relate to model processing. Before we explore these tasks, let’s take a closer look at data mining structure and model processing in general.
Data Mining Object Processing
As we've seen, when you process a mining model using BIDS, a dialog box showing you the process progress is displayed. You might be interested to know that you can use a number of methods or tools to accomplish data mining object processing. These include SSMS, SSIS, Excel 2007, and XMLA script. To understand what structure or model processing entails, let's take a closer look at the options available for processing. For mining structures, you have five options related to processing:
■■ Process Full  Completely erases all information, reprocesses the structure and models, and retrains them using source data.
■■ Process Default  Performs the least disruptive processing function to return objects to a ready state.
■■ Process Structure  Populates only the mining structure (but not any included mining models) with source data.
■■ Process Clear Structure  Removes all training data from a structure. This option is often used during early development cycles, and it works only with structures, not with models.
■■ Unprocess  Drops the structure or model and all associated training data. Again, this option is often used during early development cycles for rapid prototyping iterations.
As with OLAP object processing, during development cycles, it’s quite common to simply run a full process every time you update any structures or models. As you move to deployment, it’s more common to perform (and automate) a more granular type of processing. For data mining objects, this is more commonly accomplished using the Process Default option. So, what exactly happens during processing? Let’s look at the Process Progress dialog box, shown in Figure 13-21, to increase our understanding. This dialog box looks just like the one that is displayed when you process OLAP objects. The difference is in how the data mining objects are processed and what the shape of the results is. SSAS data mining uses the SQL INSERT INTO and OPENROWSET commands to load the data mining model with data during the processing or training phase. If a nested structure is being loaded, the SHAPE command is used as well. If there are errors or warnings generated during processing, they are displayed in the Process Progress dialog box as well. One interesting warning to keep an eye out for is the updated automatic feature selection warning. Feature selection is automatically applied when SSAS finds that the number of attributes fed to your model would result in excessive processing overhead without improving the value of the results. It works differently, depending on which algorithm you are using. Very broadly, feature selection uses one or
more methods of attribute scoring to score each source attribute. Calculated score values are then used by various ranking algorithms to decide whether or not to include those attributes in the model’s training data. You can manually tune feature selection by configuring the maximum number of input or output attributes, or by number of states. For more information about how feature selection is applied when models built on various algorithms are trained, see the SQL Server Books Online topic “Feature Selection in Data Mining” at http://msdn.microsoft.com/en-us/library/ms175382.aspx.
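As noted above, SSAS trains mining objects by issuing DMX INSERT INTO statements against them. The following is a simplified sketch of such a training statement that you could also run yourself; the structure name and column list are assumed to match the sample used in this chapter, and a real statement would normally list every structure column:

INSERT INTO MINING STRUCTURE [Targeted Mailing]
    ([Customer Key], [Age], [Gender], [Yearly Income], [Bike Buyer])
OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Age, Gender, YearlyIncome, BikeBuyer
     FROM dbo.vTargetMail')

Processing the structure this way also trains every mining model the structure contains in a single pass over the source data.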
FIGURE 13-21 Process Progress dialog box
A final consideration is to understand that, as with OLAP objects, you can use included control flow tasks or data flow components in SSIS to automate the processing of data mining objects. One of these tasks is the Analysis Services processing task. This task is found in the Toolbox in SSIS. It allows you to configure any type of processing for any mining model or structure as part of an SSIS control flow and resulting executable package. In addition to the task already mentioned, in the data flow section of SSIS, there is an additional built-in component that you can use to automate data mining object processing. This component is the data mining model training component. It’s available only in the Enterprise edition of SQL Server 2008 SSIS. It’s used as a destination, and it requires input columns from a data source in the SSIS data flow. In this component, on the Columns tab, you configure the column mapping that will be used to load (or train) the destination data mining model. This consists of matching the data flow (input) columns to the destination (mining structure) columns.
In addition to using SSIS to automate data mining object processing, you can use other methods, including SSMS and XMLA scripting. One final consideration when you're moving from development to deployment is security for your data mining structures and models. As with OLAP objects, the simplest way to define custom security settings is to create and configure roles using either BIDS or SSMS. Also common to both data mining and OLAP is the fact that by default only an administrator can access any of the objects. You might remember from an earlier chapter that SSAS roles allow you to define permissions at the level of the entire database (which includes all data sources and all cubes and mining structures) or much more granularly, such as read permission on a single data source, cube, dimension, or mining structure. We show the mining structure security interface in Figure 13-22. In this interface, you set the level of access (None or Read), whether to enable drillthrough, whether to allow the definition to be read, and whether to allow processing.
FIGURE 13-22 Security role configuration options for data mining models
Data Mining Clients
There is one final important consideration in the data mining software development life cycle—that is, what the end-user client applications will be. We devote several chapters in Part IV to this topic. Whatever client or clients you choose for your BI project, it's critical that you investigate their support for advanced features (for example, drillthrough) of SQL Server 2008 data mining before you build such features into your models using BIDS. In Chapter 24, "Microsoft Office 2007 as a Data Mining Client," we examine the integration between Microsoft Office 2007 and SQL Server 2008 data mining. This integration feature set is currently available in Excel 2007 and Visio 2007. We'll also provide information about integration between SSRS and SSAS data mining in Chapter 20, "Creating Reports in SQL Server 2008 Reporting Services." We'll take a quick look at Office SharePoint Server 2007 and PerformancePoint Server data mining integration as well in Chapter 25. And, recall that in Chapter 12 we looked at other methods of implementing data mining clients, both purchased and built, including the data mining controls that can be embedded into custom .NET applications.
Summary
In this chapter, we looked at the data mining structure and model creation wizards in BIDS. We then reviewed the included capabilities for validating your mining models. These processes include lift and profit charts, as well as the classification matrix and the newly introduced cross validation capability. We then explored the Mining Model Prediction tab in BIDS. This led us to a brief look at DMX itself. We followed that by looking at DMX query and mining model processing integration with SSIS. This completes the second major section of this book. By now, you should be comfortable with OLAP cube and data mining concepts and their implementation using BIDS. In the next chapter, we'll take a deeper look at the world of ETL using SSIS. Be aware that we'll include information about integrating SSRS and other clients with SSAS objects in the section following the SSIS section.
Part III
Microsoft SQL Server 2008 Integration Services for Developers
Chapter 14
Architectural Components of Microsoft SQL Server 2008 Integration Services In Part II, we looked at the ways in which Microsoft SQL Server 2008 Analysis Services (SSAS) delivers a set of core functionality to enable analysis of data warehouse data within a business intelligence (BI) solution. But how does the data get from the disparate source systems into the data warehouse? The answer is Microsoft SQL Server 2008 Integration Services (SSIS). As the primary extract, transform, and load (ETL) tool in the SQL Server toolset, Integration Services provides the core functionality required to extract data from various source systems that contain the information for a BI solution, transform the data into forms required for analysis, and load the data into the data warehouse. These source systems can include RDBMS systems, flat files, XML files, and any other source to which Integration Services can connect. SSIS can do much more than simply perform ETL for a BI solution, but in this part of the book we’re going to focus primarily on the use of SSIS as an ETL platform for loading the data warehouse in a BI solution. In this part, we will also cover best practices and common techniques for preparing data for loading into an SSAS data mining model. This chapter introduces SSIS from an architectural perspective by presenting the significant architectural components of the SSIS platform: what pieces make up SSIS, how they work together, and how you as an SSIS developer work with them to build the ETL components for your BI solution. We will look at the major building blocks of SSIS, including the SSIS runtime, packages and their components, and the tools and utilities through which you will design, develop, and deploy ETL solutions. By the end of this chapter you will have the technical foundation for the content presented in the chapters that follow. You will also have an understanding of most of the core capabilities of the SSIS platform. Developers new to SQL Server 2008 sometimes ask us, “Why should I use SSIS rather than more traditional methods of data management, such as Transact-SQL scripts?” We need more than one chapter to answer this question completely; however, our goal for this chapter is to give you a preliminary answer, based mostly on an understanding of SSIS architecture.
Overview of Integration Services Architecture
The Integration Services platform includes many components, but at the highest level, it is made up of four primary parts:
■■ Integration Services runtime  The SSIS runtime provides the core functionality necessary to run SSIS packages, including execution, logging, configuration, debugging, and more.
■■ Data flow engine  The SSIS data flow engine (also known as the pipeline) provides the core ETL functionality required to move data from source to destination within SSIS packages, including managing the memory buffers on which the pipeline is built and the sources, transformations, and destinations that make up a package's data flow logic.
■■ Integration Services object model  The SSIS object model is a managed .NET application programming interface (API) that allows tools, utilities, and components to interact with the SSIS runtime and data flow engine.
■■ Integration Services service  The SSIS service is a Windows service that provides functionality for storing and managing SSIS packages.
Note The Integration Services service is an optional component of SSIS solutions. Unlike the SQL Server service, which is responsible for all access to and management of the data in a SQL Server database, the SSIS service is not responsible for the execution of SSIS packages. This is often a surprise to developers new to SSIS, who expect that a service with the same name as the product will be the heart of that product. This isn’t the case with SSIS. Although the Integration Services service is useful for managing deployed SSIS solutions and for providing insight into the status of executing SSIS packages, it is perfectly possible to deploy and execute SSIS packages without the SSIS service running. These four key components form the foundation of SSIS, but they are really just the starting point for examining the SSIS architecture. Of course the principal unit of work is the SSIS package. We’ll explain in this chapter how all of these components work together to allow you to build, deploy, execute, and log execution details for SSIS packages. Figure 14-1 illustrates how these four parts relate to each other and how they break down into their constituent parts. Notice in Figure 14-1 that the metacontainer is an SSIS package. It is also important to note that the primary component for moving data in an SSIS package is the Data Flow task. We’ll spend the rest of the chapter taking a closer look at many of the architectural components shown in Figure 14-1. In addition to introducing each major component and discussing how the components relate to each other, this chapter also discusses some of the design
goals of the SSIS platform and how this architecture realizes these goals. After this introductory look at SSIS architecture, in subsequent chapters we'll drill down to examine all of the components shown in Figure 14-1.

FIGURE 14-1 The architecture of Integration Services
Note When comparing SSIS to its predecessor, DTS, it’s not uncommon to hear someone describe SSIS as “the new version of DTS,” referring to the Data Transformation Services functionality included in SQL Server 7.0 and SQL Server 2000. Although SSIS and DTS are both ETL tools, describing SSIS as a new version of DTS does not accurately express the scope of differences between these two components. In a nutshell, the functionality of DTS was not what Microsoft’s customers wanted, needed, or expected in an enterprise BI ETL tool.
SSIS was completely redesigned and built from the ground up in SQL Server 2005 and enhanced in SQL Server 2008. If you work with the SSIS .NET API, you will still see the term DTS in the object model. This is because the name SSIS was chosen relatively late in the SQL Server 2005 development cycle—too late to change everything in the product to match the new marketing name without breaking a great deal of code developed by early adopters— not because code is left over from DTS in SSIS.
Integration Services Packages As mentioned earlier, when you develop an ETL solution by using SSIS, you’re going to be developing packages. SSIS packages are the basic unit of development and deployment in SSIS and are the primary building blocks of any SSIS ETL solution. As shown in Figure 14-1, the package is at the heart of SSIS; having SSIS without packages is like having the .NET Framework without any executables. An SSIS package is, in fact, a collection of XML that can be run, much like an executable file. Before we examine the core components of an SSIS package, we’ll take a brief look at some of the tools and utilities available to help you to develop, deploy, and execute packages.
Tools and Utilities for Developing, Deploying, and Executing Integration Services Packages SSIS includes a small set of core tools that every SSIS developer needs to use and understand. SSIS itself is an optional component of a SQL Server 2008 installation. Installing the SSIS components requires an appropriate SQL Server 2008 license. Also, some features of SSIS require the Enterprise edition of SQL Server 2008. For a detailed feature comparison, see http://download.microsoft.com/download/2/d/f/2df66c0c-fff2-4f2e-b739-bf4581cee533/ SQLServer%202008CompareEnterpriseStandard.pdf. This installation includes a variety of tools to work with SSIS packages. Additional tools are included with SQL Server 2008, as well as quite a few add-ons provided by Microsoft, third parties, and an active community of open source SSIS developers. However, the tools introduced here—SQL Server Management Studio, Business Intelligence Development Studio, and the command-line utilities DTEXEC, DTEXECUI, and DTUTIL—are enough to get you started.
SQL Server Management Studio As you know from earlier chapters, SQL Server Management Studio (SSMS) is used primarily as a management and query tool for the SQL Server database engine, but it also includes the ability to manage SSAS, SSRS, and SSIS. As shown in Figure 14-2, the Object Explorer window in SSMS can connect to an instance of the SSIS service and be used to monitor running packages or to deploy packages to various package storage locations, such as the msdb database or the SSIS Package Store.
Chapter 14
Architectural Components of Microsoft SQL Server 2008 Integration Services
439
FIGURE 14-2 SSMS Object Explorer displays SSIS packages.
It’s important to keep in mind that you cannot use SSMS to develop SSIS packages. Package development is performed in BIDS, not SSMS. ETL developers who have worked with DTS in SQL Server 7.0 and SQL Server 2000 often look for package-development tools in SSMS because SSMS is the primary replacement for SQL Server Enterprise Manager, and Enterprise Manager was the primary development tool for DTS packages. The advent of SSIS in SQL Server 2005, however, brings different tools for building and managing packages. Note The Import/Export Wizard is the one exception to the “you can’t develop SSIS packages in SSMS” concept. You start this tool by right-clicking any user database in the SSMS Object Explorer, clicking Tasks, and then clicking either Import Data or Export Data. The Import/Export Wizard allows you to create simple import or export packages. In this tool, you select the data source, destination, and method you will use to copy or move the objects to be imported or exported. You can choose to select the entire object or you can write a query to retrieve a subset. Objects available for selection include tables or views. You can save the results of this wizard as an SSIS package and you can subsequently edit this package in the SSIS development environment.
Business Intelligence Development Studio The primary tool that SSIS package developers use every day is the Business Intelligence Development Studio (BIDS). We’ve referred to BIDS in several earlier chapters, and in Part II, you saw its uses in SQL Server Analysis Services. BIDS also contains the necessary designers for creating, editing, and debugging Integration Services components, which include control flows and data flows, event handlers, configurations, and anything else that packages need. As with SSAS development, SSIS package development using BIDS is accomplished with wizards and visual tools in the integrated development environment. We will use BIDS for the majority of our subsequent explanations and examples for SSIS. Just as with SSAS OLAP cubes and data mining models, as we explore SSIS we’ll use the sample SSIS packages available for download at http://www.CodePlex.com. These packages revolve around the AdventureWorksDW scenario that we have already been using.
If you have a full version of Microsoft Visual Studio 2008 installed on your development computer and you subsequently install SSIS, the SSIS templates will be installed to your existing Visual Studio instance.
Upgrading Integration Packages from an Earlier Version of SQL Server
BIDS includes a wizard called the SSIS Package Upgrade Wizard that will automatically open if you attempt to open an SSIS package designed in SQL Server 2005. This wizard only launches the first time you attempt to open a SQL Server 2005 package. Thereafter, BIDS automatically loads the packages and gives you the choice to upgrade those packages upon processing. This wizard includes a number of options, including the ability to assign a password, a new GUID, and more. The default options include updating the connection string, validating the package, creating a new package ID, continuing with the package upgrade when a package upgrade fails, and backing up the original package. Like many other SQL Server 2008 wizards, this one includes a final List Of Changes dialog box, the ability to move back at any time, and a final dialog box showing conversion steps and the success or failure of each step. If you have packages created in SQL Server 2000, you can attempt to upgrade them or you can install the backward-compatible run-time engine for DTS packages and run those packages using that engine.
DTEXEC and DTEXECUI After you create packages by using BIDS, they will be executed. Although packages can be executed and debugged within the BIDS environment, this is generally only the case during the development phase of a BI project. Executing packages through BIDS can be significantly slower, largely because of the overhead of updating the graphical designers as the package executes. Therefore, you should use another utility in production scenarios, and that utility is the DTEXEC.exe command-line utility. Like many SQL Server command-line tools, DTEXEC provides a wide variety of options and switches. This capability is very powerful and allows you to configure advanced execution settings. We’ll cover these capabilities as we continue through this section of the book. For now, we’ll get started with an example. The following switches execute a package from the file system, specify an additional XML configuration file, specify that validation warnings should be treated as errors, and specify that errors and information events should be reported by the runtime to any logging providers: DTEXEC.EXE /FILE "C:\Package.dtsx" /CONFIGFILE "C:\ConfigFile.dtsConfig" /WARNASERROR /REPORTING EI
Fortunately, SSIS developers do not need to memorize all of these command-line options, or even look them up very often—DTEXECUI.exe, also known as the Execute Package Utility, is
available for this. As its name implies, DTEXECUI is a graphical utility that can execute packages, but it has additional capabilities as well. As shown in Figure 14-3, DTEXECUI allows developers and administrators to select execution options through a variety of property pages and be presented with the DTEXEC command-line switches necessary to achieve the same results through the DTEXEC command-line utility.
FIGURE 14-3 DTEXECUI—the Execute Package Utility
Because many SSIS packages are executed as part of a scheduled batch process, these two utilities complement each other very well. You can use the DTEXECUI utility to generate the command-line syntax, and then execute it using DTEXEC as part of a batch file or SQL Server Agent job.
DTUTIL

DTUTIL.exe is another command-line utility; it is used for SSIS package deployment and management. With DTUTIL, developers and administrators can move or copy packages to the msdb database, to the SSIS Package Store (which allows you to further group packages into child folders viewable in SSMS), or to any file system folder. You can also use DTUTIL to encrypt packages, set package passwords, and more. Unlike DTEXEC, DTUTIL unfortunately has no corresponding graphical utility, but SQL Server Books Online contains a comprehensive reference to the available switches and options.
The Integration Services Object Model and Components

Although the tools included with SSIS provide a great deal of out-of-the-box functionality, on occasion these tools do not provide everything that SSIS developers need. For these situations, SSIS provides a complete managed API that can be used to create, modify, and execute packages. The .NET API exposes the SSIS runtime, with classes for manipulating tasks, containers, and precedence constraints, and the SSIS data flow pipeline, with classes for manipulating sources, transformations, and destinations. Through the SSIS API, developers can also build their own custom tasks, transformations, data sources and destinations, connection managers, and log providers. SSIS also includes a Script task for including custom .NET code in the package's control flow and a Script component for including custom .NET code in the package's data flow as a source, destination, or transformation. In addition, new to SQL Server 2008 is the ability to write custom code in C#. (Previously you were limited to using Microsoft Visual Basic .NET.) In Chapter 19, "Extending and Integrating SQL Server 2008 Integration Services," we'll take a closer look at working with the SSIS API. Now we'll consider the physical makeup of an SSIS package. SSIS packages are stored as XML files, typically with a .dtsx file extension, but because of the flexibility that SSIS provides for building, executing, and storing packages, it's not always that straightforward. Despite this, if you think of a C# application as being made up of .cs files, thinking of SSIS applications as being made up of .dtsx files isn't a bad place to start. In the chapters ahead we'll see that this isn't strictly true, but for now it's more important to start on familiar footing than it is to be extremely precise. In this chapter we'll cover some of the most common scenarios around SSIS packages, and in later chapters we'll dive into some of the less-used (but no less powerful) scenarios. SSIS packages are made up of the following primary logical components:

■ Control flow
■ Data flow
■ Variables
■ Expressions
■ Connection managers
■ Event handlers and error handling
Control Flow

The control flow is where the execution logic of the package is defined. A package's control flow is made up of tasks that perform the actual work of the package, containers that provide
looping and grouping functionality, and precedence constraints that determine the order in which the tasks are executed. Figure 14-4 shows the control flow from a real-world SSIS package; each rectangle is an individual task, and the tasks are connected by colored arrows, which are precedence constraints. Notice that in our example we've followed the important development best practice of using meaningful task names along with detailed task annotations. We'll go into much greater detail in later chapters, but to get you started, here are some of the included tasks: Bulk Insert, Data Flow, Execute Process, Execute SQL, Script, and XML. Some new tasks have been introduced in SQL Server 2008, most notably the Data Profiling task.
Figure 14-4 A sample control flow
You can think of a package’s control flow as being similar to the “main” routine in traditional procedural code. It defines the entry point into the package’s logic and controls the flow of that logic between the package’s tasks, not unlike the way the “main” routine controls the flow of program logic between the program’s subroutines in a traditional Visual Basic or C# program. Also, you have quite a few options for configuring the precedence constraints between tasks. These include conditional execution, such as Success, Failure, or Completion, and execution based on the results of an expression. As with much of the information introduced here, we’ll be spending quite a bit more time demonstrating how this works and suggesting best practices for the configuration of these constraints and expressions in later chapters.
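To make the relationship between tasks and precedence constraints more concrete, here is a minimal C# sketch that uses the SSIS managed API to build a trivial control flow in code: two Execute SQL tasks connected by a Success constraint. This is only an illustrative sketch; the task names are hypothetical, and building a control flow in code is simply an alternative view of what the BIDS designer does when you drag tasks onto the surface and connect them with arrows.

using Microsoft.SqlServer.Dts.Runtime;

// Build a minimal package with two Execute SQL tasks in its control flow.
Package package = new Package();
Executable truncate = package.Executables.Add("STOCK:SQLTask"); // first task
Executable load = package.Executables.Add("STOCK:SQLTask");     // second task

// Give the tasks meaningful names (a best practice noted earlier).
((TaskHost)truncate).Name = "Truncate Staging Table";  // hypothetical name
((TaskHost)load).Name = "Load Staging Table";          // hypothetical name

// Connect the tasks: run "load" only if "truncate" succeeds.
PrecedenceConstraint constraint =
    package.PrecedenceConstraints.Add(truncate, load);
constraint.Value = DTSExecResult.Success; // conditional execution on Success

In BIDS you accomplish the same thing by dragging the arrow from one task to another and editing the constraint; the point here is simply that tasks and precedence constraints are first-class runtime objects.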
As you can probably imagine, some control flows can become quite complex, but this is frequently not necessary. Many real-world packages have very simple control flow logic; it’s the data flow where things start to get really interesting.
Data Flow The data flow is where a package’s core ETL functionality is implemented. While each package has one primary control flow, a package can have zero, one, or many data flows. Each data flow is made up of data source components that extract data from a source system and provide that data to the data flow, transformations that manipulate the data provided by the data source components, and data destination components that load the transformed data into the destination systems. It is very important that you understand that SSIS attempts to perform all data transforms completely in memory, so that they execute very quickly. A common task for SSIS package designers for BI solutions is to ensure that package data flows are designed in such a way that transforms can run in the available memory of the server where these packages are executed. Figure 14-5 shows a portion of the data flow from a real-world SSIS package; each rectangle is a data source component, transformation component, or data destination component. The rectangles are connected by colored arrows that represent the data flow path that the data follows from the source to the destination. On the data flow designer surface you use green arrows to create a path for good data rows and red arrows to create a path for bad data rows. Although you can create multiple good or bad output path configurations, you will receive a warning if you try to configure a data flow transformation or destination before you attach output rows (whether good or bad rows) to it. The warning reads “This component has no available input rows. Do you wish to continue editing the available properties of this component?” Although a package’s control flow and data flow share many visual characteristics (after all, they’re both made up of rectangles connected by colored arrows) it is vitally important to understand the difference between the two. The control flow controls the package’s execution logic, while the data flow controls the movement and transformation of data.
Figure 14-5 Data flow
Tasks vs. Components In an Integration Services package, tasks and components are very different. As mentioned earlier, tasks are the items that make up a package’s control flow logic, and components are the items that make up a package’s data flow logic that is implemented in a Data Flow task. This makes components something like properties on a task, or the value members in a collection property on a task. Why is this distinction so important? Many aspects of SSIS are implemented at the task level. You can enable and configure logging for each task. You can define event handlers for a task. You can add expressions to the properties on a task. You can enable and disable tasks. But none of these activities is available for components. And to make matters worse, quite a few publications will casually refer to components as “tasks,” confusing the reader and muddying the waters around an already complicated topic.
Variables Each SSIS package defines a common set of system variables that provide information (typically read-only information) about the package and its environment. System variables include information such as PackageName and PackageID, which identify the package itself, and
ExecutionInstanceGUID, which identifies a running instance of the package. The values of these variables can be very useful when you create an audit log for the packages that make up your ETL solution. Audit logs are a common business requirement in BI solutions. In addition to these common system variables, package developers can create any number of user variables that can then be used by the package’s tasks, containers, and data flow components to implement the package’s logic. SSIS package variables play essentially the same role as do variables in a traditional application: Tasks can assign values to variables; other tasks can read these values from the variables. In fact, variables are the only mechanism that SSIS provides through which tasks can share information.
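As a quick illustration of how variables are created and read through the managed API, the following C# sketch adds a user variable and reads one of the system variables mentioned above. The variable name and values are arbitrary examples of ours, not anything SSIS requires.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// Create a user variable (name and initial value are hypothetical examples).
Variable rowCount = package.Variables.Add("RowsLoaded", false, "User", 0);

// Read a read-only system variable, for example to build an audit log entry.
string packageName = (string)package.Variables["System::PackageName"].Value;

// A task would later update the user variable to share information; for example:
package.Variables["User::RowsLoaded"].Value = 42;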
SSIS vs. DTS: Task Isolation The fact that you can only use variables to share information between tasks in an SSIS package marks one of the major differences between SSIS and DTS. In DTS, it was possible (and unfortunately quite common) to modify properties of packages and package components from code in other package components. This provided a great deal of flexibility to DTS developers, but also made maintaining DTS packages problematic. Essentially, everything within a DTS package was defined with a global scope so that it could be accessed from anywhere within a package, and this resulted in “spaghetti code” (interspersed program logic rather than cleanly separated units of logic) often being the norm in DTS applications. SSIS, on the other hand, was designed from the ground up to be a true enterprise ETL platform, and one aspect of this design goal was that SSIS solutions needed to be maintainable and supportable. By preventing tasks from making arbitrary changes to other tasks, the SSIS platform helps prevent common errors—such as erroneously altering the value of a variable—and simplifies package troubleshooting and maintenance.
Figure 14-6 shows the Variables window from an SSIS package in BIDS. To open this window, choose Other Windows on the View menu, and then select Variables. Click the third icon on the toolbar to show the system variables. User variables have a blue icon; system variables (which are hidden by default) are shown with a gray icon.
Figure 14-6 The Variables window
Just as in traditional programming environments, each variable in SSIS has a well-defined scope and cannot be accessed from outside that scope. For example, a variable defined at the package scope can be accessed from any task or transformation within the package (assuming that the task or transformation in question knows how to access variables). But a variable defined within the scope of a Data Flow task can only be accessed from transformations within that data flow; it cannot be accessed from any other tasks in the package or from transformations in any other data flows in the package. This results in better quality executables that are easier to read, maintain, and understand. It is also important to understand that when a variable is created, its scope is defined and that scope cannot be changed. To change the scope of a variable, you have to drop and re-create it.
Expressions One of the most powerful but often least understood capabilities provided by the SSIS runtime is support for expressions. According to SQL Server Books Online, “expressions are a combination of symbols (identifiers, literals, functions, and operators) that yields a single data value.” This is really an understatement. As we’ll see, expressions in SSIS provide functionality that is equally difficult to quantify until you see them in action, and the sometimes dry content in SQL Server Books Online doesn’t really do much to explain just how powerful they are. In DTS some of the functionality available in SSIS expressions was called Dynamic Properties, but in SQL Server 2005 and later, SSIS expressions significantly increased the capabilities of associating properties with expressions. SSIS expressions provide a powerful mechanism for adding custom declarative logic to your packages. Do you need to have the value of a variable change as the package executes, so that it reflects the current state of the running package? Just set the variable’s EvaluateAsExpression property to true and set its Expression property to an SSIS expression that yields the necessary value. Then, whenever the variable is accessed, the expression is evaluated and its value is returned. If you don’t find this exciting, consider this: The same thing that you can do with variables you can also do with just about any property on just about any task in your package. Expressions allow you to add your own custom logic to prebuilt components, extending and customizing their behavior to suit your needs. To draw an analogy to .NET development, this is like being able to define your own implementation for the properties of classes that another developer built, without going to the trouble of writing a child class to do it. Figure 14-7 shows the built-in Expression Builder dialog box. We’ll use this tool in later chapters as we review the process and techniques we commonly use to associate custom expressions with aspects of SSIS packages. We’ll go into more depth on expressions and how to use them in the chapters ahead, but they are far too important (and far too cool) to not mention as early as possible.
Figure 14-7 The Expression Builder dialog box
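As a small example of the variable-plus-expression pattern just described, the following C# sketch marks a user variable to be evaluated as an expression; the variable name and the expression text are our own illustrative choices. The expression concatenates a literal string with the System::StartTime variable, cast to a Unicode string with the SSIS (DT_WSTR, n) cast operator.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();
Variable auditMessage = package.Variables.Add("AuditMessage", false, "User", "");

// Evaluate the variable as an expression each time it is read.
auditMessage.EvaluateAsExpression = true;
auditMessage.Expression =
    "\"Package started at \" + (DT_WSTR, 30) @[System::StartTime]";

// Whenever a task reads User::AuditMessage, the expression is evaluated
// and the current value is returned.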
Connection Managers The next part of a package that we’re going to look at in this section is the connection manager. A connection manager is a logical representation of a connection. It acts as a wrapper around a physical database connection (or a file connection, an FTP connection, and so on—it all depends on the type of connection manager) and manages access to that connection both at design time when the package is being built and at run time when the package executes. Another way to look at connection managers is as the gateway through which tasks and components access resources that are external to the package. For example, whenever an Execute SQL task needs to execute a stored procedure, it must use a connection manager to connect to the target database. Whenever an FTP task needs to download data files from an FTP server, it must use a connection manager to connect to the FTP server and another connection manager to connect to the file location to which the files will be copied. In short, whenever any task or component within an SSIS package needs to access resources outside the package, it does so through a connection manager. Because of this, connection managers are available practically everywhere within the SSIS design tools in BIDS. The Connection Managers window, as shown in Figure 14-8, is included at the bottom of the control flow designer, at the bottom of the data flow designer, and at the bottom of the event handler designer.
Figure 14-8 Connection managers
The Connection Manager exception Using a connection manager for each external access is a good rule of thumb to keep in mind, but this rule has several exceptions. One example is when you use the Raw File source and destination data flow components. These are used to read from and write to an SSIS native raw file format for data staging. Each of these components has an Access Mode property through which the raw file can be identified by file name directly or by reading a file name variable, instead of using a connection manager to reference the file. Several such exceptions exist, but for the most part the rule holds true: If you want to access an external resource, you’re going to do it through a connection manager. You can see in Figure 14-9 that you have a broad variety of connection types to select from.
Figure 14-9 Connection manager types
Although the gateway aspect of connection managers may sound limiting, this design offers many advantages. First and foremost, it provides an easy mechanism for reuse. It is very common to have multiple tasks and data flow components within a package all reference the same SQL Server database. Without connection managers, each component would need to use its own copy of the connection string, making package maintenance and troubleshooting much trickier. Connection managers also simplify deployment: because anything in a package that is location-dependent (and may therefore need to be updated when deploying the package into a different environment) is managed by a connection manager, it is simple to identify what needs to be updated when moving a package from development to test to production environments. It is interesting to note that the Cache connection manager is new to SQL Server 2008. We'll cover why it was added and how you use it in later chapters.

Tip You can download additional connection managers from community sites (such as CodePlex), purchase them, or develop them by programming directly against the SSIS API.
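To show what a connection manager looks like as a package object, here is a small C# sketch that adds an OLE DB connection manager and points it at the AdventureWorks2008 sample database. The connection string, provider, and names are illustrative assumptions; in a real solution the connection string would normally come from a package configuration rather than being hard-coded.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// "OLEDB" identifies the type of connection manager to create.
ConnectionManager cm = package.Connections.Add("OLEDB");
cm.Name = "AdventureWorks2008";   // tasks and components refer to this name
cm.ConnectionString =
    "Provider=SQLNCLI10.1;Data Source=(local);" +
    "Initial Catalog=AdventureWorks2008;Integrated Security=SSPI;";

// Every Execute SQL task or OLE DB source that needs this database references
// the single "AdventureWorks2008" connection manager by name, so the
// connection string lives in exactly one place.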
Event Handlers and Error Handling If a package’s control flow is like the “main” routine in traditional procedural code, a package’s event handlers are similar to the event procedures written as part of a traditional application to respond to user interaction, such as a button-click event. Similar to the components of a control flow, an SSIS event handler is made up of tasks, containers, and precedence constraints, but while each package has only one main control flow, a package can have many event handlers. Each event handler is attached to a specific event (such as OnError or OnPreExecute) for a specific task or container or for the package itself. In Figure 14-10 you can see a list of available event handlers for SSIS. The most commonly used event handler is the first one, OnError. We will drill down in subsequent chapters, examining business scenarios related to BI projects where we use other event handlers.
Figure 14-10 SSIS event handlers
As you can see in Figure 14-11, the UI for defining and managing event handlers in an SSIS package is very similar to the tools in Visual Studio for managing event handlers in Visual Basic or C# code. The biggest difference (other than the fact that the event handler has a control flow designer surface for tasks and precedence constraints instead of code) is the Executable drop-down list. Instead of selecting from a simple list of objects, in SSIS you select from a tree of the available containers and tasks, reflecting the hierarchical nature of the executable components in an SSIS package.
Figure 14-11 Event handlers
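Because event handlers are themselves small control flows, they can also be created through the managed API. The following C# sketch is a minimal illustration: it attaches an OnError handler at package scope and drops an Execute SQL task into it, on the assumption that the handler will write a row to an audit table. The task name and the auditing idea are our assumptions.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();

// Attach an OnError event handler at package scope.
DtsEventHandler onError =
    (DtsEventHandler)package.EventHandlers.Add("OnError");

// The handler has its own executables collection, just like the control flow.
Executable logError = onError.Executables.Add("STOCK:SQLTask");
((TaskHost)logError).Name = "Log Error To Audit Table"; // hypothetical name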
In addition to defining event handlers for various conditions, including the OnError event in the control flow, SSIS provides you with an easy-to-use interface to configure the desired behavior for errors that occur in the data flow. You can specify error actions in sources, transformations, and destinations in the data flow. Figure 14-12 shows an example of configuring the error output for an OLE DB connection. Notice that you are presented with three action options at the column error and truncation level: Fail Component, Ignore Failure, or Redirect Row. If you select the last option, you must configure an error row output destination as part of the data flow.
Figure 14-12 Error output for an OLE DB data source
The easy configuration of multiple error-handling scenarios in SSIS packages is an extremely important feature for BI solutions because of the large (even huge) volumes of data that are commonly processed when loading both OLAP cubes and data mining structures.
The Integration Services runtime The SSIS runtime provides the necessary execution context and services for SSIS packages. Similar to how the .NET common language runtime (CLR) provides an execution environment for ASP.NET and Windows Forms applications—with data types, exception handling, and core platform functionality—the SSIS runtime provides the execution context for SSIS packages. When an SSIS package is executed, it is loaded into the SSIS runtime, which manages the execution of the tasks, containers, and event handlers that implement the package’s logic. The runtime handles things such as logging, debugging and breakpoints, package configurations, connections, and transaction support. For the most part, SSIS package developers don’t really need to think much about the SSIS runtime. It just works, and if you’re building your packages in BIDS, you can often take it for granted. The only time that SSIS developers generally need to interact directly with the runtime is when using the SSIS .NET API and the classes in the Microsoft.SqlServer.Dts.Runtime namespace, but understanding what’s going on under the hood is always important if you’re going to get the most out of SSIS (or any development platform).
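For example, the classes in the Microsoft.SqlServer.Dts.Runtime namespace let you load and run a package from your own .NET code instead of from DTEXEC. The short C# sketch below is a hedged illustration of that pattern; the file path is hypothetical, and real code would add event listeners and error handling.

using System;
using Microsoft.SqlServer.Dts.Runtime;

// Load a package from the file system and execute it in the SSIS runtime.
Application app = new Application();
Package package = app.LoadPackage(@"C:\SSIS\Packages\LoadSales.dtsx", null);

DTSExecResult result = package.Execute();
Console.WriteLine("Package finished with result: " + result);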
The Integration Services Data Flow engine While the SSIS runtime is responsible for managing the execution of packages’ control flow, the heart of the SSIS data flow is the data flow engine. The data flow engine is responsible for interacting with data source components to extract data from flat files, relational databases, and other sources; for managing the transformations that manipulate the data that flows through the pipeline; and for interacting with the data destination components that load data into destination databases or other locations. But even more important than this, the data flow engine is responsible for managing the memory buffers that underlie the work that the source, destination, and transformation components are doing. Note Why is the data flow engine also known as the pipeline? Early in the SSIS development cycle the term pipeline was used for what became known as the Data Flow task; this was due to its logical pipeline design. These days everyone knows it as the Data Flow task, but the name lives on in the Microsoft.SqlServer.Dts.Pipeline.Wrapper namespace, through which developers can interact with the core SSIS data flow functionality.
In the previous section we introduced the SSIS runtime, briefly explained its purpose and function, and said that most SSIS developers could take it for granted, and then we moved on. We're not going to do the same thing with the data flow engine. Although the data flow engine is also largely hidden from view, and although many SSIS developers do build packages without giving it much thought, the data flow engine is far too important to not go into more detail early on. Why is this? It is due largely to the type of work that most SSIS packages do in the real world. Take a look at the control flow shown earlier in Figure 14-4. As you can see, this package performs the following basic tasks:

■ Selects a row count from a table
■ Inserts a record into another table and captures the identity value for the new row
■ Moves millions of records from one database to another database
■ Selects a row count from a table
■ Updates a record in a table
This pattern is very common for many real-world SSIS packages: A handful of cleanup tasks perform logging, notification, and similar tasks, with a Data Flow task that performs the core ETL functionality. Where do you think the majority of the execution time—and by extension, the greatest opportunity for performance tuning and optimization—lies in this package? The vast majority of the execution time for most SSIS packages in real-world BI projects is spent in the Data Flow task, moving and transforming large volumes of data. And because of
this, the proper design of the Data Flow task generally matters much more than the rest of the package, assuming that the control flow correctly implements the logic needed to meet the package’s requirements. Therefore, it is important that every SSIS developer understands what’s going on in the data flow engine behind the Data Flow task; it’s very difficult to build a data flow that performs well for large volumes of data without a solid understanding of the underlying architecture. Let’s take a closer look at the data flow engine by looking at some of its core components, namely buffers and metadata, and at how different types of components work with these components.
Data Flow Buffers Although the Data Flow task operates as a logical pipeline for the data, the underlying implementation actually uses memory buffers to store the data being extracted from the data flow’s sources (transformed as necessary) and loaded into the data flow’s destinations. Each data source component in a data flow has a set of memory buffers created and managed by the data flow engine; these buffers are used to store the records being extracted from the source and manipulated by transformations before being loaded into the destination. Ideally, each buffer will be stored in RAM, but in the event of a low memory condition, the data flow engine will spool buffers to disk if necessary. Because of the high performance overhead of writing buffers to disk (and the resulting poor performance), avoid this whenever possible. We’ll look at ways to control data flow buffers in a later chapter as part of our coverage of performance tuning and optimization. In SQL Server 2008, the data flow pipeline has been optimized for scalability, enabling SSIS to more efficiently utilize multiple processors (specifically, more than two CPUs) available in the hardware environment where SSIS is running. This works by implementing more efficient default thread scheduling. This improved efficiency allows you to spend less time performance tuning the package (by changing buffer size settings, for example). Automatic parallelism and elimination of thread starvation and deadlocks results in faster package execution.
Data Flow Metadata The properties of the buffers described earlier depend on a number of factors, the most important of which is the metadata of the records that will pass through the data flow. The metadata is generally determined by the source query used to extract data from the source system. (We’ll see exceptions to this in the next few sections, but we’ll look at the most common cases first.) For example, consider a data flow that is based on the following query against the AdventureWorks2008 database. The data type of each column in the source table is included for reference only.
SELECT [BusinessEntityID] -- INT ,[NationalIDNumber] -- NVARCHAR (15) ,[LoginID] -- NVARCHAR (256) ,[JobTitle] -- NVARCHAR (50) ,[HireDate] -- DATETIME ,[rowguid] -- UNIQUEIDENTIFIER ,[ModifiedDate] -- DATETIME FROM [HumanResources].[Employee] WHERE [ModifiedDate] > '2008-02-27 12:00:00.000'
Each row in the data flow buffer created to hold this data will be 678 bytes wide: 4 bytes for the INT column, 8 bytes for each DATETIME column, 16 bytes for the UNIQUEIDENTIFIER, and 2 bytes for each character in each NVARCHAR column. Each column in the buffer is strongly typed so that the data flow engine knows not only the size of the column, but also what data can be validly stored within. This combination of columns, data types, and sizes defines a specific type for each buffer, not unlike how a combination of columns, data types, and sizes defines a specific table TYPE in the SQL Server 2008 database engine. The data flow designer in BIDS gives SSIS developers a simple way to view this metadata. Simply right-click any data flow path arrow and select Edit from the shortcut menu. In the Path Metadata pane of the Data Flow Path Editor you can see the names, types, and sizes of each column in the buffer underlying that part of the data flow. Figure 14-13 shows the Data Flow Path Editor for the preceding sample query.
Figure 14-13 Data flow metadata
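If you want to verify the 678-byte figure yourself, the arithmetic is straightforward. The C# fragment below simply restates the column sizes from the query, assuming the standard SQL Server storage sizes shown in the comments, 2 bytes per NVARCHAR character, and no additional per-row overhead.

int rowWidth =
      4        // BusinessEntityID  INT
    + 2 * 15   // NationalIDNumber  NVARCHAR(15)
    + 2 * 256  // LoginID           NVARCHAR(256)
    + 2 * 50   // JobTitle          NVARCHAR(50)
    + 8        // HireDate          DATETIME
    + 16       // rowguid           UNIQUEIDENTIFIER
    + 8;       // ModifiedDate      DATETIME
// rowWidth == 678 bytes per row in the data flow buffer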
It’s worth noting, however, that the metadata displayed in this dialog box is not always the complete metadata for the underlying buffer. Consider the data flow shown in Figure 14-14.
Figure 14-14 Adding a Derived Column transformation
In this example, a Derived Column transformation has been added to the data flow. This transformation adds a new column, AuditKey, to the data flow. This is a common technique for BI projects, because knowing where the data came from is often almost as important as the data itself. If we were to examine the metadata for the data flow path between the OLE DB Source component and the Derived Column transformation after making this change, it would be identical to the metadata shown in Figure 14-13. At first glance this makes sense. The AuditKey column is not added until the Derived Column transformation is reached, right? Wrong. This is where the difference between the logical pipeline with which SSIS developers interact and the physical buffers that exist under the covers becomes evident. Because the same buffer is used for the entire data flow (at least in this example—we'll see soon how this is not always the case), the additional 4 bytes are allocated in the buffer when it is created, even though the column does not exist in the source query. The designer in BIDS is intelligent enough to hide any columns in the buffer that are not in scope for the selected data flow path, but the memory is set aside in each row in the buffer. Remember that the bulk of the performance overhead in BI projects occurs in the data flow, and by extension the proper use of the buffer is very important if you want to create packages that perform well at production scale.
Variable Width Columns It’s worth noting that the SSIS data flow has no concept of variable width columns. Therefore, even though the columns in the preceding example are defined using the NVARCHAR data type in the source system and are thus inherently variable length strings, the full maximum width is always allocated in the data flow buffer regardless of the actual length of the value in these columns for a given row.
Why is this? Why does SSIS pass up what appears to be such an obvious opportunity for optimization, when the SQL Server database engine has done so for years? The answer can be found in the different problems that SQL Server and SSIS are designed to solve. The SQL Server database stores large volumes of data for long periods of time, and needs to optimize that storage to reduce the physical I/O involved with reading and writing the data. The more records that can be read or written in a single I/O operation, the better SQL Server can perform. The SSIS data flow, on the other hand, is designed not to store data but to perform transformations on the data in memory. To do this as efficiently as possible, the data flow engine needs to know exactly where to find each field in each record, and to do that it can't afford to track the length of each field in each record. Instead, by allocating the full size for each column, the SSIS data flow engine can use simple pointer arithmetic (which is incredibly fast) to address each record. In the preceding example, if the first record in the buffer is located at memory address 0x10000, the second record is located at 0x102A6, the third is located at 0x1054C, and the LoginID column of the fourth record is located at 0x10814. Because each column width is fixed, this is all perfectly predictable, and it performs like a dream. All of this strict typing in the buffer has both pros and cons. On the plus side, because SSIS knows exactly what data is stored where at all times it can transform data very quickly, yielding excellent performance for high-volume data flows. This is the driving reason that SSIS is so picky about data types: The SSIS team knew that for SSIS to be a truly enterprise-class ETL platform, it would have to perform well in any scenario. Therefore, they optimized for performance wherever possible. The negative side of the strict data typing in the SSIS data flow is the limitations that it places on the SSIS developer. Implicit type conversions are not supported; not even widening conversions can be performed implicitly. This is a common cause for complaint for developers who are new to SSIS. When you're converting from a 4-byte integer to an 8-byte integer, there is no potential for data loss, so why can't SSIS "just do it?" You can do this in C#—why not in SSIS? The answer lies in the problems that each of these tools was designed to solve. C# is a general-purpose programming language where developer productivity is a key requirement. SSIS is a specialized tool where it is not uncommon to process tens of millions of records at once. The run-time overhead involved with implicit data type conversion is trivial when dealing with a few values, but it doesn't scale particularly well, and SSIS needs to scale to handle any data volume.
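The address arithmetic in the preceding paragraph can be restated as a small C# calculation. It assumes that the buffer lays the columns out in the same order as the query, with no per-row bookkeeping bytes, so the LoginID column starts 34 bytes into each 678-byte row (4 bytes for BusinessEntityID plus 30 bytes for NationalIDNumber); the 0x10000 starting address is simply the illustrative value used above.

const int rowWidth = 678;          // fixed row width computed earlier
const int loginIdOffset = 4 + 30;  // BusinessEntityID + NationalIDNumber

long bufferStart   = 0x10000;                          // illustrative address
long secondRecord  = bufferStart + 1 * rowWidth;       // 0x102A6
long thirdRecord   = bufferStart + 2 * rowWidth;       // 0x1054C
long fourthLoginId = bufferStart + 3 * rowWidth
                   + loginIdOffset;                    // 0x10814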
How Integration Services uses Metadata Many of the most commonly asked questions about working with the SSIS data flow task arise from a lack of understanding of how SSIS uses metadata. This is especially the case for developers with a DTS background. DTS was much more flexible and forgiving around metadata, which often made it slower than SSIS but also made it easier to work with data from less-structured data sources such as Microsoft Office Excel workbooks. SSIS needs complete, consistent, stable metadata for everything in the data flow, and can often be unforgiving of changes to data sources and destinations because its metadata is no longer in sync with the external data stores. The next time you find yourself asking why a data flow that was working yesterday is giving you warnings and errors today, the first place you should generally look is at the metadata.
In the chapters ahead, we will revisit the topic of data flow metadata many times—it is the foundation of many of the most important tasks that SSIS developers perform. If it is important for .NET developers to understand the concepts of value types and reference types in the CLR, or for SQL Server developers to understand how clustered and nonclustered indexes affect data access performance, it is equally important for SSIS developers to understand the concepts of metadata in the SSIS data flow. SQL Server Books Online has additional coverage of SSIS metadata—look for the topic “Data Flow Elements.” So far we’ve been assuming that all of the buffers for a given data flow have the same metadata (and therefore are of the same type) and that the same set of buffers will be shared by all components in the data flow. Although this is sometimes the case, it is not always true. To understand the details, we need to take a look at the two different types of outputs that SSIS data flow components can have: synchronous and asynchronous.
Synchronous Data Flow Outputs A data flow component with synchronous outputs is one that outputs records in the same buffer that supplied its input data. Consider again the data flow shown in Figures 14-4 and 14-5. You can see that the Derived Column transformation is adding a new AuditKey column to the data flow. Because the Derived Column transformation has one synchronous output, the new column value is added to each row in the existing data flow buffer. The primary advantage to this approach is that it is incredibly fast. No additional memory is being allocated and most of the work being performed can be completed via pointer arithmetic, which is a very low-cost operation. Most of the workhorse transformations that are used to build the ETL components of real-world BI applications have synchronous outputs, and it is common to design data flows to use these components wherever possible because of the superior performance that they offer.
Asynchronous Data Flow Outputs The disadvantage to synchronous outputs is that some operations cannot be performed on existing records in an existing memory buffer. For these operations, the SSIS data flow also supports asynchronous outputs, which populate different data flow buffers than the buffers that provide the input records for the data flow component. For example, consider a data flow that uses an Aggregate transformation and performs the same type of operations as the SQL GROUP BY clause. This Aggregate transformation accepts records from an input buffer that contains detailed sales information, with one input record for each sales order detail line. The output buffer contains not only different columns with different data types, but also different records. Instead of producing one record per sales order detail line, this transformation produces one record per sales order. Because data flow components with asynchronous outputs populate different memory buffers, they can perform operations that components with synchronous outputs cannot, but this additional functionality comes with a price: performance. Whenever a data flow component writes records to an asynchronous output, the memory must be physically copied into the new buffer. Compared to the pointer arithmetic used for synchronous outputs, memory copying is a much costlier operation. Note You might see the terms synchronous and asynchronous used to describe data flow transformations and not their outputs. A single data flow component (and not just transformations) can have multiple outputs, and each asynchronous output is going to have its own buffers created by the data flow engine. Many publications (including parts of the SSIS product documentation) regularly refer to synchronous and asynchronous transformations because that is the most common level of abstraction used when discussing the SSIS data flow. You only need to be concerned about this level of detail when you’re building a custom data flow component through code, or when you’re really digging into the internals of a data flow for performance tuning and optimization.
Log Providers Log providers are SSIS components that, as their name implies, log information about package execution. Logging is covered in more depth in a later chapter; for now it’s sufficient to understand that each log provider type is responsible for writing this information to a different destination. SSIS includes log providers for logging to the Windows Event Log, text files, XML files, SQL Server, and more. In addition, .NET developers can create their own custom log providers by inheriting from the Microsoft.SqlServer.Dts.Runtime.LogProviderBase base class. Logging can be enabled very granularly—at the level of a specific task and interesting
events for that task—rather than globally. In this way logging is both efficient and effective. Figure 14-15 shows a list of some of the events that can be logged from the configuration dialog box.
Figure 14-15 SSIS logging events
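Like everything else described in this chapter, logging can be configured through the managed API as well as in BIDS. The C# sketch below is a hedged example that enables a text file log provider and filters it down to two events; the provider ProgID shown (including its version suffix) and the name and path of the file connection manager it writes through are assumptions you would need to verify against your own installation.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();
package.LoggingMode = DTSLoggingMode.Enabled;

// A text file log provider writes through a file connection manager.
ConnectionManager logFile = package.Connections.Add("FILE");
logFile.Name = "ErrorLogFile";                       // hypothetical name
logFile.ConnectionString = @"C:\SSIS\Logs\etl.log";  // hypothetical path

// ProgID for the text file provider; the ".2" suffix is assumed for 2008.
LogProvider provider = package.LogProviders.Add("DTS.LogProviderTextFile.2");
provider.ConfigString = logFile.Name;

// Log only the events we care about, rather than everything.
package.LoggingOptions.SelectedLogProviders.Add(provider);
package.LoggingOptions.EventFilterKind = DTSEventFilterKind.Inclusion;
package.LoggingOptions.EventFilter = new string[] { "OnError", "OnWarning" };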
You’ve probably noticed a trend by now: Practically any type of component that makes up the SSIS architecture can be developed by .NET developers using the SSIS managed API. This is one of the core aspects of SSIS that makes it a true development platform—in addition to all of the functionality that is provided out of the box, if you need something that is not included, you can build it yourself. Although SSIS development deserves a complete book of its own, we will take a look at it in greater depth in Chapter 19.
Deploying Integration Services Packages SSIS provides many different features and options related to package deployment—so many, in fact, that Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration Services,” is devoted almost entirely to deployment. One of the most important things to keep in mind is that a smooth deployment begins not when the package development is complete, but when package development is beginning, through the proper use of package configurations.
Package Configurations Package configurations are an SSIS feature that allows you to store the values for properties of package tasks outside of the package itself. Although the details are different, configurations are conceptually very similar to using a web.config file in an ASP.NET application: The values that need to change as the application moves from development to test to production (usually database connection strings and file/folder paths) are stored externally to the package code so that they can be updated separately without requiring the application logic to be modified. Package configurations are implemented by the SSIS runtime, which reads the configuration values when the package is being loaded for execution and applies the values from the configuration to the task properties that have been configured. SSIS lets you select from a variety of different configuration types, including XML files, SQL Server databases, the Windows registry, and Windows environment variables. SSIS also supports indirect configurations, where you store the path to an XML file (or the registry path or SQL Server connection string) in a Windows environment variable and then reference the environment variable from the SSIS package being configured. The primary goal of package configurations is to make the packages location-independent. To have a smooth deployment, any reference that an SSIS package makes to any external resource should be included in a package configuration. That way, when either the package or the external resource is moved, only the configuration information must be updated; the package need not be changed.
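As a brief illustration of the mechanics, the following C# sketch enables configurations on a package and adds a single XML configuration file entry. The file path and configuration name are hypothetical, and in practice you would usually set this up through the Package Configurations dialog box in BIDS rather than in code.

using Microsoft.SqlServer.Dts.Runtime;

Package package = new Package();
package.EnableConfigurations = true;

// Add an XML configuration file entry to the package.
Configuration config = package.Configurations.Add();
config.Name = "EnvironmentSettings";                            // hypothetical name
config.ConfigurationType = DTSConfigurationType.ConfigFile;
config.ConfigurationString = @"C:\SSIS\Config\Prod.dtsConfig";  // hypothetical path

// At load time, the SSIS runtime reads Prod.dtsConfig and applies the
// property values it contains (connection strings, file paths, and so on).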
Package Deployment Options Just as SSIS provides a variety of location options for storing package configuration information, it also provides a variety of location options for package deployment. SSIS packages can be deployed either to the file system or to a SQL Server instance. As mentioned earlier, SSIS includes utilities to simplify deployment, including DTEXEC.exe, DTEXECUI.exe, and DTUTIL.exe. If packages are deployed to a SQL Server instance, the packages are stored in the dbo.sysssispackages table in the msdb system database. This allows packages to be backed up and restored with the system databases, which is often desirable if the database administrators responsible for maintaining the SQL Server instances used by the SSIS application are also responsible for maintaining the SSIS application itself. If packages are deployed to the file system, they can be deployed to any location as DTSX files, or they can be deployed to a default folder that the SSIS service monitors. (This folder is called the SSIS Package Store and is located by default at C:\Program Files\Microsoft SQL Server\100\DTS\Packages.) If packages are deployed to this default folder, the packages can be monitored using SSMS, as described in the previous section on SSIS tools and utilities.
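The Application class in the managed API exposes the same deployment targets that DTUTIL does. The C# sketch below saves a package both to the msdb database of a local server and to the SSIS Package Store on the file system; the server name, folder, and package names are placeholders, and Windows authentication is assumed.

using Microsoft.SqlServer.Dts.Runtime;

Application app = new Application();
Package package = app.LoadPackage(@"C:\SSIS\Packages\LoadSales.dtsx", null);

// Deploy to the msdb database of a SQL Server instance (Windows authentication).
app.SaveToSqlServer(package, null, "localhost", null, null);

// Deploy to the SSIS Package Store (the folder monitored by the SSIS service).
app.SaveToDtsServer(package, null, @"File System\LoadSales", "localhost");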
Keep in mind that although you have multiple deployment options, no single option is best. The deployment options that work best for one project may not be best for another—it all depends on the context of each specific project.
Summary

In this chapter we took an architectural tour of the SSIS platform. We looked at packages and the SSIS runtime, which is responsible for package execution. We examined control flow and data flow—the two primary functional areas within each package—and the components that SSIS developers use to implement the specific logic needed in their ETL projects. We looked at the data flow pipeline that delivers the core high-performance ETL functionality that makes SSIS ideally suited to enterprise-scale data-warehousing projects. We looked at the tools and utilities included with the SSIS platform for developing, deploying, and managing SSIS applications, and we looked at the .NET object model that enables developers to extend and automate practically any aspect of SSIS. And we've just scratched the surface of what SSIS can do. We are sometimes asked whether SSIS is required for all BI projects. The technical answer is no. However, we have used these built-in ETL tools for every production BI solution that we've designed, developed, and deployed—ever since the first release of SSIS in SQL Server 2005. Although you could make the case that you could accomplish BI ETL by simply writing data access code in a particular .NET language or by writing database queries against RDBMS source systems, we strongly believe in using SSIS as the primary ETL workhorse for BI solutions because of its graphical designer surfaces, self-documenting visual output, sophisticated error handling, and programmatic flexibility. Another consideration is the set of features that you'll need for your solution: be aware of the significant feature differences in SSIS between the Enterprise and Standard editions, and note that the only ETL tool available in the Workgroup and Express editions is the Import/Export Wizard; no SSIS features are included in those editions. In the chapters ahead we'll look at each of the topics introduced in this chapter in greater depth, focusing both on the capabilities of the SSIS platform and on best-practice techniques for using it to develop and deliver real-world ETL solutions. In the next chapter we'll focus on using BIDS to create SSIS packages, so get ready to get your hands dirty!
Chapter 15
Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio

In Chapter 14, "Architectural Components of Microsoft SQL Server 2008 Integration Services," we looked at the major components that make up the Microsoft SQL Server 2008 Integration Services (SSIS) platform, including packages, control flow, data flow, and more. Now we're going to work with the primary development tool used to create Integration Services packages: Business Intelligence Development Studio (BIDS). As an Integration Services developer, you'll spend the majority of your project time working in BIDS, so you need to understand how to use the tools that BIDS provides to work with packages. In this chapter, we examine the SSIS tools that BIDS adds to Microsoft Visual Studio, and how each one is used when developing SSIS packages. We'll also look at the workhorse components that SSIS developers use when building a business intelligence (BI) application. We won't include a laundry list of tasks and transformations—SQL Server Books Online does an excellent job of this. Instead, we'll focus on the most commonly used components and how they can best be used when developing real-world solutions with SSIS.
Integration Services in Visual Studio 2008 As you know by now, BIDS is the Visual Studio 2008 integrated development environment, which includes new project types, tools, and windows to enable SQL Server BI development. BIDS provides the core tools that SSIS developers use to develop the packages that make up the extract, transform, and load (ETL) component of a BI application. But BIDS also represents something more—it represents a major shift from how packages were developed using Data Transformation Services (DTS) in the days of SQL Server 2000 and SQL Server 7.0. One of the biggest differences between DTS and SSIS is the tools that package developers use to build their ETL processes. In DTS, developers worked in SQL Server Enterprise Manager to build DTS packages. Enterprise Manager is primarily a database administrator (DBA) tool, so DBAs who needed to build DTS packages generally felt right at home. And because it was often DBAs who were building the DTS packages, this was a good thing.
But one major drawback of using Enterprise Manager for DTS package development is that Enterprise Manager isn’t really a development tool—it’s a management tool. Because of this, it was difficult to treat DTS packages as first-class members of an application. There was no built-in support for source control, versioning, or deployment. Few of the tools and procedures that were used to store, version, and deploy the other parts of the BI application (such as source code, reports, SQL scripts, and so on) could be easily used with DTS packages, and this was due in large part to the reliance of DTS on Enterprise Manager. With SSIS, package development takes place in Visual Studio. This means that it’s now much easier to integrate SSIS into the overall software development life cycle, because you can now use the same tools and processes for your SSIS packages that you use for other project artifacts. This is also a good thing, but it too has its drawbacks. One drawback is that many DBAs who are familiar with Enterprise Manager or SSMS might not be as familiar with Visual Studio. So to take advantage of the capabilities of the SSIS platform, these DBAs need to also learn a new—and often complex—set of tools. If you’re a software developer who has spent a lot of time building software in Visual Studio, the information in the next few sections might seem familiar to you, but please keep reading. Although we’re going to talk about some familiar concepts such as solutions and projects and some familiar tools such as Solution Explorer, we’re going to focus as much as possible on the details that are specific to SSIS. The remainder of this section is an overview of how Visual Studio is used to develop SSIS packages, including a tour of the project template and the various windows and menus and tools in Visual Studio that are unique to SSIS.
Creating New SSIS Projects with the Integration Services Project Template The first step when building an ETL solution with SSIS is to create a new SSIS project in BIDS. From the main menu in BIDS, select File, New, and then Project. You’ll be presented with the New Project dialog box shown in Figure 15-1. This screen shot represents BIDS without an existing full Visual Studio 2008 installation. When you are working only with BIDS, the only project types available are Business Intelligence Projects and Other Project Types. If you have a full installation of Visual Studio 2008, you would also have C# and Visual Basic .NET project types. In the New Project dialog box, select Business Intelligence Projects from the Project Types section, and then select Integration Services Project from the Templates section. Enter a name and location for your project and then click OK to create the SSIS project and a solution to contain it.
Figure 15-1 BIDS project templates
After you complete the New Project dialog box and click OK, you’ll see the contents of the project created from the SSIS project template. Figure 15-2 shows a new SSIS project in Visual Studio—this is the end product of the project template selected in the New Project dialog box.
Figure 15-2 SSIS project template in Visual Studio
In Visual Studio, a project is a collection of files and settings—for an SSIS project, these are generally .dtsx files for the packages in the project and settings related to debugging and deployment. A solution in Visual Studio is a collection of projects and settings. When you
create a BI project using the SSIS template in BIDS, you can add items that are related to SSIS development to your solution. These items include new data sources, data source views (DSVs), regular SSIS packages, or SSIS packages that are created after running the SQL Server Import And Export Wizard.
Viewing an SSIS Project in Solution Explorer When you’re working in Visual Studio, you always have a solution that contains one or more projects, and Visual Studio shows this structure in the Solution Explorer window. Figure 15-3 shows the Solution Explorer window for a solution containing a new SSIS project.
Figure 15-3 Solution Explorer, showing a new SSIS project
Tip If you look in Solution Explorer and see only your project with no solution displayed, don’t worry. By default, Visual Studio displays the solution node in the Solution Explorer window only if there is more than one project in the solution. If you want to display the solution node regardless of the number of projects in your solution, you can open the Options dialog box from the Tools menu in Visual Studio. Select the Projects And Solutions page in the list on the left and choose the Always Show Solution check box. Within the Solution Explorer window, each SSIS project contains the same four folders: Data Sources, Data Source Views, SSIS Packages, and Miscellaneous. Although data sources and data source views are vital components in SSAS projects, they’re used much less often when working with SSIS. Data sources allow connection properties to be shared between packages in an SSIS project, but they’re strictly a Visual Studio tool and are unrelated to the SSIS runtime. Because of this, any packages that rely on data sources to set connection properties during development need to be updated to use another mechanism (such as package configurations) during deployment. Tip Because of the restrictions just defined, we usually don’t use project-level data sources. We prefer to use package-specific data sources that are defined in package configurations.
The SSIS Packages folder contains all the packages in the project; to add a new package, right-click on the folder and select New SSIS Package from the shortcut menu. You can also
Chapter 15
Creating Microsoft SQL Server 2008 Integration Services Packages
467
add an existing package to the project, import a DTS package, and more, all from the same shortcut menu. The Miscellaneous folder is used to store any other files that are related to the packages in the project. Often these are XML configuration files, sample data for use during development, or documentation related to the project, but any type of file can be added by right-clicking on the project (oddly enough, not by clicking on the Miscellaneous folder) and selecting Add Existing Item from the shortcut menu.
Data Sources and Data Source Views Data sources and data source views in SSIS function similarly to the way they work in SSAS—that is, a data source creates a global connection, and a data source view allows you to create abstractions (similar to views) using the source data. As in SQL Server 2008 Analysis Services (SSAS), in SSIS DSVs are most useful when you do not have permission to create abstractions directly in the source data. It’s important that you understand that changes in the underlying metadata of the source will not be automatically reflected in the DSV—it must be manually refreshed. DSVs are not deployed when SSIS packages are deployed. Also, DSVs can be used only from the Sources, Lookups, or Destination components. Although there is some usefulness to both data source and DSV objects, we generally prefer to use package configurations (which will be covered in Chapter 18, “Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services”) because we find configurations to provide us with a level of flexibility that makes package maintenance and deployment simpler.
If you have experience building .NET Framework applications in Visual Studio, you’ll probably notice that SSIS projects do not have all the same capabilities as C# or Visual Basic projects. For example, files are not sorted alphabetically—instead, your packages are sorted in the order in which they are added to the project. You can re-sort them by manually editing the .dtproj file for the project, but there is no support for this built into Visual Studio itself. You can, however, use the BIDS Helper tool available from CodePlex to add package sorting functionality to your SSIS projects. Also, you cannot add your own folders and subfolders to an SSIS project; the four folders described earlier and shown in Figure 15-3 are all you get. Still, despite these differences, working with SSIS in Visual Studio should be a productive and familiar experience all around.
Using the SSIS Package Designers

Once you've had a chance to look at the resources available in the Solution Explorer window, the next logical step is to move on to the package designers themselves. The three primary designers (control flow, data flow, and event handler) and a viewer (Package Explorer) for SSIS packages are accessed on the tabs shown in Figure 15-4.
Figure 15-4 Tabs for the SSIS package designers
Each tab opens a different designer:

■ Control flow designer  The Control Flow tab opens the control flow designer, which presents a graphical designer surface where you will build the execution logic of the package using tasks, containers, and precedence constraints.

■ Data flow designer  The Data Flow tab opens the data flow designer, which presents a graphical designer surface for each data flow task in your package. This is where you will build the ETL logic of the package using source components, destination components, and transformation components.

■ Event handler designer  The Event Handlers tab opens the event handler designer, which presents a graphical designer surface where you can build custom control flows to be executed when events are fired for tasks and containers within the package, or for the package itself.
Note Each of the three designers includes a Connection Managers window. As you saw in Chapter 14, connection managers are package components that are shared between all parts of the package that need to use the external resources managed by the connection managers. This is represented inside the Visual Studio designers by the inclusion of the Connection Managers window on every designer surface where you might need them. Unlike the Control Flow, Data Flow, and Event Handlers tabs, the Package Explorer tab doesn’t open a graphical designer surface. Instead, it gives you a tree view of the various components that make up the package, including tasks, containers, precedence constraints, connection managers, data flow components, and variables, as shown in Figure 15-5. Unlike the package designer tabs, the Package Explorer tab is not used to build packages. Instead, it presents a single view where the entire structure of the package can be seen in one place. This might not seem important right now, but SSIS packages can become pretty complex, and their hierarchical nature can make them difficult to explore through the other windows. For example, tasks can be inside containers, containers can be inside other containers, and there can be many Data Flow tasks within a package, each of which can contain many components. Also, the event handler designer does not provide any way to see in one place what event handlers are defined for what objects within the package, which is a limitation that can easily hide core package functionality from developers who are not familiar with the package’s design. The Package Explorer gives SSIS developers a single place to look to see everything within the package, displayed in a single tree that represents the package’s hierarchy.
Figure 15-5 Package Explorer tab in the SSIS designer
Although you cannot build a package by using the Package Explorer, it can still be a valuable jumping off point for editing an existing package. Many objects, such as tasks, precedence constraints, and connection managers can be edited from directly within the Package Explorer tab—just double-click on the object in the package explorer tree and the editor dialog box for that component will appear. This holds true for Data Flow tasks as well; if you double-click on a Data Flow task icon within the package explorer tree, the designer surface for that Data Flow task will open.
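If you need the same kind of consolidated view outside of BIDS, the SSIS runtime API exposes the package hierarchy programmatically. The following C# sketch is our own illustration (not part of the product samples, and the package path is hypothetical); it walks a loaded package much the way the Package Explorer tree does and assumes a reference to the Microsoft.SqlServer.ManagedDTS assembly.

using System;
using Microsoft.SqlServer.Dts.Runtime;

class PackageWalker
{
    // Print every task and container, indenting to show nesting,
    // much like the tree on the Package Explorer tab.
    static void PrintExecutables(IDTSSequence parent, int indent)
    {
        foreach (Executable executable in parent.Executables)
        {
            DtsContainer container = (DtsContainer)executable;
            Console.WriteLine("{0}{1}", new string(' ', indent * 2), container.Name);

            // Sequence, For Loop, and Foreach Loop containers (and the package
            // itself) expose child executables; recurse into them.
            IDTSSequence childSequence = executable as IDTSSequence;
            if (childSequence != null)
            {
                PrintExecutables(childSequence, indent + 1);
            }
        }
    }

    static void Main()
    {
        Application application = new Application();
        Package package = application.LoadPackage(@"C:\SSIS\UsingExecuteProcess.dtsx", null);
        PrintExecutables(package, 0);   // Package itself implements IDTSSequence
    }
}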
Working with the SSIS Toolbox

When you're working with the control flow, data flow, or event handler designers, Visual Studio displays a Toolbox window that contains the components that you use to build your package. The components displayed in the Toolbox are context sensitive: when you're building a data flow, it contains source, transformation, and destination components; when you're building a control flow or event handler, the Toolbox contains tasks and containers, as shown in Figure 15-6. The Toolbox works the same way for SSIS projects as it does for other Visual Studio projects with graphical designers (such as Windows Forms and Web Forms projects)—you simply drag the components to the designer surface to implement the logic your package requires.
Figure 15-6 Control Flow Toolbox
Notice that when you drag items (both tasks and components) to the designer surface, you receive immediate feedback about the status of the item as soon as you drop it onto the designer surface. If a task or component requires additional configuration information to execute successfully, either a red icon with a white X or a yellow triangle with a black exclamation mark appears on the task or component (rectangle) itself. If you pass your mouse over the task or component in the design window, a tooltip appears with more information about what you must do to fix the error. An example is shown in Figure 15-7.
Figure 15-7 Design-time component error
It's also possible that the design environment will display a pop-up dialog box with an error message when you attempt to open, configure, or validate the default configuration of a task. It's important to understand that the design environment performs extensive design-time validation. This is done to minimize run-time package execution errors.
Maintenance Plan Tasks

If you refer to Figure 15-6, you'll notice that there is a second group of tasks at the bottom, labeled Maintenance Plan Tasks. Database maintenance plans in SQL Server 2008 are implemented as SSIS packages, and there are specialized tasks included in SSIS to support them, such as the Shrink Database task, the Back Up Database task, and the Update Statistics task. Although SSIS developers can include any of these tasks in their packages, they generally do not apply to business intelligence projects, and as such we will not be covering them in this book.
After you create a package in BIDS, you can execute it by pressing F5 or selecting Start Debugging from the Debug menu. When you run an SSIS package, an additional tab (the Progress tab) becomes available for the particular SSIS package that you’re working with. It shows you detailed execution information for each task included in the package. Also, the control flow designer surface colors the tasks (rectangles) to indicate execution progress and status. Green indicates success, yellow indicates in progress, and red indicates failure. Shown in Figure 15-8 is a sample Progress tab.
Figure 15-8 SSIS includes a package execution Progress tab.
You can also set breakpoints on tasks and use Visual Studio debugging techniques to halt and examine execution at particular points of package execution. These include viewing the value of executing variables in the BIDS debugging windows—that is, locals, watch, and so
on. We’ll cover common debugging scenarios related to SSIS packages used in BI scenarios in Chapter 16, “Advanced Features in Microsoft SQL Server 2008 Integration Services.”
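Pressing F5 is the usual way to run a package during development, but the same execution (and the validation that precedes it) can also be driven from code through the SSIS runtime API. The following C# sketch is our own illustration, not part of the product samples; the package path is hypothetical, and it assumes a reference to the Microsoft.SqlServer.ManagedDTS assembly.

using System;
using Microsoft.SqlServer.Dts.Runtime;

class RunPackage
{
    static void Main()
    {
        Application application = new Application();
        Package package = application.LoadPackage(@"C:\SSIS\UsingExecuteProcess.dtsx", null);

        // The same validation BIDS performs at design time can be requested explicitly.
        DTSExecResult validation = package.Validate(null, null, null, null);
        if (validation == DTSExecResult.Success)
        {
            // Execute() returns a DTSExecResult value such as Success or Failure.
            DTSExecResult result = package.Execute();
            Console.WriteLine("Package finished with result: {0}", result);
        }
        else
        {
            Console.WriteLine("Package failed validation.");
        }
    }
}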
Choosing from the SSIS Menu

One of the tools that the SSIS project template in BIDS adds to Visual Studio is an SSIS menu, which is added to the main Visual Studio menu bar, as shown in Figure 15-9.
Figure 15-9 SSIS menu
The SSIS menu is a centralized way to access many of the SSIS-specific tools and windows. As with any Microsoft product, there are many ways to access the SSIS-specific functionality available in BIDS. We most often use the technique of right-clicking on the designer surface and Solution Explorer items. Doing this opens shortcut menus that contain subsets of menu options. As we work our way through the SSIS functionality, we’ll cover all the items listed on the SSIS menu. At this point, we’ve looked at the major components—windows, menus, and designers—that SSIS adds to Visual Studio. In the next two sections, we take a closer look at the control flow and data flow designers. In the section after that, we start drilling down into the details of using these tools to build real-world SSIS packages. Note One of the goals of this book is to move beyond the information that is available in SQL Server Books Online and to give you more information, more depth, and more of what you need to succeed with real-world SQL Server BI projects. Because of this, we’re not going to try to reproduce or replace the excellent SSIS tutorial that is included with SQL Server. Books Online includes a detailed five-lesson tutorial that has step-by-step instructions for building and then extending a simple but real-world SSIS package. Search SQL Server Books Online for the topic, “Tutorial: Creating a Simple ETL Package,” and try it out now.
Connection Managers

We introduced connection managers in Chapter 14 as gateways through which SSIS packages communicate with the outside world. As such, connection managers are a mechanism for location independence for the package. In this section, we're going to revisit connection managers, changing the focus from the SSIS architecture to using connection managers within Visual Studio.

Adding connection managers to your package is a straightforward process—simply right-click on the Connection Managers window at the bottom of the designer surface and then select one of the New options from the shortcut menu. Also, most of the tasks and data flow components that use connection managers include a New Connection option in the drop-down list of connection managers, so if you forget to create the connection manager before you create the component that needs to use it, you can create it on the fly without needing to interrupt your task.
Standard Database Connection Managers

SSIS includes three primary connection managers for connecting to database data: the ADO.NET connection manager, the OLE DB connection manager, and the ODBC connection manager. When you're choosing among them, the primary factor is usually what drivers (also called providers) are available. These connection managers all rely on existing drivers. For example, to connect to an Oracle database through an OLE DB connection manager, you must first install the Oracle client connectivity components. SSIS doesn't reinvent the existing client software stack used to connect to databases. The OLE DB connection manager can be used to connect to file-based databases as well as server-based databases such as SQL Server and Oracle.

Note Other connection managers are available for SSIS as a download for Enterprise edition customers after SQL Server 2008 RTMs. Additional information can be found at http://ssis.wik.is/SSIS_2008_Connectivity_Options. If you're looking for more details on SSIS connectivity options, the SSIS product group created a Connectivity Wiki, where members of the product group, along with third-party partners and vendors, post information related to SSIS connectivity. You can find the wiki at http://ssis.wik.is. CodePlex also contains a sample that shows you how to programmatically create a custom connection manager at http://www.codeplex.com/MSFTISProdSamples.
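Connection managers can also be created programmatically through the SSIS runtime API, which is handy for utilities that generate or patch packages. The following C# fragment is a minimal sketch of our own (the server and database names are hypothetical); it assumes a reference to the Microsoft.SqlServer.ManagedDTS assembly.

using Microsoft.SqlServer.Dts.Runtime;

class AddConnectionManager
{
    static void Main()
    {
        Package package = new Package();

        // "OLEDB" is the creation name for the OLE DB connection manager type.
        ConnectionManager oleDbConnection = package.Connections.Add("OLEDB");
        oleDbConnection.Name = "AdventureWorksLT";   // hypothetical name
        oleDbConnection.ConnectionString =
            "Provider=SQLNCLI10;Data Source=(local);" +
            "Initial Catalog=AdventureWorksLT;Integrated Security=SSPI;";
    }
}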
Other Types of Connection Managers

One consideration related to the Raw File connection manager is that it requires its input to be in a specific format. The regular file connection managers (File and Flat File) are straightforward. Also, the SSAS connection manager is simple to use—just provide a valid connection string to a particular SSAS instance. To see a complete list of available connection managers, right-click the Connection Managers window (at the bottom of the control flow, data flow, or event handler designer surface) and then click New Connection.

New to SQL Server 2008 is the Cache connection manager. Used with the Cache Transform, this connection manager is used to configure the behavior of the Lookup transformation's cache—in particular, it can be used to improve the performance of lookups in the data flow.
Control Flow

In this section, we take a closer look at some of the core tasks used in BI ETL scenarios. As we do this, we'll describe different capabilities in the platform. We won't cover the details of every task in depth, because some tasks are virtually self-documenting and we want to focus on building SSIS packages that facilitate BI ETL. So we'll work to build a foundation in particular task capabilities in this section.

To get started, we'll work with a package that is part of the samples that are available for download from www.codeplex.com/MSFTISProdSamples. When you unzip the file, it creates a folder named MSFTISProdSamples-15927. Navigate to the folder named Katmai_August2008_RTM\Package Samples\ExecuteProcess Sample. The solution is named ExecuteProcessPackage Sample, and it contains a package called UsingExecuteProcess.dtsx. To see this sample, download and install the SQL2008.Integration_Services.Samples.xNN.msi file from CodePlex. (xNN represents the type of operating system you're using—x64 or x86.) By default, the samples will be installed to C:\Program Files\Microsoft SQL Server\100\Samples\. Open the package in BIDS by clicking on the ExecuteProcess.sln file located in the Integration Services\Package Samples\ExecuteProcess Sample folder in your installation folder. Then double-click on the UsingExecuteProcess.dtsx file in Solution Explorer. After you do that, the control flow will display the configured tasks for this package and the Toolbox window will list the control flow tasks, as shown in Figure 15-10.

Note As of the writing of this book, the sample package was written for SSIS 2005, so when you open it, the wizard that upgrades the package to SSIS 2008 starts. We simply upgraded the sample package using the default wizard settings for our example.
Figure 15-10 Control flow designer surface and Connection Managers window
Note that at the bottom of the control flow designer surface, the Connection Managers window shows that this package has four defined connection managers. You'll recall from the previous chapter that each represents a connection to some sort of data repository. We'll look at connection manager properties in more detail in a later section of this chapter as well.

Our example includes flags on the Execute Process task and two of the four connection managers. These flags are indicated by small pink triangles in the upper left of the graphical shapes representing these items. These flags are part of the functionality that the free BIDS Helper tool adds to SSIS packages. In this example, the indicator shows that the tasks or connections have expressions associated with them. We introduced the idea of SSIS expressions in Chapter 14. You'll recall that they are formulas whose resultant values are part of a component's configuration. We'll take a closer look at SSIS expressions in the last section ("Expressions") of this chapter.

Note What is BIDS Helper? Although BIDS Helper is not an official part of SSIS, you can download it from CodePlex. This powerful, free utility adds functionality and usability to BIDS by adding many useful features to both SSAS and SSIS. You can get it at http://www.codeplex.com/bidshelper.
Control Flow Tasks

The sample package shown in Figure 15-10 contains five control flow tasks: three Execute SQL tasks, an Execute Process task, and a Data Flow task. It also contains a Foreach Loop container, and the loop container contains a Script task. These three task types, together with the Execute Package task and the Script task, are the five key control flow tasks that we most frequently use in SSIS packages that perform ETL for BI solutions. As we examine the commonly used tasks, we'll also discuss the new Data Profiling task in detail.

In this particular sample, the task items use the default task names and there are no annotations. This is not a best practice! As we showed in the previous chapter, one of the reasons to use SSIS, rather than manual scripts or code, is the possibility of augmenting the designer surface with intelligent names and annotations—you can think of this as similar to the discipline of commenting code.

So what does this package do exactly? Because of the lack of visible documentation, we'll open each of the tasks and examine the configurable properties. We'll start with the first Execute SQL task (labeled Execute SQL Task 2). Although you can configure some of the task's properties by right-clicking and then working with the Properties window, you'll probably prefer to see the more complete view. To take a more detailed look, right-click the task and then click Edit. The dialog box shown in Figure 15-11 opens.
Figure 15-11 Execute SQL Task Editor dialog box
This dialog box contains four property pages to configure: General, Parameter Mapping, Result Set, and Expressions. Most of the properties are self-documenting—note the brief description shown on the bottom right of the dialog box. In the General section, you configure the connection information for the computer running SQL Server where you want to execute the SQL statement. The query can originate from a file, direct input, or a variable. You can also build a query using the included visual query builder. If your query uses parameters, and most will, you’ll use the Parameter Mapping page to configure the parameters. The next task in the control flow is another Execute SQL task. If you examine that task’s properties, you’ll see that it executes a stored procedure called CreateProcessExecuteDest. Next you’ll see that there is a small fx icon next to the green precedence constraint arrow immediately below this task. This icon indicates that the constraint includes an expression. We’ll take a closer look at SSIS expressions in the last section of this chapter. Next is an Execute Process task. This task allows you to add an executable process to your control flow. Similar to the Execute SQL task, right-clicking the task and selecting Edit opens a multipage dialog box where you can view and configure the available properties, as seen in Figure 15-12. Note that this task causes an executable file named expand.exe to run. Also, note that there’s a small pink triangle in the upper left corner of the task box. This indicates that an SSIS expression is associated with the execution of this task. To see this expression, click on the Expressions page of the Execute Process Task Editor dialog box. As mentioned, the BIDS Helper utility adds this indicator—SSIS does not include visual indicators on tasks for the use of expressions.
Figure 15-12 Execute Process Task Editor dialog box
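The same Execute SQL task settings that the editor dialog box exposes, such as the connection and the SQL statement, can also be set in code when packages are generated programmatically. The C# fragment below is a sketch of our own (the task name, connection manager name, and query are hypothetical); it assumes a package that already contains an OLE DB connection manager named AdventureWorksLT.

using Microsoft.SqlServer.Dts.Runtime;

class ExecuteSqlTaskBuilder
{
    // Adds an Execute SQL task to an existing package that already contains
    // an OLE DB connection manager named "AdventureWorksLT".
    static void AddQueryTask(Package package)
    {
        // "STOCK:SQLTask" is the creation name for the Execute SQL task.
        Executable executable = package.Executables.Add("STOCK:SQLTask");
        TaskHost taskHost = (TaskHost)executable;
        taskHost.Name = "Load Customer Staging";   // hypothetical name

        // These property names correspond to what the Execute SQL Task Editor exposes.
        taskHost.Properties["Connection"].SetValue(taskHost, "AdventureWorksLT");
        taskHost.Properties["SqlStatementSource"].SetValue(
            taskHost, "SELECT CustomerID, CompanyName FROM SalesLT.Customer;");
    }
}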
So far drilling down into our three tasks has followed a very similar path; however, after you click Edit on the fourth task, the Data Flow task, something entirely new happens. You’re taken to a different interface in SSIS—the data flow designer. Why is this?
As mentioned in the previous chapter, starting with SSIS in SQL Server 2005, Microsoft added a separate designer for developing data flows. Although it might seem confusing at first, this will make perfect sense as you work with SSIS more and more. The reason is that the majority of the work is often performed by this one task. In DTS in SQL Server 2000, the data flow was not separated, and this lack of separation resulted in packages that were much more difficult to understand, debug, and maintain. We'll take a closer look at the data flow designer shortly; we just wanted to point out this anomaly early in our discussion because we've seen it confuse new SSIS developers.

We mentioned that we frequently use the Execute Package task and the Script task in BI scenarios. As you might have guessed by now, the Execute Package task allows you to trigger the execution of a child package from a parent package. Why this type of design is advantageous will become more obvious after we discuss precedence constraints in the next section. Before we do that, we'd like to mention a significant enhancement to a frequently used task—the Script task. New to SQL Server 2008 is the ability to write SSIS scripts in either C# or Visual Basic .NET, using Visual Studio Tools for Applications (VSTA). In previous versions, SSIS scripts were limited to Visual Basic .NET and used Visual Studio for Applications (VSA). We devote Chapter 19, "Extending and Integrating SQL Server 2008 Integration Services," to scripting and extending SSIS, and we'll have more complete coverage and examples of using this task there.

Integration Services contains many other useful tasks besides the key ones just covered. One we're particularly excited about is the Data Profiling task, which was developed by Microsoft Research and is newly available in SQL Server 2008. For BI scenarios, we also use the Bulk Insert, Analysis Services Execute DDL, Analysis Services Processing, and Data Mining Query tasks. We'll look at all these tasks in the next chapter as we dive into specific scenarios related to BI, such as ETL for dimension and fact table loads for OLAP cubes and data preparation for loading data mining structures and models. Before we do that, though, let's continue our tour of the basics of SSIS packages by looking at other types of containers available on the control flow designer surface.
Control Flow Containers

In addition to Task Host containers, which is the formal name for the objects (represented as rectangles) that host individual tasks, the SSIS control flow contains additional container types. You work with Task Host containers directly only when you're developing custom tasks. So, unless you're doing that, they'll be invisible to you. These additional container types are the Sequence, Foreach Loop, and For Loop containers. To group tasks together in sequence, you first drag a Sequence container onto the control flow designer surface; next you drag your selected tasks onto the designer surface, dropping them inside the container. The mechanics of the Foreach Loop and For Loop containers are identical to those of the Sequence container.
Also note that there are many configurable properties associated with a Sequence container. These properties relate to the execution of the contained tasks as a group. They allow you to create transactions and set specific failure points.

In addition to the Sequence, Foreach Loop, and For Loop containers, SSIS contains a generic group container. You implement the group container by selecting the tasks that you want to group together and then selecting Group from the shortcut menu. Figure 15-13 shows a sample of tasks grouped together. You can nest grouping containers inside one another as well. Unlike the Sequence container described earlier, the group container is a design convenience only. After using it, you should rename the group container to a meaningful name and possibly annotate the contents. You can collapse any type of grouping container so that the designer surface is more easily viewable.

Another way to view large packages is to use the four-headed arrow that appears on the designer surface if the package contents are larger than the available designer surface. If you click that arrow, a small pop-up window appears. From that window, you can quickly navigate to the portion of the package you want to work with. An example of the pop-up window is shown in Figure 15-13.
Figure 15-13 SSIS package navigational window
In addition to using grouping containers to make a package easier to understand by collapsing sections of tasks, or to sequence (or loop) through tasks, another reason to group tasks is to establish precedence constraints for multiple tasks. To cover that topic, we’ll next drill into your options for configuring precedence constraints. Before we do that, however, we’d be remiss if we didn’t also mention a couple of other reasons to use containers. These include being able to manage properties for multiple tasks at once, being able to limit variable scope to tasks that are part of a container, and being able to quickly and easily disable portions of a package. Disabling a portion of an SSIS package is done by right-clicking on the object (task or container) on the control flow or event handler designer surface and then clicking the Disable option. Any disabled objects are grayed out on the designer surface.
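To make the container mechanics concrete, the following C# sketch (our own illustration, with hypothetical names) uses the SSIS runtime API to add a Sequence container, place two Execute SQL tasks inside it, and then disable the whole group, mirroring the designer actions described above.

using Microsoft.SqlServer.Dts.Runtime;

class ContainerExample
{
    static void Main()
    {
        Package package = new Package();

        // "STOCK:SEQUENCE" is the creation name for the Sequence container.
        Sequence sequence = (Sequence)package.Executables.Add("STOCK:SEQUENCE");
        sequence.Name = "Load Staging Tables";   // hypothetical name

        // Tasks added to the container's Executables collection run as a group.
        Executable firstTask = sequence.Executables.Add("STOCK:SQLTask");
        Executable secondTask = sequence.Executables.Add("STOCK:SQLTask");
        ((TaskHost)firstTask).Name = "Truncate Staging";
        ((TaskHost)secondTask).Name = "Load Staging";

        // Disabling the container disables everything inside it, just like the
        // Disable option on the designer's shortcut menu.
        sequence.Disable = true;
    }
}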
Precedence Constraints

As mentioned in Chapter 14, on the control flow designer surface you have three principal options for configuring precedence between tasks: proceed to the next task after successful completion of the previous task, proceed to the next task after failure of the previous task, or proceed to the next task on completion of the previous task, regardless of whether it succeeds or fails. These options are shown on the designer workspace as colored arrows that connect the constrained tasks: Success (green), Failure (red), or Completion (blue).

Of course, it's common to have more than one precedence constraint associated with a particular task. The most typical situation is to configure both Success and Failure constraints for key tasks. Following is an example of using two constraints with a single source and destination task. You might choose to associate two Execute SQL tasks for this BI ETL scenario. For example, after successfully completing an Execute SQL task, such as loading a dimension table, you might then want to configure a second Execute SQL task, such as loading a fact table. However, you might also want to load neither the dimension nor the fact table if the dimension table task does not complete successfully. A first step in this scenario would be to configure a failure action for the first Execute SQL task. For example, you can choose to send an e-mail to an operator to notify that person that the attempted load has failed.

Note In the preceding scenario, you might also want to make the two Execute SQL tasks transactional—that is, either both tasks execute successfully, or neither task executes. SSIS includes the capability to define transactions and checkpoints. We'll cover this functionality in Chapter 16.
To add a constraint, simply right-click on any task and then click Add Precedence Constraint. You can add a constraint from the resulting dialog box by selecting another task to use as the end point for the constraint. If you right-click the newly created constraint and choose Edit, you’ll then see the dialog box in Figure 15-14. In addition to being able to specify the desired logic for multiple constraints—that is, logical AND (all constraints must be true) or logical OR (at least one constraint must be true)—you can also associate a custom expression with each constraint. If you select a logical AND, the arrow appears as a solid line; if you select a logical OR, the arrow appears as a dotted line. In Figure 15-14 we’ve also added an expression to our example constraint. We go into greater detail about the mechanics of expressions later in this chapter. At this point, we’ve configured our constraint to be set for the Success execution condition or the value of the expression evaluating to True. Your options are Constraint, Expression, Constraint And Expression, Expression Or Constraint.
Figure 15-14 SSIS Precedence Constraint Editor
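The evaluation operation, execution result, and expression shown in the Precedence Constraint Editor correspond directly to properties on the PrecedenceConstraint object in the runtime API. The following C# sketch is our own illustration (the task and variable names are hypothetical); it reproduces the "Success or expression is true" configuration described above.

using Microsoft.SqlServer.Dts.Runtime;

class ConstraintExample
{
    // 'loadDimension' and 'loadFact' are tasks already added to 'package'.
    static void ConstrainTasks(Package package, Executable loadDimension, Executable loadFact)
    {
        PrecedenceConstraint constraint =
            package.PrecedenceConstraints.Add(loadDimension, loadFact);

        // Evaluate the execution result OR the expression (Expression Or Constraint).
        constraint.EvalOp = DTSPrecedenceEvalOp.ExpressionOrConstraint;
        constraint.Value = DTSExecResult.Success;
        constraint.Expression = "@[User::RowCount] > 0";   // hypothetical variable

        // LogicalAnd controls how multiple constraints on the same task combine.
        constraint.LogicalAnd = true;
    }
}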
Figure 15-15 shows how this type of constraint is represented on the designer surface—by a green dashed arrow and a small blue expression icon (fx). Because of the power of constraints, it’s important to document desired package behavior in sufficient detail early in your BI project. We often find that new SSIS developers underuse the capabilities of constraints, expressions, and group containers.
Figure 15-15 SSIS precedence constraint showing OR condition and an associated expression
One final point about constraints and groups is that SSIS allows you to create fully transactional packages. We’ll talk more about this in the specific context of star schema and OLAP cube loading in BI projects in Chapter 17, “Microsoft SQL Server 2008 Integration Services Packages in Business Intelligence Solutions.” Next we’ll take a closer look at the key Data Flow task.
Data Flow

This section is an introduction to building a package's data flow in BIDS, with information on how to use sources, transformations, and destinations as the building blocks for the data logic of a package. Common or important transformations are highlighted and described, but this section does not provide a laundry list of all data flow components. Instead, this section focuses on common concepts and real-world usage scenarios, and it refers you to SQL Server Books Online for the complete list of all components.

As mentioned, it's in no way a requirement that every SSIS package include one or more Data Flow tasks. For an example of an SSIS package that does not include a data flow, see the ProcessXMLData.sln solution, which is part of the CodePlex SSIS samples. It's installed at C:\Program Files\Microsoft SQL Server\100\Samples\Integration Services\Package Samples\ProcessXMLData Sample by default. That being said, most of your packages will include at least one Data Flow task, so we'll take some time to look at how this type of task works.

To get started, we'll continue to work with the UsingExecuteProcess.dtsx sample package that we looked at earlier in this chapter. Note that it contains one Data Flow task. To see more, right-click that task and then click Edit. Doing this opens the task's contents on the data flow designer surface. This is shown in Figure 15-16. Also, note that the contents of the Toolbox now reflect items available for use on the data flow designer surface—sources, transformations, and destinations.
Figure 15-16 Sample data flow designer surface in SSIS
In this sample, the contents of the sample Data Flow task are simple. This task contains only a single source, a transformation, and a single destination. In this sample, the transformation is a Data Conversion transformation. We’ll see in Chapter 17 that in particular BI scenarios, data flow configurations can sometimes be much more complex than this example. We’ll start by taking a closer look at available source components.
Data Flow Source Components

As shown in Figure 15-17, SSIS data flow sources include the following six sources:

■ ADO NET Source
■ Excel Source
■ Flat File Source
■ OLE DB Source
■ Raw File Source
■ XML Source
Figure 15-17 Data flow sources in SSIS
The most frequently used sources in BI scenarios are OLE DB Source, ADO NET Source, and Flat File Source. We have also occasionally used Excel Source, Raw File Source, and XML Source. Our sample package (UsingExecuteProcess.dtsx) uses an OLE DB source, so we’ll examine that type of source in greater detail next. Right-click the OLE DB source and select Edit to open the dialog box shown in Figure 15-18. Here you’ll associate a connection manager and source query (or table or view selection) with the OLE DB source. This selection can also be based on the results of a variable. Next you make any name changes to the available columns for output, and finally you configure error output information at the level of each available column. We’ll take a closer look at error handling later in Chapter 16.
Figure 15-18 OLE DB Source Editor
Most components (for example, sources, destinations, and transformations) in the data flow designer have two edit modes: simple (which we covered earlier) and advanced. You can see the advanced editor by right-clicking on a component and then clicking on Show Advanced Editor. The advanced editing view allows you to take a closer look at assigned values for metadata, as well as to set additional property values. One advanced property we sometimes work with is found on the Component Properties tab. On this tab, you have the ability to change the default value of True for the ValidateExternalMetadata property to False. This turns off validation of metadata, such as database structures, file paths, and so on. We sometimes find this helpful if we’re planning to define these values later in our SSIS package development cycle—usually by associating the values with variables. Before we move to the destinations, we remind you that in the data flow designer, connections between sources, transformations, and destinations are called paths, not precedence constraints as they were called in the control flow designer. This is an important distinction. Even though both items are represented by red or green arrows on their respective designer surfaces they, in fact, represent different functionality. If you right-click on the OLE DB source and then click Add Path, you’ll see a dialog box that allows you to create a path (called a connector) between a source and a transformation or destination. Although working with sources is pretty self-explanatory, we want to take a minute to discuss the specifics of the Raw File source and a common use case for it. The first point to
understand is that a raw file is created as the output of using a Raw File destination. Another point to consider is that the Raw File source does not use a connection manager—you simply point it to the file on the server. The Raw File source is particularly lightweight and efficient because of its design. The Raw File source does have some limitations, though—such as the fact that you can only remove unused columns (you cannot add columns to it). A common business scenario for its use is to capture data checkpoints, which can be used later if you need to recover and restart a failed package. We’ll talk more about creating self-healing packages by using checkpoints in Chapter 16.
Destination Components

You might be wondering why the set of destination components differs from the set of source components. Note that in addition to the familiar (from the source components) ADO NET Destination, Excel Destination, Flat File Destination, OLE DB Destination, and Raw File Destination, you can also select from additional, more specialized destinations. However, there is no XML destination included. The ADO.NET destination is new to SQL Server 2008. It contains an advanced option that allows you to configure the destination batch size as well. We manually set the batch size for very large data transfers so that we can control the performance of an SSIS package more granularly.

The destination components are mostly optimized to work with various components of SQL Server itself. They include the Data Mining Model Training, DataReader, Dimension Processing, Partition Processing, SQL Server, and SQL Server Compact destinations. Also, you have a Recordset destination available. These are shown in Figure 15-19.
Figure 15-19 Data flow destinations
At this point, we’ll just take a look at the ADO.NET destination, mostly for comparison to the ADO.NET source component. As with most other types of SSIS data sources, using SSIS data destination components generally requires that you associate them with a connection manager (except in the case of the Raw File destination). A typical workflow first connects either a green or red arrow to a data destination; it then configures a connection manager; and finally, it configures any other properties that are needed to provide the business functionality required. In earlier chapters, we briefly looked at the destinations that relate directly to BI projects— that is, Data Mining Model Training, Dimension Processing, and Partition Processing. In the next chapter, when we look at example packages for BI scenarios, we’ll revisit these BI-specific destinations. First, though, we’ll continue on our journey of exploring basic package design. To do that, we’ll look next at data flow transformations.
Transformation Components

Many transformations are available in Integration Services. SQL Server Books Online does a good job of providing a basic explanation of the 29 available transformations under the topic "Integration Services Transformations," so we won't repeat that information here. Rather, we'll consider categories of business problems common in BI ETL scenarios and look at transformations that relate to these problems.

Before we start, we'll remind you that because we're working in the data flow designer, we can connect to transformations from sources and other transformations. Transformations can also be connected to other transformations and to destinations. This creates the path or paths for our data flow or flows. These input and output connections are shown on the designer surface by the now familiar green (for good rows) or red (for bad rows) arrows.

To help you understand the capabilities available in SSIS as data transformations, we'll group the available transformations by function. For reference, see all available transformations in the Toolbox window shown in Figure 15-20. We'll start with those that relate to data quality—Audit, Character Map, Conditional Split, Fuzzy Grouping, Fuzzy Lookup, Lookup, Percentage Sampling, Row Sampling, Row Count, Term Extraction, and Term Lookup. Two considerations are related to this group of transformations: the performance of the Lookup transformation has been improved in SQL Server 2008, and the new Cache Transform transformation can be used to manage a cache for lookups.

We'll look next at the transformations that we most commonly use to prepare data for loading into star schemas or data mining structures: Aggregate, Cache Transform, Copy Column, Data Conversion, Derived Column, Export Column, Import Column, Merge, Merge Join, Multicast, OLE DB Command, Pivot, Unpivot, Script Component, Slowly Changing Dimension, Sort, and Union All. To do this, we'll explore the Aggregate transformation in more detail.
Figure 15-20 Data flow transformations
Open the Calculated Columns sample package contained in the CodePlex SSIS samples mentioned earlier in this chapter. As with the other package, the package version as of the writing of this book is 2005, so the Upgrade Package Wizard starts when you open the package. Click through the wizard to upgrade the package, and then double-click the package named CalculatedColumns.dtsx to open it in the BIDS designer. Double-click the Data Flow task named Calculate Values to open the contents of that task on the data flow designer. Note the Aggregate transformation, which is named Sum Quantity And LineItemTotalCost. As its name indicates, this transformation is used to perform column-level aggregation, which is a common preparatory task in loading data into a star schema. After this transformation has been connected to an incoming data flow, you edit this transformation to configure the type of aggregation to perform. Each column in the data must be aggregated. As with the source and destination components, there is a simple and advanced editing view for most transformations. Figure 15-21 shows edit information for a sample Aggregate transformation. Finally, we’ll review a transformation that relates to working with data mining structures—the Data Mining Query. As we mentioned in Chapter 13, “Implementing Data Mining Structures,” the Data Mining Query transformation allows you to configure a DMX prediction query (using the PREDICTION JOIN syntax) as part of a data flow. It requires that all input columns be presorted. We’ll soon be looking at SSIS packages that use these various transformations in Data Flow tasks that solve BI-related ETL issues. We’re not quite ready to do that yet, however, because
we have a few other things to look at in basic SSIS package construction. The next item is data viewers.
Figure 15-21 An Aggregate transformation
Integration Services Data Viewers

A great feature for visually debugging your data flow pipeline is the data viewer capability in SSIS. These viewers allow you to see the actual data before or after a transformation. You can view the data in a graphical format (histogram, scatter plot, or column chart), or you can view a sampling of the actual data in grid form as it flows through the pipeline you've created.

To create a data viewer, you right-click on a red or green path and then click Data Viewers. Then select the type of data viewer you want to use and which data from the data flow (columns) you want to view. The dialog box you use for doing this is shown in Figure 15-22. After you select and configure data viewers, they appear as a tiny icon (a grid with sunglasses) next to the path where you've configured them. Data viewers produce results only inside of BIDS; if you execute your package outside of BIDS, data viewer results won't appear.

To see how they work, we've created a simple package using the relational sample database AdventureWorksLT that uses the Aggregate transformation to aggregate some table data. We've added a data viewer before and after the transformation.
Figure 15-22 Data viewers
When you run the package in BIDS, the package execution pauses at the data viewer. You can then view or copy the results to another location and then resume execution of the package. In the following example, we’ve halted execution before the transformation and we’re using two data viewers—grid and column chart. To continue execution of the package, click the small green triangle in the data viewer window. You can see the results of the data flow in Figure 15-23.
Figure 15-23 Data viewers are visual data flow debuggers.
Variables

Although we've mentioned variables several times in this chapter, we haven't yet considered them closely. In this section, we examine variables in the context of Integration Services. Like any other software development platform, Integration Services supports variables as a mechanism for storing and sharing information. For developers coming from a traditional programming language such as Microsoft Visual Basic or Visual C#, or for those who are experienced with database programming in a language such as Transact-SQL, variables in SSIS might seem a little strange, mostly because of the differences in the SSIS platform itself. To declare variables in SSIS, you use the Variables window found in the visual designers.
Variables Window

To open the Variables window, select View, Other Windows, Variables, or select SSIS, Variables from the menu bar. You can also display it by right-clicking in an empty area in the package designer and choosing Variables. To define a new variable in a package, click the Add Variable button (the first button on the left) on the toolbar in the Variables window, as shown in Figure 15-24.
Figure 15-24 Adding a variable
The Variables window is another window added to Visual Studio with the SSIS project template. It shows the variables defined within the package and provides the capability to add, delete, or edit variables. It’s worth noting that by default the Variables window displays only user variables that are visible (based on their scope, which we’ll cover later in this section) to the currently selected task or container. As mentioned in Chapter 14, to view system variables or to view all variables regardless of their scope, you can use the third and fourth buttons from the left on the Variables window toolbar to toggle the visibility of these variables on and off. Note BIDS Helper adds a new button to the Variables window toolbar called Move/Copy Variables To A New Scope. It is shown in Figure 15-24 as the second-to-last button from the left. This button adds functionality to SSIS that allows you to edit the scope of the variable using a pop-up dialog box called Move/Copy Variables.
Variable Properties

In most development platforms, a variable simply holds or references a value, and the variable's data type constrains what values are allowed. In SSIS, however, variables are more complex objects. Although each variable has a data type and a value, each variable also has a set of properties that control its behavior, as shown in Figure 15-25, which presents the properties for an SSIS variable.
Figure 15-25 Variable properties
As you can see from Figure 15-25, there are quite a few properties, not all of which can be set by the developer. Here are some of the most important properties that each variable has:

■ Description  The Description property is essentially variable-level documentation. It does not affect the variable's function at all, but it can help make the package easier to maintain.

■ EvaluateAsExpression  This Boolean property determines whether the Value property is a literal supplied by the developer at design time (or set by a package component at run time) or if it's determined by the variable's Expression property. If EvaluateAsExpression is set to True, any value manually assigned to the Value property for that variable is ignored, and the Expression property is used instead.

■ Expression  The Expression property contains an expression that, when the EvaluateAsExpression property is set to True, is evaluated every time the variable's value is accessed. We'll go into more detail about expressions in the next section, but for now keep in mind that having variables based on expressions is a crucial technique that is part of nearly every real-world SSIS package.

■ Name  This property sets the programmatic name by which the variable will be accessed by other package components. SSIS variable names are always case sensitive. Forgetting this fact is a common mistake made by developers coming from non–case sensitive development environments such as Visual Basic or Transact-SQL. It's also important to remember that SSIS variable names are case sensitive even when being accessed from programming languages that are not inherently case sensitive themselves, such as Visual Basic.

■ Namespace  All SSIS variables belong to a namespace, and developers can set the namespace of a variable by setting this property, which serves simply to give additional context and identity to the variable name. By default, there are two namespaces: all predefined variables that are supplied by the SSIS platform are in the System namespace, and all user-defined variables are in the User namespace by default. Please note that variable namespaces are also case sensitive.

■ RaiseChangedEvent  This Boolean property determines whether an event is fired when the value of the variable changes. If the RaiseChangedEvent property is set to True, the OnVariableValueChanged event for the package is fired every time the variable's value changes. You can then build an event handler for this event. To determine which variable changed to cause the event to fire (because there can be any number of variables within a package with this property set), you can check the value of the VariableName system variable within the OnVariableValueChanged event handler.

■ Scope  Although the Scope property cannot be set in the Properties window, it's a vital property to understand and to set correctly when the variable is created. The Scope property of a variable references a task, a container, or an event handler within the package, or the package itself, and it identifies the portions of a package where the variable can be used. For example, a variable defined at the scope of a Sequence container can be accessed by any task within that container, but it cannot be accessed by any other tasks in the package, while a variable defined at the package scope can be accessed by any task in the package. The scope of a variable can be set only when the variable is created. To specify a variable's scope, click on the package, container, or task and then click the Add Variable button shown in Figure 15-25. If a variable is created at the wrong scope, the only way to change it is to delete and re-create the variable if you are using SSIS out of the box. BIDS Helper includes a tool that allows you to easily change variable scope.

■ Value  The Value property is self-explanatory; it's the value assigned to the variable. But keep in mind that if the variable's EvaluateAsExpression property is set to True, any value entered here is overwritten with the value to which the variable's expression evaluates. This is not always obvious because the Properties window allows you to enter new values even when the variable's EvaluateAsExpression property is set to True. It simply immediately replaces the manually entered value with the expression's output.

■ ValueType  The ValueType property specifies the data type for the variable. For a complete list of SSIS data types, see the topic "Integration Services Data Types" in SQL Server Books Online.
System Variables

In addition to the user variables created in a package by the package developer, SSIS provides a large set of system variables that can be used to gather information about the package at run time or to control package execution. For example, it's common to add custom auditing code to a package so that the package writes information about its execution to a database table. In this scenario, the package developer can use the PackageName system variable to log the package's Name property, the PackageID system variable to log the package's unique identifier, the ExecutionInstanceID system variable to log the GUID that identifies the executing instance of the package, and the UserName system variable to log the user name for the Windows account that is executing the package. For a complete list of SSIS system variables, see the topic "System Variables" in SQL Server Books Online.
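As a concrete illustration of the auditing scenario, the following C# Script task code (our own sketch, not from the book's samples) reads two system variables and a hypothetical user variable and raises an information event with the combined values. It assumes those variables are listed in the Script task's ReadOnlyVariables property.

// Inside the ScriptMain class that the Script task generates (C#, VSTA).
public void Main()
{
    // System variables supplied by the SSIS runtime.
    string packageName = Dts.Variables["System::PackageName"].Value.ToString();
    string userName = Dts.Variables["System::UserName"].Value.ToString();

    // User::RowCount is a hypothetical Int32 variable populated elsewhere in the package.
    int rowCount = (int)Dts.Variables["User::RowCount"].Value;

    bool fireAgain = true;
    Dts.Events.FireInformation(0, "Audit",
        string.Format("{0} run by {1}; {2} rows processed.", packageName, userName, rowCount),
        string.Empty, 0, ref fireAgain);

    Dts.TaskResult = (int)ScriptResults.Success;
}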
Expressions

As you've probably noticed, SSIS variables and expressions often go hand in hand. Expressions can be used in many different parts of a package; tasks, containers, precedence constraints, and connection managers all support having expressions applied to their properties. For example, a File connection manager can use an expression to set its ConnectionString property. Developers can add expressions on most—but not all—properties of most—but not all—package items. Selecting the Expressions property in the Properties window and clicking the ellipsis button opens an editor dialog box where the developer can select a property and then build an expression for that property.

Expressions are incredibly powerful, but they can also be incredibly frustrating at times. This is not because of their functionality, but because of their discoverability, or lack thereof. BIDS does not provide any way to tell where expressions are being used within a package, except by manually expanding the Expressions collection for each object. As you can imagine, this is less than ideal and can be the cause of some frustration when working with complex packages.
Fortunately, help is available. As mentioned, BIDS Helper has two features designed to ease this pain. One is the Expression And Configuration Highlighter, which places a visual indicator on each task or connection manager that has a property set via an expression or a package configuration. The other is the Expressions List window, which displays a list of all object properties in the package for which expressions have been configured. The Expressions List window also allows developers to edit the expressions, so it’s more than just a read-only list.
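Property expressions can also be attached in code, which is sometimes useful in package-generation utilities. The C# fragment below is a minimal sketch of our own (the connection manager name and variable are hypothetical); it assumes the SetExpression method that the runtime API exposes on property objects.

using Microsoft.SqlServer.Dts.Runtime;

class ExpressionExample
{
    // Attach an expression to a connection manager's ConnectionString property,
    // equivalent to using the Expressions page in the designer.
    static void BindConnectionString(Package package)
    {
        ConnectionManager flatFile = package.Connections["Export File"];   // hypothetical name

        // The expression is evaluated at run time; here it simply returns the
        // value of a user variable that holds the full file path.
        flatFile.Properties["ConnectionString"].SetExpression(
            flatFile, "@[User::ExportFilePath]");
    }
}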
Variables and Default Values Within a Package

Although we've already looked at many different aspects of variables in SSIS packages, so far we haven't really said much about actually using them. There are quite a few common scenarios for using variables:

■ Capturing query results  The Execute SQL task can execute SQL queries and stored procedures that can return scalar values or tabular result sets. The editor for the Execute SQL task has a Result Set tab where you can specify variables in which to store the values returned by the query.

■ Counting rows in the data flow  The Row Count transformation has a VariableName property; the variable specified here will have assigned to it the number of rows that pass through the Row Count transformation when the Data Flow task executes. This value can then be used later on in the package's control flow.

■ Dynamic queries  The Execute SQL task has an SqlStatementSourceType property, which can be set to Variable. Then the value of the variable identified by the task's SqlStatementSource property will be used for the query being executed. In addition, the OLE DB source component has a SQL Command From Variable command type that operates much the same way—you can store the SELECT statement in a variable so that it can be easily updated. A common use for this technique is to have the WHERE clause of the SELECT statement based on an expression that, in turn, uses other variables to specify the filter criteria.

■ Foreach Loop enumerators  This container supports a number of different enumerators—things that it can loop over. One of the built-in enumerators is the Foreach File Enumerator. This will loop over all the files in a folder and assign the name of the current file to the Value property of a variable specified by the package developer. In this scenario, any number of expressions can be based on the variable that contains the file name, and those expressions will always be current for the current file.
In scenarios like the ones just described, where a variable is updated by a package component such as the Foreach Loop container and then used in expressions elsewhere in the package, it’s important that the variable is configured with an appropriate default value. For example, if a connection manager’s ConnectionString property is based on an expression that
uses a variable that will contain a file name at run time, the variable should be assigned a default value that is a valid file path. What’s appropriate for the default value depends on the purpose of the variable and the value it’s designed to hold. If the variable is an integer that will hold a row count, –1 is generally a safe default value, both because it’s a valid number and because it’s unlikely to occur during package execution, making it obvious that it’s the design-time default. If the variable is a string that will hold a table name for use in an expression that will be used to build a SELECT statement, the default value should generally be a valid table name in the target database. There will certainly be many other situations where a default value must be selected. There is no single solution that will fit every situation, but in most cases the default value should be an example of the type of value that will be assigned to that variable during package execution. Just keep in mind that the variables will be used during package validation before the package executes, and that will keep you on the right track. There are many other ways to use variables in SSIS packages, but these common scenarios probably are enough to show how flexible and powerful SSIS variables can be. In the chapters ahead, we’ll look at more examples of using variables with specific package components, but this should be enough to get you started.
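To tie the dynamic-query scenario and the advice about default values together, the following C# sketch (our own, with hypothetical variable names and a hypothetical table) creates the variables through the runtime API, gives each one a valid design-time default so that validation succeeds, and lets an expression build the actual SELECT statement at run time.

using Microsoft.SqlServer.Dts.Runtime;

class DefaultValueExample
{
    static void AddQueryVariables(Package package)
    {
        // A valid table name keeps design-time validation happy.
        Variable tableName = package.Variables.Add(
            "TableName", false, "User", "SalesLT.Customer");

        // Harmless placeholder query used only until the expression takes over.
        Variable selectQuery = package.Variables.Add(
            "SelectQuery", false, "User", "SELECT 1");

        // At run time the expression rebuilds the statement from User::TableName.
        selectQuery.EvaluateAsExpression = true;
        selectQuery.Expression = "\"SELECT * FROM \" + @[User::TableName]";
    }
}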
Summary
In this chapter, we covered most of the basic package mechanics, looking at many of the tools that BIDS provides to enable package development and at the scenarios in which they can be used. We reviewed the tasks available in the control flow and the components available in the data flow. We also looked at how precedence constraints are implemented, using both constraints and expressions. In Chapter 16, we look at some of the more advanced features in Integration Services, from error handling, events and logging, and checkpoints and transactions to data profiling. You will begin to see specific examples that relate to BI solutions.
Chapter 16
Advanced Features in Microsoft SQL Server 2008 Integration Services
In Chapter 15, "Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio," we looked at the mechanics of working with some of the major components that make up the Microsoft SQL Server 2008 Integration Services (SSIS) platform. We covered the development of packages with Business Intelligence Development Studio (BIDS), control flow, data flow, variables, expressions, and connection managers. In this chapter, we work with some of the more advanced features available when developing Integration Services packages, including error and event handling, logging, debugging, and checkpoints and transactions. We also recommend some best practices you should follow when designing Integration Services packages. Finally, we introduce data profiling, a new feature in SQL Server 2008. The chapter progresses from the more general activities involved in all Integration Services packages to information more specific to business intelligence (BI), such as OLAP cube loading and maintenance and data mining structure loading.
Error Handling in Integration Services
One compelling reason to use SSIS to perform ETL, rather than simply using Transact-SQL scripts or custom-coded solutions, is the ease with which you can configure general business logic, particularly error handling. Error-handling responses are configured differently depending on where the error originates. Let's start by looking at where most of the errors you'll want to trap and correct occur: the data flow. To do this you'll work at the level of components on the data flow designer surface. For most sources and destinations, you can edit the component and you will see an Error Output page like the one shown in Figure 16-1. (The Raw File source and destination components do not allow you to configure an error output.) The example in Figure 16-1 is an ADO.NET source that has been configured to provide data from the AdventureWorksLT SalesLT.Customer table (all columns). The Error Output page allows you to configure Error and Truncation actions at the column level. The default value is Fail Component for both column-level errors and truncations. It is important that you understand what types of conditions constitute data flow errors. Here's the definition from SQL Server 2008 Books Online: An error indicates an unequivocal failure, and generates a NULL result. Such errors can include data conversion errors or expression evaluation errors. For example,
an attempt to convert a string that contains alphabetical characters to a number causes an error. Data conversions, expression evaluations, and assignments of expression results to variables, properties, and data columns may fail because of illegal casts and incompatible data types.
FIGURE 16-1 Column-level error handling in the data flow
Truncations occur when you attempt to put a string value into a column that is too short to contain it. For example, inserting the eight-character string December into a column defined as three characters would result in a truncation. You can change the value to either Ignore Failure or Redirect Row. If you choose the latter, you must connect the error output from the component to the input of another component so that any possible error or truncated rows are handled. The error output is the red path arrow, and connecting it to another component allows these error (or truncated) rows from the first component to flow to the second component. In addition to understanding how to configure error handling in the data flow, it is also important that you understand the default error-handling behavior of an SSIS package, container, or task. At this level you will work with a couple of object-level property views to configure a number of aspects of error handling, such as the number of allowable errors (with a default of 1 for the package, container, or task), whether an object will execute successfully (useful for debugging because you can set a task to fail), and more. Figure 16-2 shows some of the properties for a package that relate to error handling and execution: FailPackageOnFailure, FailParentOnFailure, and MaximumErrorCount. The ForceExecutionResult property (not pictured) allows you to force the object to succeed or fail.
FIGURE 16-2 Package-level default property settings
Containers and tasks have similar properties. Note the FailParentOnFailure property: The default setting is False for all objects. Of course, changing this property to True is only meaningful at a level lower than the entire package, meaning a container or task. If you do change this value to True for a container or for a task, it will cause the parent object (the container or package) to fail if its child object fails. We’ll talk more about task, container, and package error handling (and recovery) later in this chapter, when we discuss checkpoints and transactions. Before we do that, however, we’ll take a closer look at events and logging, because you’ll often include responses to events that fire in your error-handling strategy.
Events, Logs, Debugging, and Transactions in SSIS
On the Event Handlers tab in BIDS, you can configure control flows that run in response to particular types of events firing at the package, container, or task level. You can define event handler (control) flow responses to 12 different types of events. When you define a response, you select the entire package or a selected container or task and then select the type of event you are interested in. We most commonly use these event handler types for error responses: OnError and OnTaskFailed.
Another use of event handlers is to record package execution activity for compliance (auditing) purposes. For auditing, we often use the event handlers OnExecStatusChanged, OnProgress, and OnVariableValueChanged. We also sometimes want to trap and log events during the early development phases of our SSIS projects. Using this technique, we can capture, view, and analyze activities that fire events in our prototype packages. This helps us to better understand execution overhead during early production and even pilot testing. This type of testing is, of course, most useful for packages that are used to update OLAP cubes and data mining structures, rather than packages that perform one-time initial SSAS object loads. Update packages are frequently reused; we typically run them daily. Insert packages are run far less frequently. Events for which you have configured a custom control flow will appear in bold in the list of possible events to capture for the particular object (the package, container, or task). We've configured a simple example (shown in Figure 16-3) using the OnPostExecute event for the entire package.
FIGURE 16-3 Package-level event handlers
After you select the object in the Executable list, select the event handler that you want to create a control flow for, and then double-click the designer surface to activate the event. To configure it, drag control-flow tasks to the Event Handlers designer surface. This surface functions similarly to the Control Flow designer surface. Continuing our example, we created a very simple package with one Execute SQL task in the main control flow, and then we added an event handler for the OnPostExecute event for the package. Our event handler control flow contains one Execute SQL task. This task executes when the OnPostExecute event fires from the main control flow. Expanding the Executable drop-down list on the Event
Handlers tab of BIDS shows both Execute SQL tasks, as seen in Figure 16-4. Note that you may have to close and reopen the package in BIDS to refresh the view.
FIGURE 16-4 Event Handler Executables object hierarchy
As you'll see in a minute, even if you don't configure any type of formal package logging, when you use BIDS to execute a simple sample package, you can already see the output of the Event Handler control flow (after the event fires). A particularly interesting system variable available in event handlers is the Propagate variable. This variable is used to control the bubble-up behavior of events, that is, the characteristic of low-level SSIS tasks to fire events and then to send (bubble up) those events to their parent containers. These containers can be SSIS containers, such as Foreach Loop containers, or even the parent package itself. In addition to the Propagate variable, SSIS adds several variables scoped to the OnError event, including Cancel, ErrorCode, ErrorDescription, EventHandlerStartTime, LocaleID, SourceDescription, SourceID, SourceName, and SourceParentGUID. These variables are quite useful in packages where you wish to capture details about package or package component errors.
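As a minimal sketch of this approach (the table name and column sizes are our own assumptions, not from the book), an Execute SQL task placed in an OnError event handler can write those system variables to an error log table. The ? placeholders are mapped to the system variables on the Parameter Mapping page of the task editor.

-- Hypothetical error log table for OnError event handler output
CREATE TABLE dbo.SsisErrorLog (
    LogID            int IDENTITY(1,1) PRIMARY KEY,
    PackageName      nvarchar(200)  NOT NULL,
    SourceName       nvarchar(200)  NULL,   -- task or container that raised the error
    ErrorCode        int            NULL,
    ErrorDescription nvarchar(4000) NULL,
    LoggedAt         datetime       NOT NULL DEFAULT GETDATE()
);

-- SQLStatement for the Execute SQL task in the OnError event handler.
-- Map the ? parameters to System::PackageName, System::SourceName,
-- System::ErrorCode, and System::ErrorDescription.
INSERT INTO dbo.SsisErrorLog (PackageName, SourceName, ErrorCode, ErrorDescription)
VALUES (?, ?, ?, ?);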
Logging and Events
The Execution Results tab provides a kind of quick log of package execution in BIDS. If you have configured any event handlers, as we did in the previous example with the OnPostExecute event for the package, this window includes execution information about these events as well. Figure 16-5 shows the logging that occurs during package execution.
FIGURE 16-5 The Execution Results tab includes information about fired event handlers.
You might also want to capture some or all of the package execution information to a permanent log location. SSIS logging is configured at the package level. To do this you simply right-click the control flow designer surface and then click Logging. This will open a dialog box that includes five built-in options for logging locations to capture activity around package execution. You can select from multiple log types on the Providers And Logs tab, as shown in Figure 16-6. You can specify multiple logging providers for the same package, so, for example, you could choose to configure the package to log information to both SQL Server and a text file.
Note: When you configure a log provider for SQL Server, you must specify a connection manager. A logging table named sysssislog will be created (if it doesn't already exist) in the database that the connection manager uses. It is created as a system table, so if you are looking for it in SSMS, you must open the System Tables folder in Object Explorer. Under Integration Services 2005, the log table name was sysdtslog90.
FIGURE 16-6 Log locations
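If you do log to SQL Server, the captured events end up in rows you can query like any other table. The following is a hedged example of the kind of query we might run to review recent failures; it assumes the default SQL Server 2008 log table name mentioned in the note above and the default column names used by the SQL Server log provider.

SELECT TOP (50)
    starttime,
    source,      -- the task or container that raised the event
    event,       -- OnError, OnTaskFailed, OnWarning, and so on
    message
FROM dbo.sysssislog
WHERE event IN ('OnError', 'OnTaskFailed')
ORDER BY starttime DESC;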
After you configure your log locations on the Providers And Logs tab, you select one or more event types to capture on the Details tab, shown in Figure 16-7. The Details tab also includes an advanced view that allows you to select the columns of information you’d like to include for each type of event selected. You can save the logging configuration information to an XML file by choosing Save; you can load an existing logging configuration file with the Load button.
FIGURE 16-7 Package-level events
After you execute a package in BIDS for which you’ve configured logging, you can view the log results in the Log Events window inside of BIDS. You open this window by selecting Log Events on the SSIS menu. Logging is used for two main reasons: first for tuning and debugging packages during the early phases of your BI project, and later for meeting compliance requirements after the SSIS packages have been deployed into production environments. Be aware that logging all events for entire packages is quite resource-intensive and produces very verbose results. In our simple example we’ve captured all events (except Diagnostic) for an SSIS package with a single control flow task, which executes the simplest possible query (Use Master). The Log Events window output from executing this simple package is shown in Figure 16-8.
FIGURE 16-8 Log Events output window
The logging events you can capture and log vary by type of task in the control flow. The most notable example of this occurs when you configure logging on a package that includes a data flow task. You'll notice that when you do this, the Configure SSIS Logs dialog box contains more possible events to log than Figure 16-8 shows. The new types of events that you can trap include several related to the activity in the data flow. Figure 16-9 shows this expanded dialog box with several of the pipeline event types called out. Trapping activity around the pipeline or data flow allows you to understand the overhead involved in the data flow at a very granular level. This is important because, as we've said, the data flow is often where SSIS package bottlenecks can occur.
Tip: For a complete listing of the custom log entries on a task-by-task basis, see the SQL Server Books Online topic "Custom Messages for Logging."
FIGURE 16-9 Configure SSIS Logs dialog box
As with many other advanced capabilities of SSIS, we caution that you should plan to use just enough logging in production situations. We often find either none or too much. Usually at
least a couple of event types should be logged, such as OnError or OnTaskFailed. If you find yourself on the other side of the logging debate, that is, logging a large amount of information, be aware that such expansive logging can add unnecessary processing overhead to your SSIS server. For an extension to the logging capabilities included in SSIS, see the CodePlex project at http://www.codeplex.com/DTLoggedExec, which was written by Davide Mauri, one of this book's authors. The tool is an enhancement to DTExec that gives you a much more powerful logging capability.
Tip: If you are logging your SSIS package execution to SQL Server and you have an SSRS installation available, you can download some SSRS report templates (RDL files) that will allow you to view your SSIS package logs using the SSRS interface. These were originally developed for Integration Services 2005, so you will need to update them to use the new log table name sysssislog. These templates are freely downloadable from http://www.microsoft.com/downloads/details.aspx?familyid=D81722CE408C4FB6A4292A7ECD62F674&displaylang=en.
Debugging Integration Services Packages
You have the ability to insert breakpoints visually into SSIS packages. You can insert breakpoints into tasks in the control flow or into tasks in event handlers. You insert breakpoints by right-clicking the package designer surface or an individual task or container, and then clicking Edit Breakpoints. Figure 16-10 shows the Set Breakpoints dialog box. Here you select the event(s) and (optionally) the conditions under which the breakpoint will pause execution of the package.
FIGURE 16-10 The Set Breakpoints dialog box
You set the condition by selecting one of the options from the drop-down menu in the Hit Count Type column. After you have successfully configured a breakpoint, a red dot is placed on the task that has a breakpoint associated with it. As you execute an SSIS package that has defined breakpoints in BIDS, package execution halts at the defined breakpoints and the common debugging windows become available. We find the Locals, Watch, and Call Stack debugging windows to be most useful when debugging SSIS packages. Of course, you learned in Chapter 15 that you can also associate one or more data viewers with paths (arrows) in the data flow. You saw that the data viewers allow you to perform a type of visual data flow debugging because they halt execution of data flow until you manually continue it by clicking the green triangle on the data viewer surface. You may also remember that in addition to being able to view the data in the data flow in various formats (grid, histogram, and so on) you can also copy that data to another location for further analysis and then continue package execution.
Checkpoints and Transactions
SSIS package checkpoints are a type of optional save point that allows a package to be restarted from a point of failure. We have used checkpoints with SSIS packages that may include occasional failures resulting from (for example) a component that connects to a public Web service to retrieve data from the Internet. A checkpoint is literally an XML file that contains information about which components ran successfully at a particular execution date and time. You can configure checkpoints at the package level only. Checkpoints are often used in conjunction with transactions. We'll get into more detail about configuring package transactions shortly. The key package properties associated with checkpoint settings are CheckpointFileName, CheckpointUsage, and SaveCheckpoints. To enable checkpoints for an SSIS package, you must first configure the CheckpointFileName property. You can hard-code this XML file name, but a better practice is to use variables and expressions to dynamically create a unique file name each time the package is executed (based on package names, execution date and time, and so on). Next you change the default property value for CheckpointUsage from Never to IfExists. Then you change the SaveCheckpoints value from False to True. The next step in the checkpoint configuration process is to change the default settings on particular control flow tasks or containers to cause the package to fail immediately when a key task or component fails. The default behavior of a package is to continue to attempt to execute subsequent tasks and containers in a control flow if they are not connected to the failing task by Success precedence constraints or they are connected to the failing task by Failure or Completion precedence constraints. When using checkpoints, however, you want the package to stop and create a checkpoint file as soon as a task fails.
To write a checkpoint you must change the default setting of False to True for FailPackageOnFailure for any tasks or containers that you wish to involve in the checkpoint process. This will cause the SSIS runtime to stop package execution as soon as the task or container reports a failure, even if subsequent tasks are connected to the failing task with Completion or Failure constraints. If you set the FailParentOnFailure property for a task or container to True, this component can participate in a transaction, but no checkpoint will be written if the particular component fails. You may be wondering whether it is possible to create checkpoints in a data flow. The technical answer is "not by using SSIS checkpoints." We do use the Raw File destination component to create save points in a particular data flow. Remember that the Raw File is a particularly fast and efficient storage mechanism. We use it to create a temporary storage location for holding data that can be used for task or package restarts, or for sharing data between multiple packages. Checkpoints are frequently used with manually defined transactions, so we'll take a look at how to do that next.
Tip: If you are using SQL Server 2005 or 2008 Enterprise edition data sources in your package, another potential method of implementing package restarts is to take advantage of the database snapshot feature introduced in SQL Server 2005. If you are not familiar with this technique, read the SQL Server Books Online topic "Creating a Database Snapshot." You can use event handler notifications in conjunction with transactions. By combining these two powerful features, you can establish package rollback scenarios that allow you to revert to a previously saved version of your source database. To do this, you configure a call to a specific database snapshot at a particular package failure point.
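To make the snapshot idea concrete, here is a minimal Transact-SQL sketch. The database name, logical file name, and snapshot file path are illustrative assumptions; you would substitute your own, and the CREATE statement would typically run before the package starts while the RESTORE runs from an error-handling step.

-- Create a snapshot of the source database before the package runs
CREATE DATABASE AdventureWorks_Snapshot
ON ( NAME = AdventureWorks_Data,                      -- logical file name of the source database
     FILENAME = 'D:\Snapshots\AdventureWorks_Snapshot.ss' )
AS SNAPSHOT OF AdventureWorks;

-- If the package fails, revert the source database to the snapshot
RESTORE DATABASE AdventureWorks
FROM DATABASE_SNAPSHOT = 'AdventureWorks_Snapshot';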
Configuring Package Transactions
SSIS packages include the ability to define transactions at the package, container, or task level. You configure two key property settings to enable transactions. The first is IsolationLevel, which has the following settings available: Unspecified, Chaos, Read Uncommitted, Read Committed, Repeatable Read, Serializable (the default and the most restrictive isolation level), and Snapshot. For more information about the locking behavior that each of these isolation levels produces, read the SQL Server Books Online topic "Isolation Levels in the Database Engine." The second setting is TransactionOption, which has the following settings available: Supported, Not Supported, and Required. The default setting is Supported, which means that the item (package, container, or task) will join any existing transaction, but will not start a unique transaction when it executes. Required means that a new transaction is originated by the package, container, or task configured with that setting only if an existing higher-level transaction does not already exist. For example, if a container is set to Required and no
transaction is started at the package level, invocation of the first task in that container starts a new transaction. However, in the same scenario, if the container's parent (in this case, the package) had already started a transaction, the firing of the first task in the container causes the container to join the existing transaction.
Tip: Configuring checkpoints and transaction logic looks deceptively simple. You just change the default configuration settings of a couple of properties, right? Well, yes, but real-world experience has taught us to validate our logic by first writing out the desired behavior (whiteboarding) and then testing the results after we've configured the packages. "Simpler is better" is a general best practice for SSIS packages. Be sure that you have valid business justification before you implement either of these advanced features, and make sure you test the failure response before you deploy your package to production!
Transactions do add overhead to your package. SSIS will use the Microsoft Distributed Transaction Coordinator (MS-DTC) service to facilitate transactions if you use the TransactionOption property in your packages. The MS-DTC service must be appropriately configured and running for the SSIS transactions to work as expected. There is an exception to this if you manually control the transactions. You can use Execute SQL tasks to issue your own BEGIN TRANSACTION and COMMIT TRANSACTION commands to manually start and stop the transactions. Be aware that, if you take this approach, you must set the RetainSameConnection property of the connection manager to True. This is because the default behavior in SSIS is to drop the connection after each task completes, so that connection pooling can be used. A package can use a single transaction or multiple transactions. We prefer a simpler approach, and when configuring transactions we tend to break the package up into smaller packages if we find a need for multiple transactions in a package. Each smaller package can contain a single transaction. We'll give specific examples related to solving BI-ETL solutions around transactions in Chapter 17, "Microsoft SQL Server 2008 Integration Services Packages in Business Intelligence Solutions."
Tip: During testing of transactions in SSIS, we find it useful to set the ForceExecutionResult property value to Failure for tasks or containers. This is a quick and easy way to guarantee a failure so that you can easily test your transaction, checkpoint, and recovery package logic. If you use this technique, remember that you'll also have to set the FailPackageOnFailure property value for each task and container involved in the transaction to True.
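The manual approach described above boils down to three small statements spread across Execute SQL tasks that all use the same connection manager (with RetainSameConnection set to True). This is only a sketch of the pattern, not code from the book:

-- Execute SQL task at the start of the sequence
BEGIN TRANSACTION;

-- ...data flow and other tasks run here, sharing the same connection manager...

-- Execute SQL task on the success path
COMMIT TRANSACTION;

-- Execute SQL task on the failure path (for example, in an OnError event handler)
ROLLBACK TRANSACTION;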
As we close this section, we want to point out that you can't enable both checkpoints and transactions at the package level. You can combine checkpoints with transactions on containers in the package, just not with a package-level transaction.
Best Practices for Designing Integration Services Packages
When it comes to designing packages, you have a lot of choices. Learning by trial and error is one way to find out the pitfalls and shortcuts in package design. Another way is to adopt the best practices of experienced package developers. We recommend that you follow these best practices:
■■ Favor more and simpler packages over fewer, complex packages. For BI projects, it's not uncommon to use one package for each dimension and fact table and sometimes even one package per data source per dimension or fact table.
■■ Evaluate the overhead and complexity of performing advanced transformations while also considering the quantity of data to be processed. Favor intermediate on-disk storage over intensive in-memory processing. We find that disk storage is cheaper and more effective than intensive, in-memory transformations. We often use the Raw File source and destination components to facilitate quick and easy data storage for this type of process.
■■ Evaluate the quality of source data early in the project. Utilize error handling, debugging, logging, data viewers, and the Data Profiling task to quickly evaluate source data quality so that you can more accurately predict the volume of work needed to create production packages. Favor creating intermediate on-disk temporary storage areas and small packages that perform individual units of work, rather than large and complex packages, because these simpler packages execute more efficiently, contain fewer bugs, and are easier to maintain and debug. Just like any other code you write, simpler is better.
■■ If you choose to process large amounts of potentially bad data, favor preprocessing or intermediate processing of excessively dirty data. Utilize logging, error handling, and events to capture, redirect, and correct bad data. Work from more-specific to less-specific methods of identifying and cleaning bad data and their associated transformations; for example, use a Lookup before a Fuzzy Lookup. We'll cover fuzzy lookups in Chapter 17.
■■ If you use intensive and complex in-memory transformations, test with production levels of data prior to deployment, and tune any identified bottlenecks. Tuning techniques include reducing the data included (for all transformation types), improving lookup performance by configuring the cache object, writing more effective Transact-SQL queries for the Execute SQL task, and more. Note that in SSIS 2008, you can use the Cache Transform to bring in data for lookups from OLE DB, ADO.NET, or flat-file sources. (In SSIS 2005, only OLE DB sources were supported as a lookup reference dataset.)
■■ Utilize the self-healing capabilities of SSIS for mission-critical packages. These include bubbling up events and errors and using checkpoints, database snapshots, and transactions. For more information about database snapshots, see the SQL Server Books Online topic "Database Snapshots."
If you follow these practices when you design packages, you'll be observing lessons that we learned the hard way while working with SSIS. As with any powerful tool, if used correctly SSIS can significantly enhance your BI project by helping you load better-quality data into your OLAP cubes and data mining structures faster and more accurately.
Tip: BIDS Helper includes a useful tool for helping you to understand SSIS package performance: the SSIS Performance Visualization tool. It creates a graphical Gantt chart view of the execution durations and dependencies for your SSIS packages.
Data Profiling
The control flow Data Profiling task relates to business problems that are particularly prominent in BI projects: how to deal with huge quantities of data and what to do when this data originates from disparate sources. Understanding source data quality in BI projects (when scoping, early in prototyping, and during package development) is critical when estimating the work involved in building the ETL processes to populate the OLAP cubes and data mining structures. It's common to underestimate the amount of work involved in cleaning the source data before it is loaded into the SSAS destination structures. The Data Profiling task helps you to understand the scope of the source-data cleanup involved in your projects. Specifically, this cleanup involves deciding which methods to use to clean up your data. Methods can include the use of advanced package transformations (such as fuzzy logic) or more staging areas (relational tables) so that fewer in-memory transformations are necessary during the transformation processes. Other considerations include the total number of tasks in a single package, or overall package size. To use the Data Profiling task, simply drag an instance of it from the control flow Toolbox to the designer surface. You can then set up either a quick profile or a regular profile using the configuration dialog boxes for the Data Profiling task. Figure 16-11 shows the Single Table Quick Profile Form options. You can open it by right-clicking the Data Profiling task, choosing Edit, and then clicking the Quick Profile button in the lower right of the resulting dialog box. Seven profiling options are available. For a more granular (advanced) property configuration, you can work in the Data Profiling Task Editor dialog box. Figure 16-12 shows this dialog box and the available configurable properties for the Column Null Ratio Profile Request profile type.
FIGURE 16-11 Data Profile task, Single Table Quick Profile Form dialog box
FIGURE 16-12 Data Profiling Task Editor dialog box
Before you start using the Data Profiling task, be aware of a few limitations. Currently it works only with source data from SQL Server 2000 or later. You can work around this limitation by staging data from other sources into SQL Server and performing the profiling on it there. We expect this limitation to change with future updates to the task and the related viewer tool. The Data Profiling task produces an XML-formatted string that you can save to a variable in an SSIS package or to a file through a file connection manager. You can view the profile output files using a new tool called Data Profile Viewer, which is located by default at %Program Files%\Microsoft SQL Server\100\DTS\Binn\DataProfileViewer.exe. Data Profile Viewer is not a stand-alone tool, and you must have SSIS installed on the computer to use it. Data Profile Viewer cannot create profiles by itself; that must be done through an SSIS package. The Data Profiling task can use only ADO.NET connection managers. You might be wondering what each of the seven profiles you see in Figure 16-11 refers to. Each profile provides a different view of the selected source data. The following list briefly describes the functions of each type of profile:
■■ Column Null Ratio Profile: Helps you to find unacceptably high numbers of missing values (nulls) in source data of any type. Finding unexpectedly high numbers of nulls could lead you to conclude that you may have to do more manual or more automatic preprocessing on source data. Another possibility is that your source data extract logic is flawed (for example, a Transact-SQL statement with multiple JOIN clauses might be producing unexpected nulls).
■■ Column Statistics Profile: Returns information about the specific values in a numeric or datetime column. In other words, it returns the minimum, maximum, average, and standard deviation of the column values. As with the Column Null Ratio Profile, this profile can help you detect the quality of the source data and identify invalid data.
■■ Column Value Distribution Profile: Produces the overall count of distinct values in a column, the list of each distinct value in a column, and the count of times each distinct value appears. This can be useful for identifying bad data in a column, such as finding more than 50 distinct values in a column that should hold only U.S. state codes.
■■ Column Length Distribution Profile: As its name indicates, this profile type shows information about the length of the data values contained in the column. It works for character data types, and returns the minimum and maximum length of data in the column. It also returns a list of each distinct length, along with the count of how many column values have that length. Exceptions can indicate the presence of invalid data.
■■ Column Pattern Profile: Generates a set of regular expressions that match the contents of the column. It returns the regular expressions as well as the number of column values that each expression applies to. This is quite powerful, and we are particularly eager
to use this profile to help us to identify bad source data. One business example is validating e-mail address pattern data.
■■ Candidate Key Profile: Shows the uniqueness of data values as a percentage (for example, 100 percent unique, 95 percent unique, and so on). It is designed to help you identify potential key columns in your source data.
■■ Functional Dependency Profile: Shows the strength of dependency of values in one column on values in another (for example, associating cities and states). Mismatches that are identified could help you pinpoint invalid data.
One additional, advanced type of profile is available: the Value Inclusion Profile Request. It can be added only through the regular (full) dialog box (in other words, not the quick profile dialog box). This type of profile is designed to help identify foreign key candidates in tables. It does this by identifying the overlap in values from columns in two tables (the subset and superset table or view in the advanced properties pane). In this same area, you can also set the threshold levels for matching. See the SQL Server Books Online topic "Value Inclusion Profile Request Options" for more information about the advanced property settings. When you perform data exploration, you often set up the package to save the profile information to a file. After you run a package that includes a Data Profiling task configured this way, an XML file is written to the package's configured file destination. You can use the new Data Profile Viewer tool to examine the XML output file. Figure 16-13 shows an example of an output file in the Data Profile Viewer tool. Again, this tool is located by default at %Program Files%\Microsoft SQL Server\100\DTS\Binn\DataProfileViewer.exe. You can see that we are looking at the output from the Column Value Distribution Profiles for the columns from the Sales.Customer table in the AdventureWorks2008 sample database. The tool shows the number of distinct column values and the frequency distribution for those values numerically, graphically, and as a percentage of the total for the selected individual column (TerritoryID).
Tip: Out of the box, the Data Profiling task only allows you to profile a single table. Here's a great blog entry about a pretty simple work-around that allows you to profile all of the tables in a particular database: http://agilebi.com/cs/blogs/jwelch/archive/2008/03/11/usingthedataprofilingtasktoprofileallthetablesinadatabase.aspx.
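When the source is already in SQL Server and you just want a quick sanity check before setting up the task, simple ad hoc queries can approximate two of these profiles. This is only a hedged sketch; the Sales.Customer table and TerritoryID column come from the AdventureWorks2008 example above, while the staging table and EmailAddress column are hypothetical.

-- Approximate a Column Null Ratio Profile for one column
SELECT
    COUNT(*) AS TotalRows,
    SUM(CASE WHEN EmailAddress IS NULL THEN 1 ELSE 0 END) AS NullRows,
    CAST(SUM(CASE WHEN EmailAddress IS NULL THEN 1 ELSE 0 END) AS float)
        / NULLIF(COUNT(*), 0) AS NullRatio
FROM dbo.SourceCustomers;        -- hypothetical staging table

-- Approximate a Column Value Distribution Profile
SELECT TerritoryID, COUNT(*) AS ValueCount
FROM Sales.Customer
GROUP BY TerritoryID
ORDER BY ValueCount DESC;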
FIGURE 16-13 Data Profile Viewer showing the Column Value Distribution Profiles for the Sales.Customer sample table
Summary
In this chapter we looked at some advanced aspects of working with SSIS packages, covering error and event handling, logging, debugging, checkpoints, and transactions. We also took a detailed look at the new Data Profiling task. We've really just begun to scratch the surface of the type of data evaluation that you'll commonly do at the beginning of every BI project. In Chapter 17, we apply all we've learned about SSIS package development to very specific BI scenarios, such as ETL for dimension and fact-table loading for OLAP cubes. We also present more tools and techniques for applying SSIS to a data warehousing project.
Chapter 17
Microsoft SQL Server 2008 Integration Services Packages in Business Intelligence Solutions
In this chapter, we turn our attention to the specific implementation details for packages. We'll look at SSIS extract, transform, and load (ETL) as it applies to OLAP cubes and data mining structures.
ETL for Business Intelligence
As mentioned earlier, you need to load two fundamentally different types of structures via SSIS ETL processes in BI projects: OLAP cubes and data mining structures. Within each of these types of SSAS objects you'll find a further top-level subset of functionality: the ETL for the initial structure load (the initial cube load, for example) and the ETL for subsequent, regular updates to these structures. These situations have different business requirements and result in different package designs. We'll tackle both of these situations by providing best practices and recommendations based on our real-world implementations. We've said this already, but remember that a primary consideration in any BI project is the determination of resource allocation for the initial ETL for the project. For all but the smallest projects, we've elected to work with ETL specialists (either internal staff or contractors) during the initial ETL preparation and actual loading phase of the project. We often find that ETL is 50 to 75 percent of the initial project work hours. We hope that the previous three chapters on SSIS have given you a window into the power and complexity of SSIS package development. We do not commonly find that SQL developers or administrators have the time to master SSAS, SSIS, and SSRS. That said, if your project is simple and your source data is relatively clean, we do occasionally find that smaller teams can implement BI solutions. However, we strongly suggest that the first place to add resources to your BI team is the initial ETL. Another consideration regarding BI ETL is the difference between the initial load of the SSAS objects and the subsequent regular incremental updates of these objects. Deciding to what degree to use SSIS for the initial load is significant. We find that we use a combination of data evaluation, cleansing, and loading techniques to perform the initial load. Generally these techniques include some SSIS packages.
SSIS really shines for the incremental updating process, which is usually done at least once daily after the cubes and mining models have been deployed to production. Package capabilities such as error and event handling, and package recoverability through the use of checkpoints and transactions, are invaluable tools for this business scenario. The most common production scenario we've used is to hire an SSIS expert to perform the heavy lifting often needed at the beginning of a project: creating the specialized packages that perform the major data validation, cleansing, and so on. We then usually develop maintenance and daily updating packages locally. We've structured the specifics in this chapter around these topic areas: initial load of cubes, further broken down into loading of dimensions and fact tables, and then the initial load of data mining structures. We'll follow with a discussion about best SSIS design practices around incremental updates for both cubes and mining models.
Loading OLAP Cubes
If you've been reading this book sequentially, you'll remember that we strongly advised in the section on SSAS OLAP cube design that you create an empty destination star schema structure for each OLAP cube. We further recommended that you base this star schema on grain statements that are directly traceable to business requirements. We assume that you've followed these recommended practices, and we base our recommendations for initial ETL around this modeling core. That said, the first consideration for the initial load is an appropriate assessment of source data quality. We cannot overemphasize the importance of this task. Too often we've been called in to fix failed or stalled BI projects, and one root cause of these situations is invariably a lack of attention to data-quality checking at the beginning of the project, or skipping it altogether! In the next section we'll share some techniques we use in SSIS to assess data quality.
Using Integration Services to Check Data Quality
In Chapter 16, "Advanced Features in Microsoft SQL Server 2008 Integration Services," you learned that SSIS includes a new control flow task, the Data Profiling task, specifically designed to help you understand the quality of data. We heartily recommend that you make full use of this task during the data-quality evaluation phase of your project. You may be wondering whether you should use anything else. Obviously, you can use Execute SQL tasks and implement SQL queries on SQL source data. However, we usually prefer to just run the SQL queries directly in the RDBMS (or on a reporting copy) rather than taking the time to develop SSIS packages for one-off situations. On the other hand, a number of transformations included with SSIS, such as the Fuzzy Grouping transformation, do work that is not easy to reproduce directly in SQL. You can also use some of the included SSIS transformations to get a statistical sampling of rows, including Percentage Sampling, Row Count, and Row Sampling.
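For those one-off checks run directly against a relational source, the queries are usually short. The following is a hedged sketch of two checks we commonly run (duplicate business keys and orphaned foreign key values); the table and column names are hypothetical.

-- Duplicate business keys in a source customer table
SELECT CustomerID, COUNT(*) AS DuplicateCount
FROM dbo.SourceCustomers
GROUP BY CustomerID
HAVING COUNT(*) > 1;

-- Orphaned sales rows that have no matching customer
SELECT s.CustomerID, COUNT(*) AS OrphanRows
FROM dbo.SourceSales AS s
LEFT JOIN dbo.SourceCustomers AS c
    ON c.CustomerID = s.CustomerID
WHERE c.CustomerID IS NULL
GROUP BY s.CustomerID;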
Assessing Data Quality with the Fuzzy Grouping Transformation
An interesting approach to assessing data quality using SSIS involves the Fuzzy Grouping transformation. This transformation is available only in the Enterprise edition of SQL Server 2008. It allows you to group together rows that have similar values. You can also use it to group exact matches, but we tend to use the fuzzy match configuration much more than the exact match at this phase of the project. We've used this type of transformation in BI projects to help us assess the quality of the source data. We find it particularly useful in scenarios that include a large number of data sources. In one situation we had 32 different data sources, some relational and some not (such as flat files). Several of the data sources included customer information. No single source was considered a master source of customers. Using fuzzy grouping allowed us to find possible matching data much more quickly than other types of cleansing processes, such as scripts. Using this transformation requires you to configure a connection manager to a SQL Server database to house the temporary tables that will be created during the execution of this transformation. On the Columns tab of the Fuzzy Grouping Transformation Editor (shown in Figure 17-1), you select the columns upon which you wish to perform grouping. You can configure additional input columns to be included in the output by selecting the Pass Through option. These columns are not used for grouping purposes, but will be copied to the output. In the bottom area, you can configure the type of grouping and other parameters related to how the column data will be grouped. For more information, see the SQL Server Books Online topic "Fuzzy Grouping Transformation Editor (Columns Tab)."
FIGURE 17-1 Fuzzy Grouping Transformation Editor, Columns tab
The configuration options on the Columns tab include several Comparison Flags, shown in Figure 17-2. You use these flags to more granularly configure the types of data considered to be possible matches. Note that you can set options such as Ignore Case, Ignore Character Width, and more. This allows you to control the behavior of the fuzzy logic tool.
FIGURE 17-2 Comparison Flags for Fuzzy Grouping transformations
In addition, you'll find an Advanced tab where you can change the default Similarity Threshold value as well as configure token delimiters such as spaces, tabs, carriage returns, and so on. The similarity threshold allows you to control the degree of similarity required for matches to be discovered. The Fuzzy Grouping transformation performs its grouping based on the configuration, and adds several new columns to the results table to reflect the grouping output. While the transformation is working, the results are stored in a temporary table. After completion, these results are included in the output from the component so that they can be used in later transformations or written to a permanent location through a destination component. You'll most often work with the output column named _score. Result rows with values closer to 1 in the _score column indicate closer matches.
Note: The Fuzzy Grouping transformation makes significant use of temporary tables, particularly with large data inputs; ensure that you have appropriate space allocated on the instance of SQL Server where this transformation is configured to store its working tables.
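If you write the transformation's output to a staging table for review, a short query can surface the candidate duplicates worth a human look. This is a hedged sketch: the staging table name is our own invention, the 0.80 threshold is arbitrary, and the _key_in, _key_out, and _score column names follow the transformation's default output columns.

-- Review candidate duplicate groups produced by Fuzzy Grouping
SELECT
    _key_out AS GroupKey,     -- identifier of the canonical row for the group
    _key_in  AS RowKey,       -- identifier of the individual input row
    CustomerName,
    _score
FROM staging.FuzzyGroupedCustomers
WHERE _score >= 0.80          -- only near-duplicates above a chosen threshold
  AND _key_in <> _key_out     -- exclude the canonical rows themselves
ORDER BY _key_out, _score DESC;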
Additional Approaches to Assessing Data Quality
Another approach you can use in assessing data quality is to quickly create a data mining structure and then use one or more of the simpler data mining algorithms, such as Microsoft Naïve Bayes or Decision Trees, to build a couple of data mining models and get a sense of source data quality by looking at associations, groupings, and outliers. Remember that if you just want a quick visual representation of the data quality, you can use the views available in the data source view object (table, PivotTable, chart, and so on) to take a look at what you are working with. We've also used the Fuzzy Lookup transformation to help us quickly translate source data from disparate sources into a combined (usually interim) working data store. We then usually perform further processing, such as validation of the lookup values discovered by the fuzzy algorithm.
Transforming Source Data
After you've done some preliminary examination of source data to assess quality, the next step is to map source data to destination locations. An important consideration at this step is to determine whether you'll use a staging database as an interim storage location as the source data goes through the various cleansing and validation processes. You should have some sense of the amount of cleansing and transformation that you need to perform after you do a thorough data-quality evaluation. Our experience has shown us time after time that it is well worth purchasing a dedicated server for this staging database. If you choose to do this, you have the added benefit of being able to execute your SSIS packages on your middle tier, rather than on any of the source systems or on your SSAS destination system. Although this dedicated server is certainly not required, we've used this configuration more often than not in real-world projects. We definitely favor creating smaller, simpler packages and storing data on disk as it flows through the ETL pipeline, rather than creating massive, complex packages that attempt to perfect the source data in one fell swoop. Interestingly, some of the sample SSIS packages available for download on CodePlex use this everything-in-one-massive-source-package design. Figure 17-3 shows a portion of the AWDWRefresh.dtsx sample package.
FIGURE 17-3 Part of a single, overly complex package shown in BIDS. (The pink triangle signifies BIDS Helper.)
Do you find the package difficult to read and understand when you look at it in BIDS? So do we, and that's our point: simple and small is preferred. Remember that the control flow Execute Package task allows you to execute a package from inside another package. This type of design is often called parent and child, and it's a design pattern that we favor.
Using a Staging Server
In most situations (whether we've chosen to use a dedicated SSIS server or not), we create one SSIS package per data source. We load all types of source data (flat files, Excel, XML, relational, and so on) into a series of staging tables in a SQL Server instance. We then perform subsequent needed processing, such as validation, cleansing, and translations, using SSIS processes. It is important to understand that the data stored on this SQL Server instance is used only as a pass-through for cleansing and transformation and should never be used for end-user queries. If you use SQL Server 2008 as a staging database, you can take advantage of several new relational features that can help you create more efficient load and update staging processes. The first of these is the new Transact-SQL MERGE statement. This allows what are known as UPSERTs: INSERTs, UPDATEs, or DELETEs all performed in the same statement, depending on the logic you write. MERGE logic is also useful for building load packages that redirect data depending on whether it is new. You can think of MERGE as an alternative to the built-in Slowly Changing Dimension (SCD) transformation for those types of business scenarios. MERGE performs a validation of existing data versus new data on load, to avoid duplicates among other issues. MERGE uses ID values to do these comparisons. Therefore, pay careful attention to using correct (and unique) ID values in data sources that you intend to merge. The following example of a MERGE statement is taken from SQL Server Books Online:

USE AdventureWorks;
GO
MERGE Production.ProductInventory AS pi
USING (SELECT ProductID, SUM(OrderQty)
       FROM Sales.SalesOrderDetail sod
       JOIN Sales.SalesOrderHeader soh
           ON sod.SalesOrderID = soh.SalesOrderID
           AND soh.OrderDate = GETDATE()
       GROUP BY ProductID) AS src (ProductID, OrderQty)
ON (pi.ProductID = src.ProductID)
WHEN MATCHED AND pi.Quantity - src.OrderQty <> 0
    THEN UPDATE SET pi.Quantity = pi.Quantity - src.OrderQty
WHEN MATCHED AND pi.Quantity - src.OrderQty = 0
    THEN DELETE;
For more information, see the SQL Server Books Online topics "Using MERGE in Integration Services Packages" and "Merge (Transact-SQL)."
Of course, SSIS is built to be a useful tool to assist in data transformation, so a primary consideration is when to use SSIS and when to use other measures, such as Transact-SQL scripts. As we mentioned in Chapter 16, SSIS shines in situations where you are cleaning up messy data and you need to trap and transform data errors and events. Key transformations that we often use to clean up data as we prepare it for load into star schema tables include the following: Aggregate, Cache Transform, Character Map, Conditional Split, Copy Column, Data Conversion, Derived Column, Export Column, Fuzzy Grouping/Lookup, Import Column, Lookup, Merge, Merge Join, OLE DB Command, Pivot/Unpivot, Script Component, Sort, and Term Extraction/Lookup.

One cleansing strategy that we've seen used is to reconcile bad source data based on confidence; in other words, perform exact matches using a Lookup transformation first in a data flow and then process error rows using less exact transformations such as Fuzzy Lookup. The Fuzzy Lookup transformation requires that the key field from the source (the input dataset) and the lookup (the reference dataset) be an exact match. All other source and lookup columns that share the same column name are matched automatically using a fuzzy comparison. You can adjust the type of fuzzy match by right-clicking the black connecting arrow on the Columns tab of the Fuzzy Lookup Transformation Editor dialog box and then clicking Edit Relationships. Exact relationships are shown with a solid black line; fuzzy relationships are shown with a dotted black line between the source and lookup columns. You can also adjust the type of fuzzy match in the Create Relationships child window that pops up after you've clicked Edit Relationships. The Comparison Flags column of the Create Relationships window selects the Ignore Case flag by default for all fuzzy matches. You can select or clear multiple other flags in this drop-down list. Flags include the following: Ignore Case, Ignore Kana Type, Ignore Nonspacing Characters, Ignore Character Width, Ignore Symbols, and Sort Punctuation As Symbols. In addition to the settings available on the Reference Table tab, you can also adjust the thresholds for (fuzzy) comparison on the Advanced tab. You can view the various settings for a Fuzzy Lookup transformation in Figures 17-4 and 17-5.

Another consideration for commonly used transformations during this phase of your project concerns potentially resource-intensive transformations such as Aggregate or Sort. As discussed in Chapter 14, "Architectural Components of Microsoft SQL Server 2008 Integration Services," these components use asynchronous outputs, which make a copy of the data in the buffer. You should test with production levels of source data to ensure that these transformations can be executed using the available memory on the server on which you plan to execute the SSIS package. In the event of insufficient memory, these transformations will page to disk and SSIS performance could become unacceptably slow. In some cases, you can do this processing outside of SSIS. For example, most RDBMSs are optimized for performing
sorts against large sets of data. You may choose to use a SQL statement to perform a sort in lieu of using the Sort transformation, as long as your source data is relational.
FIGURE 17-4 The Reference Table tab in the Fuzzy Lookup Transformation Editor
FIGURE 17-5 The Advanced tab in the Fuzzy Lookup Transformation Editor
Along these same lines, you'll often use either the Merge or the Merge Join transformation when transforming and loading star schema tables. You can presort the data in the staging database prior to performing the merge-type transformation to reduce overhead on your SSIS server during package execution.
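A minimal sketch of the presorting idea follows; the staging table and column names are illustrative. The ORDER BY does the sort work in the database, and you would then typically mark the source's output as sorted (the IsSorted and SortKeyPosition properties in the source's Advanced Editor) so that Merge Join accepts it without an SSIS Sort transformation.

-- Source query for an OLE DB source feeding a Merge Join,
-- presorted on the join key in the staging database
SELECT CustomerBusinessKey, CustomerName, City, StateProvince
FROM staging.CustomerStage
ORDER BY CustomerBusinessKey;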
You can monitor overhead using Windows Server 2008 Performance Monitor counters from these areas: SQL Server:Buffer Manager, SQL Server:Buffer Node, SQL Server:Buffer Partition, and SQL Server:SSIS Pipeline 10.0. The collection of performance counters named Buffer* is related to the particular SQL Server instance; the SSIS Pipeline counters are related to the SSIS data flow engine. Buffers Spooled is a particularly useful counter for understanding performance overhead. This counter indicates how many buffers are being saved to disk temporarily because of memory pressure. Figure 17-6 shows the Add Counters dialog box, in which you configure what you want to monitor. This configuration is made at the operating system level. To open the Add Counters dialog box, right-click My Computer and then click Manage. Click Diagnostics, click Reliability And Performance, click Monitoring Tools, and then click Performance Monitor. From there, click the large green plus sign (+).
FIGURE 17-6 The Add Counters dialog box, where you monitor SSIS overhead
Another way to improve package execution for packages that use resource-intensive transformations such as Split, Lookup, or Multicast is to increase the number of cores (processors) on your server. A performance enhancement in SQL Server 2008 SSIS is to use all available processors. In SQL Server 2005, if your data flow had only a series of synchronous components connected together, the data flow engine would use only a single thread of execution, which limited it to a single logical processor. In SQL Server 2008, this has been changed so that multiple logical processors are fully utilized, regardless of the layout of the data flow.
Data Lineage

As you begin to plan for the transformation process, your business requirements may include the need to include data lineage, or extraction history. You may need to track lineage on fact tables only, or you may also need to track lineage on some or all of the dimension tables. If this is a requirement for your project, it is important that you accurately capture the requirements and then model your data early on in your project—in other words, from the point of the first staging tables.

Note LineageID is an internal value that you can retrieve programmatically. For more information, see http://blogs.msdn.com/helloworld/archive/2008/08/01/how-to-find-out-which-column-caused-ssis-to-fail.aspx.
You can handle lineage tracking in a number of different ways, from the simplest scenario of just capturing the package ExecutionInstanceGUID and StartTime system variables to track the unique instance and start time of each package, to much more complex logging. The versatility of SSIS allows you to accommodate nearly any lineage scenario. In particular, the Audit transformation allows you to add columns based on a number of system variables to your data flow. Figure 17-7 shows the dialog box that allows you to select the column metadata that you need.
Figure 17-7 Audit transformation metadata output columns
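As a minimal illustration of the simple end of this spectrum (the table and column names here are hypothetical), a staging fact table might carry just two lineage columns that the package populates from the ExecutionInstanceGUID and StartTime system variables, for example through the Audit or Derived Column transformation:

    -- Hypothetical staging fact table with basic lineage columns
    CREATE TABLE stg.FactSalesStage
    (
        SalesOrderID        int              NOT NULL,
        OrderDateKey        int              NOT NULL,
        SalesAmount         money            NOT NULL,
        LoadExecutionGUID   uniqueidentifier NOT NULL,  -- from ExecutionInstanceGUID
        LoadStartTime       datetime         NOT NULL   -- from StartTime
    );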
If you’re using SQL Server 2008 as a staging database, another capability you may wish to utilize for tracking lineage is change data capture (CDC). When you enable this capability on a database and one or more of its tables, SQL Server records details about data changes (inserts, updates, deletes) from the transaction log into CDC tables and exposes this data via table-valued functions. You can also use change data capture to facilitate incremental updates to your cubes; we’ll talk more about this later in the chapter.
Tip We often use a simple mechanism that we call a data flow map to track data flow through the planning into the transformation server setup phase of our projects. Using Microsoft Office Excel or something similar, we track source data information such as physical location; load availability times; connection information; and database table names, column names, and data types (for relational sources). We then map this information to staging tables, where we name the transformations that need to be performed, and then map this data at the column level to the star schema destination locations.
Moving to Star Schema Loading

Now that you've validated, extracted, transformed, and staged your source data, you are ready to load this data into your destination star schema and from there into your OLAP cubes. If you haven't done so already, it's time to update and verify your package documentation standards. By this we mean that you should be using standardized and meaningful names for packages, tasks, transformations, and so on. Much like commenting .NET Framework source code, properly documenting SSIS packages is the mark of a mature developer. In fact, when we interview developers as contractors on projects, we often ask to review a couple of their past packages to get a sense of their method (and discipline) of working in the SSIS world.

In this section we again assume that you are working with star schemas that have been modeled according to the best practices discussed in earlier sections of this book. To that end, when you load your star schemas you will first load dimension tables. After a successful dimension table load, you will proceed to load fact tables. We've seen various approaches to these two tasks. To be consistent with our "smaller is better" theme, we favor creating more, simpler packages—in this case at least one per dimension table destination and one per fact table. Although you could put all of the logic to load a star schema in one package—even to the point of loading the OLAP cube from the star schema—we advocate against huge transactional packages, mostly because of their complexity, which we find leads to subtle bugs, maintenance challenges, and so on.
Loading Dimension Tables

Generally, preparing source data for load into dimension tables is the larger of the two types of initial load ETL tasks when building OLAP cubes. This is because the original source data often originates from multiple source tables and is usually destined for very few destination tables, and possibly only one. We'll illustrate this with a database diagram (Figure 17-8) showing the potential source tables for an Employee dimension that could be based on original AdventureWorks (OLTP) source tables. Here we show six source tables. (Remember that this example is oversimplified.)
Figure 17-8 Six source tables related to employee information
We have seen dimensions with 10 to 20 data sources in many projects. Often these data sources are not all relational (flat file, Excel, and so on). Also, we find that dimensional source data is often dirtier than fact table data. If you think of dimension source tables around Employees and compare them with Sales data source tables, you'll probably understand why it is not uncommon to find much more invalid data (because of typing errors, NULL values, and values that fail to match patterns or fall within valid ranges) in dimension source tables than in fact tables. Businesses tend to make sure that their financials (such as sales amount and sales quantity) are correct and sometimes have less rigid data-quality requirements for other data.

When you load star schemas that you have modeled according to the guidelines we introduced in earlier chapters, you always load dimension tables before attempting to load related fact tables. We'll remind you of the reason for this: we favor building dimension tables with newly generated unique key values—values that are generated on load of the table data. These new values become the primary keys for the dimension table data because they are guaranteed to be unique, unlike the original keys, which could originate from multiple source systems and have overlapping values. These new keys are then used as the foreign key values for the fact table rows. Therefore, the dimensions must first be successfully loaded, and then the newly generated dimension primary keys must be retrieved via query and loaded into the related fact tables.
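To make the surrogate-key pattern concrete, here is a minimal Transact-SQL sketch with hypothetical table and column names: the dimension receives an IDENTITY surrogate key at load time, and the fact load later joins back on the original business key to pick up that surrogate key.

    -- Dimension table with a newly generated surrogate key
    CREATE TABLE dw.DimCustomer
    (
        CustomerKey         int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- new surrogate key
        CustomerBusinessKey nvarchar(20)      NOT NULL,              -- original source key
        CustomerName        nvarchar(100)     NOT NULL
    );

    -- During the fact load, look up the surrogate key by business key
    -- (assumes the staged fact rows carry the source customer key)
    SELECT  f.SalesOrderID,
            f.SalesAmount,
            d.CustomerKey
    FROM    stg.FactSalesStage AS f
    JOIN    dw.DimCustomer     AS d
            ON d.CustomerBusinessKey = f.CustomerBusinessKey;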
Loading Fact Tables

It is common to use a fast load technique to quickly load initial data into fact tables, particularly if you've used staging tables as a temporary holding area for cleansed data. Fast load is an option available on some of the data destination components, such as the OLE DB destination. The SQL Server destination uses bulk load with a shared memory provider behind the scenes and is often faster than the OLE DB destination with fast load enabled, so you can think of the SQL Server destination as an alternative to the OLE DB destination. However, the shared memory provider that gives the SQL Server destination its speed also means that you must execute packages that use it on the destination SQL Server. If you need to run packages on a dedicated ETL server, the OLE DB destination is a better choice.

As its name indicates, fast load is a more efficient method of flowing data from the pipeline into a named destination. To use fast load, several configuration options are required. Figure 17-9 shows the OLE DB Destination Editor dialog box, where you configure the various destination load options, including fast load.
Figure 17-9 Fast load is an efficient mechanism for loading fact tables.
Another concern regarding fact table ETL is that you must refrain from adding extraneous columns to any fact table. “Requirement creep” often occurs at this phase of the project, with clients asking for this or that to be added. Be diligent in matching fact table columns to business requirements. Fact tables can run to the tens or hundreds of millions or even billions of rows. Adding even one column can impact storage space needed (and ETL processing overhead) significantly.
Of course, fast load is not just for fact tables. You can use it for dimension loads just as easily and with the same performance benefits. In our real-world experience, however, fact tables generally contain significantly more data than dimension tables, so we tend to use fast load primarily when working with fact tables.

To better understand using SSIS to load dimension and fact tables, we'll take a closer look at one of the sample SSIS packages that are part of the group available for download from CodePlex (as described in Chapter 15, "Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio"). Open the sample folder named LookupSample. Notice that unlike the other samples in this group, this folder contains only the package and does not contain the .sln file. To open this package in BIDS, create a new Integration Services project and then add an existing item by right-clicking the SSIS Packages folder in the Solution Explorer window to add the existing sample SSIS package named LookupSample.dtsx.

Notice that the control flow for this package contains two Data Flow tasks named DFT Load Lookup Cache and DFT Load Fact Table. If you click the Data Flow tab of the BIDS designer and then select the DFT Load Lookup Cache Data Flow task, you'll see that it contains the three components shown in Figure 17-10: a flat file source named DimTime Source, a Derived Column transformation named Derived Column, and a Cache Transform transformation named Cache Transform.
Figure 17-10 The DFT Load Lookup Cache data flow
Further examination of the three items in the data flow in Figure 17-10 reveals the following activities:
■ Source time dimension members are loaded from a flat file.
■ A new unique key is created by deriving two new columns—one for the ID and one for the name—each by using an expression in the Derived Column transformation.
■ A cache object is populated with this source data for subsequent use in the package.
Next we’ll look in more detail at the second data flow, which is named DFT Load Fact Table and shown in Figure 17-11. It contains a flat file data flow source, a Lookup transformation that uses the Cache object created in the previous data flow, and three different branches depending on whether a match was found in the lookup cache. It also includes error output components, which are indicated with a red arrow and the text Lookup Error Output.
Figure 17-11 The DFT Load Fact Table data flow
The data flow tasks in the LookupSample.dtsx package are common design patterns for dimension and fact table loads. They are simple, which is what we prefer. We sometimes also add more sophisticated error handling in situations where source data is dirtier (which we may have discovered earlier in the project by using techniques such as data profiling).

After you've successfully loaded dimension and fact tables with the initial data load, you'll turn your attention to regular update ETL. The most common case that we implement is a once-a-day update, usually done at night. However, some businesses need to update more frequently. The most frequent updating we've implemented is hourly, but we are aware of near-real-time BI projects. If that is your requirement, you may wish to reread the section in Chapter 9, "Processing Cubes and Dimensions," on configuring OLAP cube proactive caching. SSIS supports all updating that is built into SSAS. We reviewed storage and aggregation options in Chapter 9.
Updates

Remember the definition of ETL for OLAP cubes: for cubes (fact tables), adding new facts (rows to the fact table) is considered an update, and the cube will remain online and available for end-user queries while the new rows and aggregations (if any have been defined on the cube partition) are being added to the cube. If cube data (fact table row values) needs to be changed or deleted, that is a different type of operation.

As we saw in Chapter 6, "Understanding SSAS in SSMS and SQL Server Profiler," SSIS includes a control flow task, the Analysis Services Processing task, that incorporates all possible types of cube (measure group or even individual partition) and dimension update options. Figure 17-12 shows the task's main configuration dialog box.
Figure 17-12 SSIS supports all types of measure group and dimension processing.
As mentioned earlier, in the section of this chapter on staging servers, if you are using a staging database built on SQL Server 2008, you can use the new change data capture feature in conjunction with SSIS packages to facilitate the updating of your fact and dimension tables. From a mechanical standpoint, to configure the SSIS task to perform the appropriate action (insert, update, or delete), you'll query the __$operation column from the change data capture change table. This column records activity on the table(s) enabled for change data capture, read from the transaction log, using the following values: 1 for deleted data, 2 for inserted data, 3 and 4 for updated data (3 is the value before the update and 4 is the value after the update), or 5 for merged data. Figure 17-13 (from SQL Server Books Online) illustrates integrating CDC with SSIS at a conceptual level. You then use those queried values as variables in your SSIS package to base your action on—for example, if the value is 2 (inserted) for source dimension data, process that dimension with a Process Update type of update in an Analysis Services processing control flow task; if the value is 1, process with a Process Full type of update, and so on.
Figure 17-13 The CDC process in SQL Server 2008: a capture process reads the transaction log of the OLTP source tables, populates the change tables, and exposes the changes through change data capture query functions for extraction, transformation, and loading into the data warehouse
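As a hedged illustration of the query side (the table name DimCustomerStage and its capture instance dbo_DimCustomerStage are the hypothetical ones from the earlier CDC example), you can read the changed rows and their __$operation values with the CDC table-valued functions:

    -- Read all changes captured for the dbo_DimCustomerStage capture instance
    DECLARE @from_lsn binary(10), @to_lsn binary(10);

    SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_DimCustomerStage');
    SET @to_lsn   = sys.fn_cdc_get_max_lsn();

    SELECT  __$operation,        -- 1 = delete, 2 = insert, 3/4 = update (before/after)
            CustomerBusinessKey,
            CustomerName
    FROM    cdc.fn_cdc_get_all_changes_dbo_DimCustomerStage(@from_lsn, @to_lsn, N'all update old');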
Fact Table Updates

We often find much confusion and unnecessary complexity in SSIS packages resulting from a lack of understanding of what types of changes constitute updates. You must, of course, follow your business requirements. The most typical case is that changing or deleting facts is either an error or an exception, so you first must verify exactly how these types of updates are to be handled. It is important to make stakeholders aware that changing or deleting facts will result in the need to fully reprocess all fact data in that particular partition, or in the entire cube (if the cube contains only a single partition). Knowing that, we've often found that an acceptable solution is to create some sort of exception table, or to use writeback, particularly now that SQL Server 2008 supports writeback in MOLAP partitions.

As with initial fact table loads, we recommend that you use fast load to perform incremental updates to fact tables, as long as you've already validated and cleansed source data and stored it in an intermediate staging area. Fast loading new data values to the staging table or tables, which are then pushed (or updated) to the SSAS cube as incremental updates, is ideal because it causes minimal disruption in cube availability for end users.
Cube Partitioning

As we mentioned in Chapter 9, appropriate cube partitioning is important for efficient maintenance. The Analysis Services Processing task that you use to perform updates to an SSAS cube "understands" partitions, so if you've chosen to use them, you can incorporate partitioned updates into your SSIS strategy. From a practical perspective, smaller cubes—those with fewer than 1,000,000,000 fact table rows—will probably have acceptable processing and query times with a single partition. However, when cubes grow, partitioning is often needed to reduce processing times.
Dimension Table Updates

Complexity around dimension table updating is a common bottleneck in SSAS projects. Remember that, as with fact table rows, adding new rows to a dimension table is considered an incremental update. Also, as with fact table rows, changing or deleting dimension table rows can result in the need to fully reprocess the dimension and any cubes that reference that particular dimension. To this end, modeling dimension tables to support update business requirements is quite important. Remember the four types of update (change/delete) behaviors:
■ No Changes Allowed  Any request to change data is treated as an error.
■ Overwrite  The last change overwrites the existing value, and the history is lost.
■ Keep History As Row Versions  Write a new row in the dimension table for history, add a date/time stamp, mark the newest value as active, mark older values as inactive, and mark deletes as inactive.
■ Keep History As Column Versions  Write the original column value to a history column. Use this kind of update only if you expect to store a limited amount of history for a small number of columns.
SSIS contains the Slowly Changing Dimension transformation, a component that supports simplified creation of ETL around some of these dimension updating requirements. We looked at this component in Chapter 5, "Logical OLAP Design Concepts for Architects." We like to use this transformation in simple scenarios because it's easy, fast, and somewhat flexible via the configuration built into the wizard you run when you set up the transformation. However, the limitations of this transformation are important to understand. Most are related to scalability: lookup tables associated with the SCD transformation are not cached, and all updates are row-based, which means that locking can also become an issue. In higher-volume dimension update scenarios we prefer to use standard Lookup transformations or, if using staging locations, direct fact load (as described earlier in the section "Loading Fact Tables").
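For those higher-volume scenarios, one set-based alternative (sketched here with the hypothetical dw.DimCustomer and stg.DimCustomerStage tables, extended with assumed IsActive, ValidFrom, and ValidTo columns; this is an illustration, not the authors' packaged solution) is to stage the incoming dimension rows and apply a single Transact-SQL MERGE that implements the keep-history-as-row-versions behavior: changed members are expired and re-inserted as new, current rows.

    -- Apply keep-history-as-row-versions (Type 2) changes with composable DML (SQL Server 2008)
    INSERT INTO dw.DimCustomer
        (CustomerBusinessKey, CustomerName, IsActive, ValidFrom)
    SELECT CustomerBusinessKey, CustomerName, 1, GETDATE()
    FROM
    (
        MERGE dw.DimCustomer AS tgt
        USING stg.DimCustomerStage AS src
            ON  tgt.CustomerBusinessKey = src.CustomerBusinessKey
            AND tgt.IsActive = 1
        WHEN NOT MATCHED BY TARGET THEN                               -- brand-new member
            INSERT (CustomerBusinessKey, CustomerName, IsActive, ValidFrom)
            VALUES (src.CustomerBusinessKey, src.CustomerName, 1, GETDATE())
        WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN    -- changed member: expire old row
            UPDATE SET IsActive = 0, ValidTo = GETDATE()
        OUTPUT $action AS MergeAction,
               src.CustomerBusinessKey AS CustomerBusinessKey,
               src.CustomerName        AS CustomerName
    ) AS changed
    WHERE changed.MergeAction = 'UPDATE';                             -- re-insert new version of changed members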
ETL for Data Mining

Preparing your source data for load into data mining models involves some of the same considerations that you had with data destined for OLAP cubes. Specifically, you'll want to quality check as best you can. We find two types of business scenarios when loading data mining structures. In some cases, a client prefers to create OLAP cubes first and then use the cleansed data in the star schema or in the cubes themselves as source data for new data mining models. This approach reduces the need for complex ETL specific to the data mining load, because the source data has already been cleansed during the OLAP cube preparation process. However, sometimes we encounter the opposite situation: a client has a huge amount of data that contains potentially large amounts of dirty data and therefore prefers to start with data mining rather than OLAP cubes, because the messiness of the data may render it nearly unusable for OLAP queries. We alluded to this latter scenario earlier in this chapter when we briefly discussed using data mining as a quality check mechanism.
Initial Loading

Regardless of whether you start with relational or multidimensional data as source data, you'll still want to use SSIS to perform quality checks on this source data. If your source data originates from OLAP cubes, your quality check process can be quite abbreviated. Quality checks can include general data sampling to validate that data is within expected values or ranges, data type verification, data length verification, and any others that you identified during your requirements phases and initial data discovery.

After you complete whatever checking you wish to do, you need to prepare your source data for initial load into your mining structures and models. We covered data requirements by algorithm type fairly completely in Chapter 12, "Understanding Data Mining Structures," and you may want to review that information as you begin to create the ETL processes for load. Remember that various algorithms have requirements around uniqueness (key columns), data types (for example, time series requires a temporal data type column as input), and usage. Usage attributes include input, predictable, and so on. You can use profiling techniques to identify candidate columns for usage types. You can use the included data mining algorithms as part of your SSIS package, using the Data Mining Query task from the control flow or the Data Mining Query transformation from the data flow. You might also choose to use the dedicated Data Profiling task from the control flow.

Remember also that model validation is a normal part of initial model load. You'll want to consider which validation techniques you'll use, such as lift chart, profit chart, or cross-validation, for your particular project.

Tip A new feature in SQL Server 2008 allows you to create more compact mining models. We covered this in Chapter 13, "Implementing Data Mining Structures," and you should remember that you can now drill through to any column in the data mining structure. Your models can be more efficient to process and query because they will include only the columns needed to process the model.
Another consideration to keep in mind is that the initial model creation process is often much more iterative than that of loading OLAP cubes. Mining model algorithms are implemented, validated, and then tuned by adding or removing input columns and/or filters. Because of this iteration involved in data mining, we tend to use SSIS packages less during initial loading and more after initial models have been created. We then use SSIS packages to automate ongoing model training and also sometimes to automate regularly occurring DMX queries.
Model Training

Although you will use SSIS packages to automate regular model training, you'll probably use SSIS packages more frequently to automate regular predictive queries using new inputs. We'll discuss the latter in the next section. To automate ongoing, additional training of data mining models, you can use the Data Mining Model Training destination from the Toolbox in SSIS. To use this component, you must provide it with an input that includes at least one column that can be mapped to each destination column in the selected data mining model. You can view this mapping in Figure 17-14.
Note Model overtraining is a general concern in the world of data mining. Broadly, overtraining means that too much source data can distort or render inaccurate the resultant newly trained mining model. If your business scenario calls for progressive retraining with additional data, it is very important to perform regular model validation along with model retraining. You can do this by creating an SSIS package that includes an Analysis Services Execute DDL task. This type of task allows you to execute XMLA against the mining model. You can use this to programmatically call the various validation methodologies and then execute subsequent package logic based on threshold results. For example, if new training results in a model with a lower validation score, roll back to a previous model. You would, of course, need to include SSIS task transaction logic in your package for this technique to be effective.
Figure 17-14 Data Mining Model Training Editor, Columns tab
Although you may use SSIS packages to automate regular mining model training, we have found a more common business scenario to be automating regularly occurring DMX queries. We’ll cover that scenario next.
Data Mining Queries

If you would like to include DMX query execution in an SSIS package, you have two options to select from: the Data Mining Query control flow task or the Data Mining Query transformation component.
The Data Mining Query Task Editor allows you to incorporate a DMX query into an SSIS package control flow. To configure this task, you'll work with three main tabs: Mining Model, Query, and Output. On the Mining Model tab, you configure your connection to an SSAS instance and then select the mining structure and mining model of interest. Next, you'll use the Build Query button on the Query tab to create the DMX query. Just as when you work in BIDS or SSMS to create DMX queries, the query builder opens first to the visual DMX designer; it also offers an advanced view so that you can type a query directly into the window if desired.

As you continue to work on the Query tab, you'll notice that you can use the Parameter Mapping tab to map package variables to parameters in the query. On the Result Set tab, you can choose to store the results of the query in an SSIS variable. Finally, you can use the Output tab to configure an external destination for the task's output. If you stored the results of the query in a variable or an external destination, you can use those results from other tasks in the control flow. Figure 17-15 shows the Data Mining Query Task Editor dialog box.
Figure 17-15 The Data Mining Query Task Editor
Next, let's review the workings of the Data Mining Query transformation. This transformation allows you to take data columns from an existing data flow as input to a data mining query (such as a prediction query). Working with this component is similar to working with the Data Mining Query task in that you need to select an instance of SSAS to connect to, along with the mining structure and model you want to use. You will also create a DMX query in the configuration dialog box that is part of this component. The key difference between the task and the transformation is shown in Figure 17-16, which shows that the Input Columns originate from the particular section of the data flow that has been connected to this transformation.
Figure 17-16 Configuring Input Columns for the Data Mining Query transformation
Now that you've seen the mechanics of both of these components, you may be wondering about possible business scenarios for their use. One scenario we've employed is using the Data Mining Query transformation to predict the likelihood of a new row being valid. We've done this by running a DMX prediction query against a model of valid data. Another interesting implementation revolves around incorporating predictive logic into a custom application. At http://www.sqlserverdatamining.com, you can find an example of doing just this in the sample Movie Click application. However, in this particular sample, the approach taken was direct access to the data mining API rather than calling an SSIS package.
Summary

In this chapter we focused on high-level guidance and best practices rather than mechanics of implementation for ETL specific to BI projects. We discussed tips for data evaluation and went on to take a look at guidance around OLAP cube initial load and incremental updates. We differentiated between fact and dimension table loading, followed by a look at loading, updating, and querying data mining models. We have still more to cover: In Chapter 18, "Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services," we're going to look first at ETL package configurations, maintenance, and deployment best practices. In Chapter 19, "Extending and Integrating SQL Server 2008 Integration Services," we'll talk about using the script components to programmatically extend the functionality of SSIS.
Chapter 18
Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services

After you finish developing your packages, you have to move them to the production server, where they are typically executed on a scheduled basis using SQL Server Agent job steps or by direct invocation through the DTExec tool. In any case, you have to make sure that your packages will be executed correctly, taking into consideration that the production server might have different resources than your development environment (for example, different drive letters) and also that the security context under which packages will be executed might be completely different from the one that you used in development. In this chapter, we present best practices for these processes. We also consider the need to protect your work from mistakes or changing business requirements, using a complete infrastructure to manage different versions of packages.
Solution and Project Structures in Integration Services

As you've seen already, Business Intelligence Development Studio (BIDS) organizes files into groupings of solutions and projects. A SQL Server 2008 Integration Services (SSIS) project contains all the files needed to create at least one specific extract, transform, and load (ETL) package. By default, an Integration Services project stores and groups the files that are related to the package, such as *.dtsx and more. A solution is a container of one or more projects; it's useful when you're working on big projects that are better handled when broken down into smaller pieces. Keep in mind also that a solution can contain projects of different types.

For a BI solution, you can create a single solution where you'll be able to store the data warehouse database project, the ETL project (packages) developed with SSIS, the OLAP structures created with SQL Server Analysis Services (SSAS, with its own XMLA, MDX, or DMX script files), and also the reporting project (RDL files) developed with SQL Server Reporting Services (SSRS). Although it's possible to use a single solution file with multiple project groupings for your complete BI solution, in practice most projects are too complex for this type of storage to be practical. As a best practice, we prefer to have at least one solution for each technology we choose to use—for example, one solution for SSIS, one for SSAS, and one for SSRS. Also, we typically create a unique solution container for each core business requirement. For example, a solution that performs the initial load ETL process to populate a star schema from a core source database would be separate from an SSIS solution that handles ETL processes to integrate data with external companies.

As with development in general, failure to use adequate code organization makes a BI solution difficult to manage and deploy. As we mentioned earlier, you should keep your organization of packages as simple as is practical. If you are in doubt, favor separation of business processes, such as loading individual dimension (or fact) tables in individual SSIS packages.
Source Code Control

Development of solutions based on SSIS might require you to write a lot of packages, and you might also find yourself working in teams with other people. In any professional development environment, the ability to avoid the loss of any single line of code—in this case, XML code, because packages are really XML files—should be guaranteed. In addition, the development environment should allow you to work on the same solution that your colleagues are working on, without creating overlapping changes that might cause you to lose your work. This can happen, for example, if you and a colleague work on the same package simultaneously: the last person to save the package will overwrite the other's changes. As a last point, the development environment should enable you to recover old versions of packages, such as a package developed one or more months ago. You might need to do this because a feature that has been removed from your ETL logic is needed again or because changes to a package saved earlier have broken some ETL logic and the earlier version needs to be recovered.

Generally, in your working environment you should have a central repository of code that acts as a vault where code can be stored safely, keeping all the versions you've created, and from where it can be reclaimed when needed. You need a source code control system. One of the Microsoft tools you can use to implement this solution is Visual SourceSafe (VSS). VSS is designed to integrate directly with the Visual Studio (and BIDS) integrated development environments. Another option is Team Foundation Server, which includes version control and other features, such as work item management and defect tracking. After you have your source code control environment set up, you and your team will store your solution files in a central code repository managed by that source code control system.

Tip Even though we're focusing on using VSS as a source control mechanism for SSIS package files—that is, *.dtsx files—it can also be used for other types of files that you'll be developing in a BI project. They can include XMLA, MDX, DMX, and more.
Using Visual SourceSafe

If you decide to use VSS, you need to verify that after the installation a Visual SourceSafe database is available for use. To do that, you can just start the Visual SourceSafe client from the Start menu. If no Visual SourceSafe database has been created, a wizard is available to help you create one: the Add SourceSafe Database Wizard. A database for VSS is actually just a specific folder in the file system. To create a new database for VSS, you open the wizard by using the VSS Administrator (or Explorer) and accessing the Open SourceSafe Database menu option and then clicking Add. On the next page of the wizard, if you're creating a new VSS database, enter the path and folder name where you'd like the new VSS database to be created. It's a best practice to put the VSS database in a network share so that all the people on your BI development team will be able to connect to that database from the shared network location.

After you create the new VSS database, you need to specify the name of that database. After you complete this step and click Next, the Team Version Control Model page appears, which allows you to define how Visual SourceSafe will work, as shown in Figure 18-1. You can select the Lock-Modify-Unlock Model option or the Copy-Modify-Merge Model option. We recommend that you select the Lock-Modify-Unlock Model option.
Figure 18-1 Team Version Control Model page of the Add SourceSafe Database Wizard
If your VSS database is configured to use the Lock-Modify-Unlock model, only one person at a time can work on a package, which helps you to avoid any kind of conflict. The Copy-Modify-Merge model allows more than one person to work on the same file, allowing people to resolve conflicts that might happen if and when they modify the same line of code. We feel that the latter model is more appropriate for traditional application development than for SSIS package development. The reason for this is that although SSIS packages can contain large amounts of functionality, you should favor creating smaller, more compact packages rather than large and complex ones (as we've mentioned elsewhere in this book). Of course, there might be exceptions to this general guideline; however, we'll reiterate our general development guideline: keep it as simple as is practical. We've seen many SSIS packages that were overly complex, probably because the developer was working based on her experience with traditional application (that is, class) development.

Tip Merging SSIS package versions using VSS's merge capability doesn't work very well with the SSIS XML structure, because it's easy for even minor layout changes in the designer to result in major differences between the versions. This is another reason we prefer the Lock-Modify-Unlock model.
After you’ve created and configured the VSS database, you also have to create the appropriate number of VSS user accounts for you and anyone else on your BI development team who’ll need to access the source code. VSS doesn’t use Windows Authentication, so you’ll have to create the accounts manually or through the VSS application programming interface (API). You can administer user accounts and their account permissions with the Visual SourceSafe Administrator tool (shown in Figure 18-2), which you can find under the Microsoft Visual SourceSafe, Microsoft Visual SourceSafe Administration menu item available from the Start menu.
Figure 18-2 Managing VSS users with the Visual SourceSafe Administrator tool
After you’ve completed these setup steps, you can start to use VSS from BIDS. If you’re creating a new solution, you can simply tell BIDS that you want to store solution files in Visual SourceSafe right from the beginning. To do this, just select the Add To Source Control option in the New Project dialog box. Alternatively, if you have an existing solution that you want to add to the source code control system, you can do it by simply selecting the Add Solution To Source Control item on the solution’s shortcut menu from the Solution Explorer window as shown in Figure 18-3.
Figure 18-3 The solution's shortcut menu after having installed a source code control system
If Add Solution To Source Control doesn’t appear on the shortcut menu after you right-click on the solution name in the Solution Explorer window, you might need to configure BIDS to work with the installed VSS system manually. You can do that through the Options dialog box in BIDS as shown in Figure 18-4.
Figure 18-4 You can configure which source code control system to use from the BIDS Options dialog box.
After you select the Add Solution To Source Control item, you’re asked to authenticate yourself using Visual SourceSafe credentials, which include your VSS user name, VSS password, and the VSS database name.
After doing that, you need to specify where you want to save the project inside the VSS database. As we mentioned, that database is very similar to a file system, so you have to provide the path for your project. This path will be used inside the VSS database to separate your project files from others’ project files. The folder location is preceded in the dialog box by a dollar sign ($). This is just the naming convention used by VSS. After clicking OK, your project will be stored in the location you’ve specified in the VSS database. From here on, all the project files visible in the Solution Explorer window will have a little icon on their left that indicates their status with respect to VSS. For example, the small lock icon shown in Figure 18-5 next to the SSIS package file named ProcessXMLData.dtsx indicates that the file is not being edited by anyone and it’s safely stored on the server where the VSS database resides.
Figure 18-5 Element status icons are visible when a solution is under source control.
A red check mark indicates that you’re currently editing the file, and no one else will be able to edit it. This status is called Checked Out. VSS requires that you check out files before you can start to edit them. You can check out a file by right-clicking the file name in Solution Explorer and then clicking Check Out For Edit, as shown in Figure 18-6.
Figure 18-6 The specific source code control menu items available on the shortcut menu
After you successfully check out a file, the source code control system makes a local copy of that file on your computer—by default in your project folder, which is typically placed inside My Documents\Visual Studio 2008\Projects. All the changes you make to that file remain on your computer only, until you’re satisfied with your work and you want to save it back into the central database, which will create a new version and make the updated version available to the rest of your BI development team. The operation that updates the source code control database is called Check In and is available on the shortcut menu of the checked-out file, as shown in Figure 18-7.
Figure 18-7 The solution's shortcut menu after having checked out a file
As mentioned, the Check In operation copies your local file to the source code control database, but it also keeps a copy of the original file. In this way, you have all the versions of the package available if you ever need to restore a previous version of that file. One method to revert to a previous version of a file is to use the View History item, which is available from the shortcut menu of the file for which you want to restore an old version. Choosing the View History item opens the History dialog box shown in Figure 18-8. You can select (Get) any previous version from this dialog box. Not only can you restore a file to a previous version, but you can also perform any of the operations shown in Figure 18-8—for example, Diff (which allows you to compare two versions).

As you can see, there is a lot of functionality available in VSS. You can do many more things with a source code control system than the activities introduced in this chapter, which aims to give you just an operational overview of such a system. If we've piqued your interest in using a source code control system, you can learn more by reading the help files of the source code control system you've selected. MSDN contains some useful basic documentation on VSS at http://msdn.microsoft.com/en-us/library/ms181038(VS.80).aspx.
Figure 18-8 The VSS History dialog box, where you can view and get all previous versions of a selected file
Note Other source control systems, such as Visual Studio Team System (VSTS) and Team Foundation Server are available. Although a discussion of the functionality of VSTS is outside the scope of this book, we encourage you to explore whatever source control systems you have available in your particular development environment. At a minimum, you and your team must select one common system and use it. There are too many parts and pieces (literally files) in a BI project to consider proceeding without some kind of standardized source control. For an introduction to VSTS, see the MSDN documentation at http://msdn.microsoft.com/en-us/library/ fda2bad5.aspx.
The Deployment Challenge

After you've developed, debugged, and tested your ETL SSIS packages, you will want to deploy your packages to production or to preproduction servers—the latter if your project is being implemented in a more complex (enterprise) environment.

Note If you used SQL Server 2000 Data Transformation Services (DTS), SQL Server 2008 SSIS deployment will be a new process for you. With SQL Server 2000 DTS, packages were usually created on the same server where they would be executed, thus eliminating the need to deploy them. Of course, you might have used a server for development and another server for production and therefore had to move DTS packages from one to the other when you finished development. However, the concept with SQL Server 2000 is that packages were directly saved on the server where they were developed. To do that, you used a specific application—the DTS Designer—to create those packages.
From SQL Server 2005 and on, package development has moved from Enterprise Manager to BIDS, and it has changed mechanically as well. In this section, we’ll cover deployment options available for SSIS 2008 packages.
All package development is initiated from the local developer's machine, where you have installed BIDS. All the packages in your project are initially saved as files with the .dtsx extension on your local machine (or to a shared location if you're using VSS). So, at the end of development, you have to move (deploy) these packages to the target production or preproduction server or servers so that they can be scheduled and executed through the SSIS runtime without the need to run them from BIDS.

You know from reading the last chapter that there are two fundamentally different types of ETL processes associated with BI solutions: initial structure load, and incremental update of these cubes and mining models. Depending on the scope of the project, you'll sometimes find yourself running the initial load ETL package directly from your developer machine. For larger projects, this won't be the case. Invariably, the follow-on incremental update ETL packages for both the dimension and fact tables will be run from a production server and will be regularly scheduled by a database administrator.

How does the deployment process of a package work? Because you're handling files you want to deploy, everything is pretty straightforward. However, before starting the deployment you first need to decide where on the target server you want to deploy your packages. There are three possible locations on the server where you can save your packages, as described in the following list:
■ File deployment  The package is stored in an available directory on the server's file system.
■ Package Store MSDB deployment  The package is stored inside the sysssispackages table in the SQL Server msdb database (a query sketch follows this list).
■ Package Store File deployment  The package is stored on the server's file system in a Package Store configured path, typically %Program Files%\Microsoft SQL Server\100\DTS\Packages.
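For packages deployed to msdb, a quick way to see what's on the server (a sketch, not a replacement for the management tools) is to query the sysssispackages catalog table directly:

    -- List SSIS packages deployed to the msdb database
    SELECT  name,
            [description],
            createdate,
            vermajor,
            verminor,
            verbuild
    FROM    msdb.dbo.sysssispackages
    ORDER BY name;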
Although the difference between msdb deployment and the other two options should be clear, you might be wondering what kind of difference exists between file deployment and Package Store File deployment, because they both store the package on the file system. While file deployment basically allows you to use any directory available in the file system, Package Store File deployment manages and monitors the %Program Files%\Microsoft SQL Server\100\DTS\Packages folder so that you can see and manage its content right from SQL Server Management Studio (SSMS). The Package Store is handled by the SSIS service, which implements monitoring and cataloging functionality. This allows you to keep an ordered and clear catalog of available packages. We'll talk about the Integration Services service features again later in this chapter.

So how do you decide which location to use? There are two main points to consider in answering that question.

The first is simplicity of deployment and management. Using the file system as a storage location allows you to deploy your packages by just copying them to the target folder, while the usage of the msdb database requires you to deploy the packages by using the dtutil.exe tool, importing them through SSMS, using a custom deployment utility that leverages the SSIS API, or using BIDS to save a copy of your package in the msdb database. Also, the management of package files is easier with a file system because you're simply dealing with files. For example, backing up all the stored packages simply requires copying them to a safe location. With this solution, you can also easily decide to back up only some packages and not all of them. Using msdb this is also possible, but it's not as straightforward: you must use the dtutil.exe tool or SSMS to export the packages you want to back up so that you have them stored in .dtsx files, and then move those files to a secure place. However, backing up all the packages in msdb is as easy as backing up the msdb database and might already be covered under your existing SQL Server backup strategy.

On the other hand, you should also consider the security of package information. When a package is stored in the file system, you might need to encrypt some values, such as passwords, so that they cannot be stolen. This type of problem doesn't exist with a package stored in msdb, because access to any part of the package is managed by SQL Server directly and thus access is granted only to those who have the correct permissions defined. There is more complexity to properly securing SSIS packages, and we'll discuss this later in the chapter (in "Introduction to SSIS Package Security"), when we discuss security configuration options in the context of the ways you can manage and execute packages.

When deploying SSIS packages, you also have to consider package dependencies, such as files, file paths, connection strings, and so on. To avoid encountering errors when packages are executed on production servers, you need to be able to dynamically configure the locations of such resources without having to directly modify any package properties using BIDS. BIDS might not even be installed on production servers. Also, modifying packages directly after development might introduce bugs and other issues. So, what you really need is a separate location where you can store all the values that might have to change when you deploy the package to the server. This way, when the package is executed, the values will be taken from this defined, external configuration. SSIS offers this feature, which is called Package Configurations.
Package Configurations

In BIDS, you can enable a package to use an external configuration by selecting the Package Configurations item from the SSIS menu. You'll see the Package Configurations Organizer dialog box shown in Figure 18-9. Note that you can associate more than one external configuration with a single SSIS package.
Figure 18-9 Package Configurations Organizer dialog box
The Enable Package Configurations check box allows you to specify whether the package will try to load the specified package configurations. Package configurations can be created by clicking the Add button, which starts the Package Configuration Wizard. On the Select Configuration Type page (shown in Figure 18-10), you have to select where you want to store the package configuration data.
Figure 18-10 Select Configuration Type page of the Package Configuration Wizard
The Configuration Type drop-down list allows you to choose from the following five types of storage locations for variable values:
■ XML Configuration File  Stores object property values using an XML file.
■ Environment Variable  Allows you to use an environment variable to save the value of an object property.
■ Registry Entry  Stores an object property value into a registry key.
■ Parent Package Variable  This option is useful when you have a package that can be executed from another package. It allows you to bind an object property value to the value of a variable present in the calling package.
■ SQL Server  Stores object property values using a SQL Server database table (a sketch of the default table follows this list).
FIgure 18-11 Selection of property values to be exported to a package configuration
Here's an example. Suppose that these are the package configurations set in the Package Configurations Organizer dialog box:
■ Package Configuration One: Tries to set A to 100
■ Package Configuration Two: Tries to set A to 200
■ Package Configuration Three: Tries to set A to 300
At runtime, the value for A will be 300. This can seem confusing, but it can be very useful when you need to make an exceptional (such as a one-time-only) execution of your package using some specific configuration values that should be used only for that execution. Thanks to the way in which package configurations are loaded and used, you can simply create a special package configuration for that particular execution and put it as the last one in the list, and that’s it. Without having to modify the usual package configuration, you can handle those particular exceptions. Of course, this type of complexity should be used sparingly because it can lead to confusion and undesirable execution results.
As we've discussed, package configurations are referenced directly from the package and are enabled using the previously mentioned Package Configurations Organizer dialog box. Because your package might need to reference different package configurations when deployed on production or preproduction servers, you might also need to specify the package configuration it has to use directly when invoking its execution. You can do that with DTExec, DTExecUI, or the specific Integration Services job step.

Note It's important that you understand that the package stores a reference to the configuration internally, in the form of a hard-coded location (a path for XML files, for example). Also, with SQL Server 2008, the configuration behavior has changed, so you can override a SQL Server–based configuration connection string using the /CONN switch. For more information, see the "Understanding How Integration Services Applies Configurations" section in the following white paper: http://msdn.microsoft.com/en-us/library/cc671625.aspx.
Copy File Deployment

This is by far the simplest method. You can simply take the .dtsx file you have in your solution directory and copy it to your target folder on the destination server. In this case, you typically need to have a file share on the server where you can copy packages, as shown in Figure 18-12.
Figure 18-12 File copy deployment
If the directory on the server is not the directory managed by the SSIS Package Store, the packages stored there won't be visible from SSMS. If you want to have packages visible and manageable from there, you have to copy your .dtsx files to the SSIS Package Store directory, which is %Program Files%\Microsoft SQL Server\100\DTS\Packages by default. In that folder, you can also organize packages using subfolders. For example, you might want to deploy your package Sample Package 1.dtsx so that it will be stored in a folder named SQL2008BI_SSIS_Deployment. Because this folder is used by the Integration Services service, you can see its content right from SQL Server Management Studio, as shown in Figure 18-13.
Figure 18-13 SSMS Package Store folder browsing
And from there, you can manage the deployed package. Management tasks include scheduling package execution, logging package execution, and assigning variable values at execution runtime. Also, management includes backing up the storage location or locations, such as the file system or msdb database where the packages are stored.
BIDS Deployment Packages can also be deployed directly from BIDS. This is a very simple way to deploy packages because you don’t have to leave your preferred development environment. To accomplish this type of deployment, all you have to do is select the Save Copy Of <xxx file> As item from the BIDS main File menu. To have this menu item available, you have to verify that you have the appropriate package selected in the Solution Explorer window. One quick way to verify this selection is to look at the Properties window, shown in Figure 18-14. You should see the name of the package that you want to save as shown in this window.
Figure 18-14 Package Properties window
Otherwise, you might not be able to see the Save Copy Of…As menu item. After selecting the previously mentioned menu item, the Save Copy Of Package dialog box appears, as shown in Figure 18-15.
Figure 18-15 Save Copy Of Package dialog box
In this dialog box, you can decide where the package will be deployed through the Package Location property. The list box allows you to choose from the three locations we discussed earlier: SQL Server (MSDB), SSIS Package Store (which allows you to deploy to either a managed path or msdb), or a defined file system path. For the SQL Server and SSIS Package Store options, you also provide the target server and the authentication information needed to connect to that server. For the SSIS Package Store option, only Windows Authentication will be available; while for SQL Server, you could alternatively use SQL Server Authentication. In the Package Path text box, enter the path and the name that the package will have once it’s deployed on the new server. The Protection Level option allows you to decide whether to encrypt your package after it has been deployed and how to encrypt it. For the package’s protection level, you can choose from six options. These options are explained in “Setting the Protection Level of Packages” in Microsoft SQL Server Books Online as follows: ■■
Do Not Save Sensitive This protection level does not encrypt; instead, it prevents properties that are marked as sensitive from being saved with the package and therefore makes the sensitive data unavailable. If a user opens the package, the sensitive information is replaced with blanks and the user must provide the sensitive information.
■■
Encrypt All With Password Uses a password to encrypt the whole package. To open or run the package, the user must provide the package password. Without the password, no one can access or run the package.
■■
Encrypt All With User Key Uses a key that is based on the current user profile to encrypt the whole package. Only the same user using the same profile can load, modify, or run the package.
■■
Encrypt Sensitive With Password Uses a password to encrypt only the values of sensitive properties in the package. DPAPI is used for this encryption. DPAPI stands for data protection API and is a standard in the industry. Sensitive data is saved as a part of the package, but that data is encrypted by using the specified password. To access the package, the password must be provided. If the password is not provided, the package opens without the sensitive data so that new values for sensitive data have to be provided. If you try to execute the package without providing the password, package execution fails.
■■
Encrypt Sensitive With User Key Uses a key that is based on the current user profile to encrypt, using DPAPI, only the values of sensitive properties in the package. Only the same user using the same profile can load the package. If a different user opens the package, the sensitive information is replaced with blanks and the user must provide new values for the sensitive data. If the user attempts to execute the package, package execution fails.
■■
Rely On Server Storage For Encryption Protects the whole package using SQL Server database roles. This option is supported only when a package is saved to the SQL Server msdb database. It’s not supported when a package is saved to the file system.
As you can see, you have several choices for protecting your sensitive package data, so it's important for you to understand what type of data can be considered sensitive. Certain property values are obviously sensitive, such as passwords, while other values do not so obviously need encryption or security. It's important that you document in your business requirements which information is considered security-sensitive and then create your package configurations accordingly. You should be aware that the default setting is Encrypt Sensitive With User Key. Of course, like any type of security that uses key-based encryption, this type of security depends on appropriate key storage mechanisms. You should partner with your network administrators to ensure that appropriate—that is, nondefault—mechanisms are in place for such key storage if you choose to use this type of encryption for your SSIS packages. We advocate that clients use "just enough" security in SSIS package protection design (and we implement that ourselves). That is, we frequently stick with the default setting (Encrypt Sensitive With User Key) because of its small overhead and ease of use. As mentioned, the only complexity in using this setting is appropriate key storage.
We've used password-based encryption in scenarios where no key management was in place. We prefer to refrain from password-based encryption for a couple of reasons. The first is password management—that is, storage, complexity, and recovery. The second is that because people are involved in handling passwords, this mechanism is inherently less secure than a key-based system. Of course, we have encountered scenarios where the entire package needed to be encrypted. Whole-package encryption adds significant overhead to package execution and should be used only in business scenarios that specifically call for it.
Deployment with the Deployment Utility The last option you have for deploying a package to a target server is the use of the BIDS SSIS Deployment Utility. To enable this feature, you have to set a specific project property. You do this by right-clicking on the project in the BIDS Solution Explorer window and then clicking Properties on the shortcut menu. After selecting Properties, the project property window appears, as shown in Figure 18-16. In the Deployment Utility section, you have to set the CreateDeploymentUtility property to True. From here on, every time you build the project by selecting Build <project name> from the Build menu in BIDS, all the packages present in the project will be copied into the directory defined in DeploymentOutputPath, where a special XML-based file with the extension SSISDeploymentManifest will be created.
Figure 18-16 The project's property window, where you can enable the Deployment Utility
Now, to deploy your packages to the server you want, all you have to do is double-click on that manifest file. After you do that, the Package Installation Wizard opens. By going through the pages of the wizard, you can decide where to deploy packages. In this case, you’re limited to only two locations: SQL Server or the file system, as shown in Figure 18-17.
Figure 18-17 Package Installation Wizard location selection window
When you click Next, you see the Select Installation Folder page if you chose File System Deployment; if you chose SQL Server Deployment, you see the Specify Target SQL Server page. Then, when you click Next on either of those pages, the Confirm Installation page appears, which simply states it has enough information to start and tells you to click Next to start the installation. Finally, you see the Finish The Package Installation Wizard page, where you can view what’s been done and click Finish. After all package and package configuration XML files have been deployed, you have the chance to change the values saved in the XML configurations used by your packages. Keep in mind that you can do that only if they’re using an XML configuration file. On the Configure Packages page of the Package Installation Wizard, shown in Figure 18-18, there are two configurable properties: one for a file name and path, and the other for a file name. Both properties are configured dynamically via this XML file, and their values are retrieved at SSIS package runtime.
Figure 18-18 The Configure Packages page of the Package Installation Wizard
It’s a best practice to use package configurations in general and XML files in particular, as they are the most flexible for both deployment and for subsequent customizations. We remind you that if sensitive information is stored in these XML files, you must consider how you will secure this information. Alternatively, as mentioned, you can choose to store configuration values in SQL Server and thereby take advantage of SQL Server security. Finally, if you don’t want to allow configuration changes during deployment, you can disable it by specifying the value False for AllowConfigurationChanges in the project’s Deployment Utility properties. Note In addition to using the SSIS Deployment Utility to deploy SSIS packages, you can also use SSMS or the command-line utility dtutil.exe to accomplish package deployments.
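For example, a dtutil command along the following lines copies a package from the file system into the msdb database on a target server; the file path, server name, and package name are placeholders for your own deployment:

dtutil /FILE "C:\Deploy\LoadSales.dtsx" /DestServer PRODSQL01 /COPY SQL;LoadSales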
SQL Server Agent and Integration Services
After your SSIS packages have been deployed to the target server or servers, you need to schedule their execution. To do that, you'll generally use SQL Server Agent, creating a specific job for the execution of each (normally updated) package. To create a new job, expand the SQL Server Agent node in SSMS Object Explorer so that you can see the Jobs item. Right-click Jobs and select New Job from the shortcut menu. You need to provide a name for the job.
A job is made of smaller elements called steps, which can perform specific actions. To create a new step, select the Steps page on the left side of the dialog box, and click the New button at the bottom of the dialog box. A name for the job step needs to be provided, and you need to select the type of step to perform. In this case, you use the job step type SQL Server Integration Services Package to run the desired package, as shown in the New Job Step dialog box in Figure 18-19.
Figure 18-19 New Job Step dialog box
Here you'll work with an interface similar to the one offered by DTExecUI, where you can choose the package you want to run and specify all the options you want to configure. After you select the package and configure its options, you decide on the schedule and the notification options just as you would for any other SQL Server Agent job. If you prefer to invoke package execution directly by using DTExec or another tool—for example, DTLoggedExec—you can choose the job step type Operating System (CmdExec). With this job step, you can call any application, just as if you were invoking it from the command line.
Introduction to SSIS Package Security When a package runs, it might need to interact with the operating system to access a directory where it has to read or write some files, or it might need to access a computer running SQL Server to read or write some data. In any of these cases, the package needs to be recognized by the target entity—whether it is the operating system or another kind of server, from
databases to mail servers—so that it can be authenticated and authorized to perform the requested activity. In other words, the package needs to have some credentials to present to a target system so that security can be checked. Understanding which credentials a package will run under is vital to avoid execution failures with packages that are deployed on servers and scheduled to be executed by an automation agent such as SQL Server Agent.

This is a common source of questions and consternation in the DBA community, so we recommend that you read the next section particularly closely. We almost never see execution security set up correctly in production. In the most common situation, the DBA attempts to run the package using (appropriate) restricted security and the package fails. Rather than troubleshooting, what we often see is a reversion to no security—that is, running the package with full administrative permissions. Don't replicate this bad practice in your production environment!

When a package is executed interactively—such as when you run a package from BIDS, from DTExecUI, or by invoking it from the command line using DTExec—it's fairly obvious under which credentials the package runs: those of the user who executed it. In this case, each time the package executes and accesses a resource that uses Windows Authentication, it presents the credentials of that user to the target system. So, for example, if your package has a connection to SQL Server that uses Windows Authentication, or it needs to load a file from a network share, and you're executing the package using DTExecUI, your Windows user account needs appropriate permission to access SQL Server and to read from or write to the tables used in the package. The same account is also used to verify the credentials needed to access any files stored locally or on network shares.
Figure 18-20 The Run As combo box in the Job Step window, configured to use the SQL Server Agent Service Account
To create a proxy account, you first need to ask your system administrator to create a domain user that will be used by SSIS to run your package. Suppose that the Windows domain user is called MYDOMAIN\SSISAccount. After the domain account is created, you associate this new account’s login name and password inside SQL Server so that these values can be used later to impersonate the account and execute SSIS packages under its credentials. To do that, you first have to open the Security item in Object Explorer in SSMS. Right-click Security and select New Credential on the shortcut menu. The New Credential dialog box opens, and here you specify a name for the credential and the login name and password for the MYDOMAIN\SSISAccount user as shown in Figure 18-21.
Figure 18-21 The New Credential dialog box
This SQL Server object, also called a credential, simply holds—in a secure way—the specified login name and password. The credential object needs a name, just like any other object in SQL Server, and here we use the name SSIS Account Credentials. After creating the credential object, you can define the proxy account. Basically, a proxy account is an object that defines which SQL Server users can use the chosen credential object and for which purposes. Proxy accounts can be found and defined under the SQL Server Agent folder in Object Explorer, in the Proxies folder. Each proxy account type is displayed in the folders for the subsystems associated with it—that is, ActiveXScript, Operating System (CmdExec), SQL Server Integration Services Package, and so on. You can see this by taking a look at the SQL Server Agent/Proxies folder in SSMS. As you can see, you can define proxy accounts for any of the SQL Server Agent job steps that need access to resources outside SQL Server boundaries and thus need to authenticate each request. Right-clicking the Proxies item allows you to select the Unassigned Proxy command, which you use to create a new proxy assignment. The New Proxy Account dialog box, where you can create your proxy account, appears, as shown in Figure 18-22.
Figure 18-22 New Proxy Account dialog box
Here you can set the name of your proxy account, the credentials that this proxy account will use when it needs to be authenticated outside SQL Server by the Windows operating system, and the subsystems that are authorized to use this proxy account. If you’re the SQL Server DBA, on the Principals page you can also specify which SQL Server users are authorized to use this proxy account. After creating the proxy account, you can use it when setting up the SQL Server Agent job step as shown in Figure 18-23. You can specify the proxy account instead of SQL Server Agent Service Account in the Run As combo box.
Figure 18-23 The Run As combo box in the Job Step window, now configured to use the created proxy account
Each time SQL Server Agent executes the package using this step, the package will run under the credentials of the MYDOMAIN\SSISAccount account. Therefore, you have to configure all the resources used by this package (SQL Server, file system, mail servers, and so on) so that they will authenticate and authorize MYDOMAIN\SSISAccount. When defining permission for that account, always keep security in mind and give that account only the minimum permission it needs to perform the work defined in the package.
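If you prefer to script this setup instead of clicking through the dialog boxes, the same objects can be created with Transact-SQL. The following is only a sketch; the account, credential, proxy, and login names are placeholders, and you should verify the Integration Services subsystem ID against msdb.dbo.syssubsystems on your own server:

-- Store the Windows account's name and password as a credential
CREATE CREDENTIAL [SSIS Account Credentials]
    WITH IDENTITY = N'MYDOMAIN\SSISAccount',
         SECRET = N'<strong password here>';

-- Create the proxy account on top of that credential
EXEC msdb.dbo.sp_add_proxy
    @proxy_name = N'SSIS Package Execution Proxy',
    @credential_name = N'SSIS Account Credentials',
    @enabled = 1;

-- Allow the proxy to be used by the Integration Services package subsystem
-- (subsystem_id 11 on a default SQL Server 2008 installation; check syssubsystems)
EXEC msdb.dbo.sp_grant_proxy_to_subsystem
    @proxy_name = N'SSIS Package Execution Proxy',
    @subsystem_id = 11;

-- Let a non-sysadmin login create job steps that run under this proxy
EXEC msdb.dbo.sp_grant_login_to_proxy
    @proxy_name = N'SSIS Package Execution Proxy',
    @login_name = N'MYDOMAIN\ETLOperator';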
Handling Sensitive Data and Proxy Execution Accounts
As we discussed earlier in this chapter, each time you save a package with sensitive data to a file, that data needs to be removed or protected. If you save a package with the option Encrypt Sensitive With User Key, only the user who saved the package can decrypt the sensitive data. This often causes decryption issues when scheduling the package execution, because the executing user account usually differs from the user account that created and saved (and encrypted) it. This means that as soon as the executing account tries to open the package, it won't be able to decrypt the sensitive data, and any connection to resources that require login or password information in order to authenticate the request will fail. You might be thinking that you can create a proxy account that allows the package to use the same account as the person who saved it, but this isn't a recommended practice because you might have more than one person (each with his or her own account information) who needs to execute the package on a regular basis. To solve this problem, you can choose to save the package using the Encrypt Sensitive With Password option so that all you have to do is provide the password to decrypt sensitive data to the SQL Server Agent Integration Services job step. Unfortunately, although this solution encrypts the package, it's rather ineffective because it provides only weak security. Passwords are often compromised through inappropriate sharing, and they're also often forgotten. For these reasons, we avoid package deployment to the file system and prefer deployment to the SQL Server msdb database. Here packages are not encrypted, and security is enforced and guaranteed by SQL Server. Another option we sometimes use is the DontSaveSensitive option, storing the sensitive information in a secured configuration location instead. Inside msdb, there are three database roles that allow you to decide who can and cannot use packages stored in that way:
■■ db_ssisadmin Can do everything (execute, delete, export, import, change ownership) on packages
■■ db_ssisltduser Can manage all their own packages
■■ db_ssisoperator Can execute and export all packages
These are the default roles; to give a user the permissions of a specific role, all you have to do is add the user to the appropriate role. As usual, all the SQL Server accounts that are part of the sysadmin server role can also administer SSIS without restrictions because they're implicitly part of the db_ssisadmin role. Of course, you can also choose to create your own roles with specific permissions. If you want to do this, refer to the SQL Server Books Online topic "Using Integration Services Roles."
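For example, granting a Windows login the ability to execute and export packages stored in msdb is just a matter of creating a database user for it in msdb and adding that user to the appropriate role; the login name used here is a placeholder:

USE msdb;
GO
CREATE USER [MYDOMAIN\ETLOperator] FOR LOGIN [MYDOMAIN\ETLOperator];
EXEC sp_addrolemember N'db_ssisoperator', N'MYDOMAIN\ETLOperator';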
Security: The Two Rules
As you can see, security makes things a little more complicated at the beginning. However, because SSIS packages can interact with almost any type of resource (from local files to databases to network-accessible or even Internet-accessible resources and systems reached through Web services or FTP), security must be appropriately constrained so that no harm—voluntary or involuntary—can be done to the system. Here are two best practices:
■■ If possible, always use Windows Authentication so that you don't have to deal with sensitive data. In this case, you can simply deploy packages to the file system or the SSIS Package Store and just configure the proxy account to use.
■■ If you need to store sensitive data in your package, use the msdb database as the storage location so that you can control who can access and use that package simply by using the SQL Server security infrastructure, thereby avoiding having package passwords passed around in your company.
The SSIS Service
The SQL Server Integration Services service is a Windows service that monitors and catalogs packages. Although you can manage this service—that is, configure its user account, start or stop it, and so on—from the Windows Control Panel, you'll probably use SSMS to work with the SSIS service because of the greater functionality exposed through this interface. You can connect to the service from within SSMS by using the Object Explorer window and clicking Integration Services. Only Windows Authentication is supported for connecting to the SSIS service. After you've successfully connected to the SSIS service, you can view all the packages saved in the SSIS Package Store and in the msdb database in the Object Explorer window of SSMS. This is shown in Figure 18-24. You can also view all the currently running (executing) packages on that instance, regardless of how they were executed—that is, from BIDS, from DTExecUI, and so on. Note To associate more than one physical folder with the SSIS Package Store, you need to edit the %Program Files%\Microsoft SQL Server\100\DTS\Binn\MsDtsSrvr.ini.xml file to add other file locations.
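The structure of that file is straightforward. A default installation looks approximately like the following; the third Folder element is a hypothetical addition that exposes an extra file system folder through the service:

<?xml version="1.0" encoding="utf-8"?>
<DtsServiceConfiguration xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <StopExecutingPackagesOnShutdown>true</StopExecutingPackagesOnShutdown>
  <TopLevelFolders>
    <Folder xsi:type="SqlServerFolder">
      <Name>MSDB</Name>
      <ServerName>.</ServerName>
    </Folder>
    <Folder xsi:type="FileSystemFolder">
      <Name>File System</Name>
      <StorePath>..\Packages</StorePath>
    </Folder>
    <Folder xsi:type="FileSystemFolder">
      <Name>ETL Packages</Name>
      <StorePath>D:\SSIS\Packages</StorePath>
    </Folder>
  </TopLevelFolders>
</DtsServiceConfiguration>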
Figure 18-24 The SSMS menu items for the SSIS service
For each running package, you can halt package execution in this window. Also, for all the stored packages, you can perform administrative tasks, such as creating new folders, importing and exporting packages, executing packages, or deleting packages. Your account must have appropriate permission to perform these types of tasks—that is, stopping, configuring, and so on—to successfully complete them in the SSMS interface. Tip We already mentioned the usefulness of BIDS Helper, a free tool for SSAS cubes. BIDS Helper is available from CodePlex at http://www.codeplex.com/bidshelper. The tool includes a number of utilities that make working with SSIS packages easier. These tools include the following: Create Fixed Width Columns, Deploy SSIS Packages, dtsConfig File Formatter, Expression And Configuration Highlighter, Expression List, Fix Relative Paths, Non-Default Properties Report, Pipeline Component Performance Breakdown, Reset GUIDs, Smart Diff, Sort Project Files, SSIS Performance Visualization, and Variables Window Extensions.
Summary In this chapter, we covered all the aspects of SSIS package versioning and deployment. We then took a closer look at security options that are involved in package deployment. We advise you to take to heart our recommendations for using appropriate security for executing packages. You can easily expose unacceptable security vulnerabilities if you do not follow best practices. You’ve also learned how to monitor and manage packages through the SSIS service and how to deal with security in this area.
Chapter 19
Extending and Integrating SQL Server 2008 Integration Services As we discovered in previous chapters, SQL Server Integration Services (SSIS) gives us a lot of power and flexibility right out of the box. With dozens of tasks and components ready to use, a lot of the typical work in an extract, transform, and load (ETL) phase can be done by just dragging and dropping objects, and configuring them appropriately. But sometimes these included tasks are just not enough, particularly because business requirements or data transformation logic can get complex. SSIS provides an answer even to these demanding requirements we find in business intelligence (BI) projects. This answer lies in using very powerful scripting support that allows us to write our own custom logic using .NET code. In this chapter, we show you how to use scripting and create custom objects, as well as how to invoke the power of SSIS from your custom applications.
Introduction to SSIS Scripting
Scripting in SQL Server 2008 Integration Services is available through two different objects in the Toolbox. The Script Task enables scripting usage in the control flow, and for scripting in the data flow you use the Script Component. As you'll see later, transforming data is just one of the many things you can do with the Script Component. The two objects have a lot of things in common but are not identical. Commonalities include the fact that both objects deal with the .NET Framework, which results in the scripting capabilities being common to both objects. For both, you use the same integrated development environment—Microsoft Visual Studio Tools for Applications—and also for both, you can choose between C# and Visual Basic .NET as a scripting language. The differences are instead related to the different purposes of the two objects. In general, the ability to write scripts in C# is new to SQL Server 2008. In previous versions, you were limited to writing in Visual Basic .NET only. The Script task lives in the control flow, and its script can accomplish almost any generic function that you might need to implement. If the Script task is not explicitly put into a Loop container, it will run only once and will not have any particular restriction in the way it can access package variables. The Script component lives in the data flow, so its script code—not all of it, but at least the main processing routine—will be executed as many times as needed to process all the data
that comes through the component. The script here works directly with data that flows from the sources to the destinations, so you also need to deal with columns and flow metadata. You can decide whether to use the Script component as a source, destination, or transformation. Depending on your decision, you need to implement different methods in the script so that you can consume or produce data, or both. In addition to these generic differences, there are also other differences related to the way in which the scripts can interact with the package. As you’ll see in more detail in the upcoming pages, there are differences in how to deal with package variables, logging, debugging, and so on. Because the Script component is more complex than the Script task, we’ll begin by taking a look at how the Script task works and what types of business problems you can use it to address, and then we’ll turn our attention to the Script component. Because some concepts are shared between the two objects but are simply implemented in a different way, drilling down into the Script component after taking a closer look at the Script task makes sense.
Visual Studio Tools for Applications When implementing SSIS scripting, you will use a new interface in SQL Server 2008. SQL Server 2008 includes Visual Studio Tools for Applications (VSTA) rather than Visual Studio for Applications (VSA), which was the default scripting environment used in SQL Server 2005. This is a big improvement because VSTA is a full-featured Visual Studio shell. So you finally have full access to all Visual Studio features; you’re no longer limited to referencing only a subset of assemblies—you can use and reference any .NET assembly you might need. You can also reference Web Services simply by using the standard Add Web Reference feature.
The Script Task The Script task allows you to add manually written business logic into the control flow. Let’s take a closer look now at exactly how this task works. To get started, create a new SSIS package in Business Intelligence Development Studio (BIDS). Drag the Script Task from the Toolbox onto the designer surface. After you have dropped the appropriate object on the control flow surface, double-click on the newly created Script Task box to open the Script Task Editor dialog box. Note that the default scripting language is C#. Alternatively, you can write your SSIS scripts in Visual Basic .NET. The language type is configured at the level of each Script task using the ScriptLanguage property. Once you have begun editing the script, you will not be able to change the language, so make sure you pick the appropriate one before clicking the Edit Script button. Note also that the default entry point (or method) is named Main. This is shown in Figure 19-1.
Figure 19-1 Script Task Editor
In addition to choosing the script language and the name of the entry method, you can also use this dialog box to define variables that your script will use. You can define the usage of variables associated with your Script Task as either ReadOnly or as ReadWrite. The SSIS runtime will try to scale as much as possible to make your package run faster and faster, and this means that anything that could be run in parallel is likely to be executed in parallel with other tasks. The SSIS runtime needs to know which tasks can run in parallel without interfering with each other. If two tasks both need to access variable values in read/write mode, it’s better not to have them work on the same variable in parallel; otherwise, you’ll surely have some unpredictable results, because there’s no guarantee of the order in which the tasks will access the variable. Note You should be aware that the SSIS runtime doesn’t prevent tasks from running in parallel, even if they both access the same variable. Also, the locking method functions in a way that you might not expect. For more information, see the following post: https://forums.microsoft.com/ forums/showpost.aspx?postid=3906549&siteid=1&sb=0&d=1&at=7&ft=11&tf=0&pageid=0). One example of this unexpected behavior is that locking a variable for read doesn’t stop you from writing to it.
Note that here you don’t have to prefix the package variable name that you want to be able to access from your script’s code with the @ character, because you’re not specifying an expression, just package variable names. We remind you that SSIS variable names are always case sensitive. Specifically, this is important to remember if you intend to write your script in Visual Basic .NET. Figure 19-2 shows an example of some configured script properties. We’ve defined a couple of variables in the Script Task Editor dialog box. These are the ReadOnlyVariables properties named User::FileName and User::Path, and the ReadWriteVariables property named User::FileExists. We’ll refer to these variables in the script that we write to check for the existence of a file later in this section.
Figure 19-2 Script variable properties
Note You can also lock variables directly within the script, using the Dts.VariableDispenser object. The advantage of using this over the ReadOnlyVariables or ReadWriteVariables properties is that the variables are locked for a shorter period, and you have explicit control over when the lock starts and stops. Here is an example of this code:

Variables vars = null;
Dts.VariableDispenser.LockOneForRead("FileLocked", ref vars);
vars["FileLocked"].Value = true;
vars.Unlock();
Having defined everything you need, you can now start to write the script. To do this, you need to use Visual Studio Tools for Applications, which will run after you click the Edit Script button in the Script Task Editor dialog box. After VSTA is loaded, you’ll see that some autogenerated code is already present. Take a look at the autogenerated Main method shell; it is here where you’ll write the majority of the code you want to be executed. As you author the script, remember that you can use any feature that your selected .NET language supports. Because you’re using a Visual Studio shell, you can reference other assemblies simply by using the standard Visual Studio item Add Reference. In the same way, you can also add references to external Web Services by using the Add Web Reference menu item. Note The terms assembly and namespace are frequently used in .NET terminology. Assemblies are basically DLL or EXE files that provide some functionality. Those functionalities are provided by classes that make them available through methods and properties. Classes are organized into namespaces that are contained in and across assemblies.
The Dts Object
To access any package variable, you have to use the Dts object that the SSIS engine exposes, and use its Variables collection property to access the particular variable object that you want to work with. After you have access to the package variable, you can use its Value property to get or set the variable value. Because this is a common source of bugs, we remind you once again that all variable names are case sensitive—regardless of whether you're using C# or Visual Basic .NET as your scripting language. Note that we included a reference to the System.IO namespace in our sample so that we can use objects and methods that work with the file system in our script task. In the following example, we've used a simple assignment to set the value of the User::FileExists variable:

public void Main()
{
    string folderPath = (string)Dts.Variables["User::Path"].Value;
    string fileName = (string)Dts.Variables["User::FileName"].Value;
    string filePath = System.IO.Path.Combine(folderPath, fileName);

    Dts.Variables["User::FileExists"].Value = System.IO.File.Exists(filePath);

    Dts.TaskResult = (int)ScriptResults.Success;
}
In addition to using the Variables collection, the Dts object exposes several interesting properties that allow you to programmatically interact with the package. One of the most commonly used properties is TaskResult. This property is needed to notify the SSIS runtime if the execution of your script has to be considered as a success or as a failure so that the workflow defined in the control flow can follow the related branch. The other properties exposed by the Dts object are the following:
■■ Log Allows you to write logging information that will be used by the SSIS logging infrastructure.
■■ Connections Gives you access to the connection managers present in the SSIS package, allowing you to connect to the related data source.
■■ Events Allows you to fire events—for example, FireInformation and FireProgress.
■■ Transaction Permits the script to join or manage a transaction by indicating the status of the transaction through the task execution result. For example, the success of the task can equate to the success of the entire transaction, or the success of the task can cause a series of tasks in the transaction to continue.
■■ ExecutionValue In addition to indicating a simple Success or Failure result, you might need to communicate to the control flow an additional value.
For example, you might need to check whether a file exists in a specific directory. If it does exist, different branches must be followed in your control flow, depending on the
name of that file. You can assign the name of the file to the Dts object's ExecutionValue property so that it will be available in the control flow. To access this value in the control flow, you must point the ExecValueVariable property of the Script task to an existing package variable. This variable will hold the value put into the ExecutionValue property from the script code. In that way, you can easily use that value in an expression or in another task for further processing. After you've finished modifying the script to suit your business requirements, you can simply close Visual Studio Tools for Applications, and the script will be saved automatically.
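Before leaving the Script task, here is a minimal sketch that ties these pieces together. It reports the name of the first trigger file it finds through ExecutionValue; the folder, the file pattern, and the overall design are illustrative assumptions rather than a prescribed approach:

public void Main()
{
    // Hypothetical drop folder and trigger-file pattern
    string dropFolder = @"C:\Drops";
    string[] triggerFiles = System.IO.Directory.GetFiles(dropFolder, "*.trg");

    // Expose the file name to the control flow; point the task's
    // ExecValueVariable property to a package variable to capture it there
    Dts.ExecutionValue = (triggerFiles.Length > 0)
        ? System.IO.Path.GetFileName(triggerFiles[0])
        : string.Empty;

    Dts.TaskResult = (int)ScriptResults.Success;
}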
Debugging Script Tasks
A Script task can be easily debugged. As we saw in Chapter 15, "Creating Microsoft SQL Server 2008 Integration Services Packages with Business Intelligence Development Studio," and Chapter 16, "Advanced Features in Microsoft SQL Server 2008 Integration Services," SSIS offers many options for debugging in the BIDS environment. You'll be happy to know that these extend to the Script task as well. SSIS script debugging works as it does in any other standard .NET application. With VSTA open, place the cursor on the line of code at which you want execution to halt, open the shortcut menu with a right-click, and click Insert Breakpoint.
After you set the breakpoint, you can run the package in debug mode in BIDS. The Script task will have a red circle on it, indicating that the script has a breakpoint set. The execution will stop and enter into interactive debugging mode as soon as it reaches the breakpoint. All of the standard debugging windows—that is, locals, autos, and so on—are available for SSIS script debugging.
The Script Component Now that we’ve taken a look at how you can use the Script task in the control flow, we’ll turn next to using the Script component in a data flow. Here you’ll see how you can leverage the power of scripting while processing data. To take a closer look at the Script component, you need to drag one Data Flow task onto your sample package’s designer surface from the Toolbox. Double-click the Data Flow task to open the data flow in design view. Next, select the Script Component from the Toolbox and drop it onto the data flow designer surface. As soon as you drop the object, the Select Script Component Type dialog box (shown in Figure 19-3) is displayed, which enables you to specify how you intend to use this Script component in the data flow. In the Select Script Component Type dialog box, select either Source, Destination, or Transformation (the default value). For this example, choose the Source setting.
Figure 19-3 Script component type options
The Script component can behave in three different ways, depending on how you elect to use it. It can be used as a data source so that you can generate data that will be used in the data flow. This is useful, for example, when you need to read data from a custom-made source file that doesn’t conform to any standard format. With a source Script component, you can define the columns that your scripted source will provide, open the file, and read data, making it flow into the data flow. Of course, before you take the time and effort to manually write such a script, you should exhaust all possibilities for locating and downloading any additional data source components that Microsoft or the SSIS community might have already developed and made available. We recommend checking CodePlex (http://www.CodePlex.com) in particular for this type of component.
A destination Script component is similar to a source, but it works in the opposite way. It allows you to take the data that comes from the data flow and store it in a custom-made destination. Again, this is typically a destination file with custom formatting. The data flow transformation Script component is able to take input data from another data flow component, transform that data via custom logic, and provide the transformed data as an output for use by other data flow components. We most often use the Script component in SSIS package data flows to execute custom transformation logic, so we’ll focus the remainder of our discussion on this particular type of implementation. After selecting the Transformation option and clicking OK, you’ll see an object named Script Component on the data flow designer surface, and you’ll also see that the component is in an error state, as indicated by the red circle containing a white x on the Script component, shown in Figure 19-4.
Figure 19-4 Script component in an error condition
The error state is correct because a transformation Script component needs to have one or more inputs and one or more outputs, and you haven’t added either of these yet in this example. So add at least one data flow source and connect it to the Script component. For this example, use the sample database (from CodePlex) AdventureWorks 2008. After you’ve done this, open the Script Transformation Editor dialog box by double-clicking on the Script Component surface, as shown in Figure 19-5. In the Custom Properties section, you can define the language you want to use for scripting and the variables you’ll use in the script. This is the same process you followed earlier for the Script task in the control flow. Before starting to edit the script’s code, you also need to define which of the columns that come from the source will be used by your script and in which mode, ReadWrite or ReadOnly. You do this by configuring the Input Columns page of this dialog box. We’ve selected a couple of columns to use for this example, as shown in Figure 19-6.
Figure 19-5 Script Transformation Editor dialog box
Figure 19-6 Input Columns page
Because the transformation Script component can handle more than one input, you can configure the columns you need to use for each input flow using the Input Name combo box near the top of the window. You simply switch from one input to another. Only the columns
you select will be available to be processed in the script. Note also that you can configure the Usage Type for each selected column. You can choose either ReadOnly or ReadWrite. On the Inputs And Outputs page of this dialog box, you can view detailed information about the configured input and output columns. Here, you are not limited to working on existing columns; you can also add new columns to the output. New columns are often useful because when you do some processing in the script, you might need a new place to store the processed data; in these cases, a new output column works perfectly. Output columns have to be configured before you start to write a script that references them. To switch to the Inputs And Outputs page, click Inputs And Outputs in the left column, as shown in Figure 19-7.
Figure 19-7 Inputs And Outputs page
Selecting an output and clicking Add Column creates a new output column. You can also have multiple outputs, but for this example, you just need one output. Each new output has a default name of Output n, where n is a number that starts at zero and is incremented for each new output. A best practice is to define a customized, meaningful name for newly added outputs and output columns; you'll do that in this example. To add a new output, click the Add Output button in the Script Transformation Editor dialog box. The new output, named FullName Output in this example, implicitly includes all the columns from the input. If you need to add more output columns, click the Add Column button and then configure the name and data type of the newly created output column. This is shown in Figure 19-8.
Figure 19-8 Adding a new column to the output
Now that everything has been configured, click the Edit Script button on the Script page of the Script Transformation Editor dialog box so that you can begin to write the script logic you need. After you do this, VSTA loads. Now take a closer look at the autogenerated code. This code includes three methods: ■■
PreExecute
■■
PostExecute
■■
Input0_ProcessInputRow
If you've changed the name of the input flow from the default value of Input 0, the name of the latter method will be different and will reflect, for the part before the underscore character, the revised input flow name. As we continue our discussion, we'll examine in greater detail how to use each of these methods (which also depends on the usage of the Script component type that you've selected—that is, Source, Destination, or Transformation). However, we'll leave that discussion for now, because before starting to write the script you need to understand how it's possible to interact with the package and the data flow metadata. In the Script component, the Dts object (that we used previously in the Script task) is not available. To access the package variables configured to be used in the script here, you use the Variables object. For this example, you add two package variables: BaseValue, listed in the ReadOnlyVariables property, and RowCount, listed in the ReadWriteVariables property (because the script will write its result back to RowCount). You can then use both variables through the Variables object.
In the example, as the name suggests, RowCount contains the number of processed rows, and the BaseValue variable allows your script to start counting not from one but from any arbitrary number you want to use. This is particularly helpful when you have a data flow inside a Loop container and you have to count the total number of rows processed across all executions of the data flow.
It is important to consider that variables added to the ReadWriteVariables property can be used only in the PostExecute method. This restriction is enforced to avoid the performance impact of the locking mechanism that SSIS needs to use to prevent possible conflicts if two different data flows were to attempt to modify the same package variable value. Of course, local variables (declared inside the script) can be used in any section of the script, and they're subject to the normal .NET variable scoping rules. So the trick here is to create a local variable named _rowCount and use it to keep track of how many rows have been processed by our script. You do this by assigning this local value to the package variable just before the end of the Data Flow task, in the PostExecute method. Inside the Script component's code, you'll sometimes also need to access data flow metadata—in particular, columns related to the flows that come into the component. For this situation, the Input0_ProcessInputRow method provides a parameter named Row. This object exposes all selected columns as properties, and all you have to do is just use them according to the usage type you've defined for the various columns (ReadOnly or ReadWrite). In the following code example, the Row object can be used in the Input0_ProcessInputRow method's body:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    _rowCount += 1;
    Row.FullName = Row.FirstName + " " + Row.MiddleName + ". " + Row.LastName;
}
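Completing the counter pattern just described, the read/write package variable is written once, in PostExecute, rather than on every row. A minimal sketch (assuming the RowCount variable was added to the ReadWriteVariables property) looks like this:

private int _rowCount;

public override void PostExecute()
{
    base.PostExecute();

    // Write the package variable once, after all rows have been processed
    Variables.RowCount = _rowCount;
}

With that bookkeeping in place, the per-row method shown above only needs to increment the local counter and shape the row.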
The preceding example is obviously oversimplified. You should exhaust all built-in components before you write manual scripts. This is because of the time required to write and debug, in addition to the overhead introduced at package execution time. If you just need a new column built from simple input concatenation, you should use a built-in transformation (such as the Derived Column transformation) rather than writing a script. We emphasize this point because, particularly for users with .NET development backgrounds, the tendency will be to favor writing scripts rather than using the built-in components. We challenge you to rethink this approach—we aim to help you be more productive by using SSIS as it is designed to be used. We have observed overscripting in many production environments and hope to help you learn the appropriate use of the script tools in SSIS in this chapter. To that end, let’s drill a bit deeper into working with scripts. An example of more complex logic is the requirement to have more than one output. We’ll show the script to separate input data into two or more output paths. As we mentioned, adding new outputs is just a matter of some clicks—a more important question is how to manage to put a particular row into a particular output. The first thing to do after adding an output is to configure it so that it is bound to the input data. This operation is done by selecting the new output on the Inputs And Outputs page of the Script Transformation Editor, and then selecting the appropriate input in the
SynchronousInputID property. This is shown in Figure 19-9. For the example, an output named UserHashCode Output should be created, and bound to Input 0. One new column named HashCode should be added to the new output.
Figure 19-9 Associating new output to existing input
In this way, all output contains all the input columns plus any columns added to the specific output. Thus, all the input data will flow through all the outputs. However, the data for the column created by the script will be available only for the output flow that contains that column. You should also note that there’s a difference between a synchronous output and an asynchronous one—setting the SynchronousInputID property makes an output synchronous; choosing None makes it asynchronous. So, basically, if a thousand rows pass across the data flow via the script’s execution, all output will have a thousand rows, effectively duplicating all the input values. You can change this default behavior by correctly defining the ExclusionGroup property. All synchronous outputs need the same ExclusionGroup setting if they are in the same group. For the example, both outputs need the same ExclusionGroup value set, not just the new one. This property is found on the same Inputs And Outputs page of the Script Transformation Editor dialog box, and it has a default value of 0 (zero). After you change the property value to any non-zero number, only the rows that you explicitly direct to that output will flow there. Figure 19-10 shows this property.
Figure 19-10 Setting the ExclusionGroup property to a non-zero value redirects output.
The logic that decides where to direct the incoming row’s data lies in the script code. Here, the Row object exposes some new methods that have their names prefixed with a DirectRowTo string. You compose the method’s full name by using this prefix and then adding the name of the output flow that will be managed by that method. These methods allow you to decide in which of the available outputs the input row should go. You’re not limited to sending a row to only one output; you can decide to duplicate the data again by sending the row to multiple output flows.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    _rowCount += Variables.BaseValue;
    string fullName = Row.FirstName + " " + Row.MiddleName + ". " + Row.LastName;

    Row.DirectRowToFullNameOutput();
    Row.FullName = fullName;

    if (Row.EmailPromotion != 0)
    {
        Row.DirectRowToUserHashCodeOutput();
        Row.HashCode = (fullName + Row.EmailAddress).GetHashCode();
    }
}
In the example shown, all rows are directed to the FullName Output, and only rows whose EmailPromotion column value is not zero are also directed to the UserHashCode Output.
The ComponentMetaData Property
Within the script code, you can choose to write some log information or fire an event, as you did in the Script task. Keep in mind that in the Script component there is no Dts object available. Rather, for these purposes, you use the ComponentMetaData property. This property gives you access to log and event methods so that you can generate logging data (using the PostLogMessage method) and fire events (using FireInformation, FireError, and the other Fire methods). The Connections property gives you access to the connection managers defined in the package. To use a connection manager in the Script component, you first have to declare its usage in the Script Transformation Editor dialog box, on the Connection Managers page. If you need to access connection manager information by using the Connections property in your script, you must first define the connection managers on this page, as shown in Figure 19-11.
Figure 19-11 Connection Managers page of the Script Transformation Editor
Using a connection manager is a convenient way to make your script easily configurable with custom connection information. As we mentioned, this type of dynamic connection configuration information is a preferred package design method. In this example, you have a script that needs to read from or write to a file. You'll define the file name in a package variable and then associate that variable with a configuration file using the configuration feature of SSIS. Follow best practice by creating a connection manager that uses an expression to configure its ConnectionString property to use the file name stored in the variable. In this way, you'll have a solution that is understood by the SSIS architecture, is reusable by other tasks or components, and is easily configurable. We discussed the procedure for configuring connection managers in Chapter 18, "Deploying and Managing Solutions in Microsoft SQL Server 2008 Integration Services." So, if in your script you need to access any resources external to the package (such as a text file), you should do that by using a connection manager, thereby avoiding accessing the resource directly. After you configure the connection manager, you use it to get the information on how you can connect to the external resource. For a text file, for example, this information will simply be the file path. To get that information, you override the methods AcquireConnections and ReleaseConnections to manage the connections to external resources. In the aforementioned example, access to a file name and path through a defined file connection manager can be set up as shown in the following partial code snippet from the AcquireConnections method:

public override void AcquireConnections(object Transaction)
{
    _filePath = (string)this.Connections.MailingAddresses.AcquireConnection(Transaction);
}
The AcquireConnections method returns the correct object you need to use to connect with the external resource. For a file, it is a string containing the path to the file; for a SQL Server connection (using a .NET provider), it is an SqlConnection object; and so on. Be aware that AcquireConnections returns a generic Object reference, so it’s your responsibility to cast that generic Object to the specific object you need to use to establish the connection with the external resource. The Transaction object is a reference to the transaction in which the component is running. Just like all the other methods that you override, the AcquireConnections method is called automatically by the SSIS engine at runtime and during the package validation. When the engine calls that method, the Transaction parameter allows you to know whether the component is working within a transaction or not. If the data flow that contains that component is taking part in a transaction (for example, it has its Transaction property set to Required, or any of the containers that contain the data flow has that property set to Required), the Transaction object passed to the AcquireConnections method will not be null. The Transaction object can then be passed to the connection manager you’re using so that the connection manager has the information it needs to take part in that transaction, too. To do that, pass the Transaction object to the AcquireConnections method of the connection manager you’re going to use in your component to allow that component to be transactional.
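Putting the two overrides together for the hypothetical MailingAddresses file connection manager, the complete pattern might look approximately like the following sketch; the member name is ours, and the exact wrapper types come from the code that BIDS generates for your component:

private string _filePath;

public override void AcquireConnections(object Transaction)
{
    // For a FILE connection manager, AcquireConnection returns the path as a string
    _filePath = (string)this.Connections.MailingAddresses.AcquireConnection(Transaction);
}

public override void ReleaseConnections()
{
    this.Connections.MailingAddresses.ReleaseConnection(_filePath);
}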
Source, Transformation, and Destination
Now that we've covered the mechanics of the Script component, it's time to dive a little deeper into the differences between the three possible behaviors you can associate with this component. We'll start by taking a closer look at using it as a (data) source.
Source
A source Script component can have one or many outputs but no inputs. As you start to edit the code, you'll see that the method ProcessInputRow is not present. Rather, another method exists, named CreateNewOutputRows. Here is where you can write the code that generates output rows. Inside that method, if you need to generate new output rows by using a script, you can do so by invoking the method AddRow on the output buffer object. An output buffer object is created for each output flow you define, and its name is the name of the output followed by the suffix Buffer (for example, AuthorsBuffer). The following code fragment shows how it's possible to add rows to an output flow named Authors:

else if (line.ToLower().StartsWith("by"))
{
    // Authors
    authorLine = line.Substring(2).Trim();
    string[] authors = authorLine.Split(',');
    foreach (string a in authors)
    {
        AuthorsBuffer.AddRow();
        AuthorsBuffer.Name = a.Trim();
    }
}
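To put that fragment in context, here is a hedged sketch of how the surrounding CreateNewOutputRows method might look for a source that parses a text file; the _filePath field is assumed to have been populated in AcquireConnections, and only the Authors output from the fragment above is shown.

public override void CreateNewOutputRows()
{
    // Read the source file obtained from the file connection manager
    // and emit one row per author found on a "by ..." line.
    foreach (string line in System.IO.File.ReadAllLines(_filePath))
    {
        if (line.ToLower().StartsWith("by"))
        {
            string authorLine = line.Substring(2).Trim();
            foreach (string a in authorLine.Split(','))
            {
                AuthorsBuffer.AddRow();
                AuthorsBuffer.Name = a.Trim();
            }
        }
    }
}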
Synchronous and Asynchronous Transformation
Although you've been introduced to the functionality of the Script component, we need to show you another important aspect of its behavior. In all the examples you've seen up until now, all the rows that flowed into this component were processed as soon as they arrived and were then immediately sent to one of the available output flows. This behavior makes the component a synchronous transformation because, as the name suggests, the output is synched with the input—that is, it is processed one row at a time. To be more precise, the data doesn't come into the component one row at a time; rather, it's grouped into buffers. The buffers, however, are transparent to the user at this level, so we'll simplify things by saying that you deal with data one row at a time.
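For comparison with the asynchronous case discussed next, a purely synchronous transformation usually needs nothing beyond the row-processing override. The following hedged sketch, which assumes a hypothetical string column named Name, simply uppercases a value as each row flows through.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Called once per row; the modified row continues to the synchronous output.
    if (!Row.Name_IsNull)
    {
        Row.Name = Row.Name.ToUpper();
    }
}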
Although synchronous processing is adequate for some business situations, in others you might want to use a more flexible and scalable method. For example, if you want to do some aggregation processing on a large incoming dataset, you might not want to wait to receive all the incoming rows before starting to produce any output. In this case, you might prefer to configure your component to use asynchronous processing, otherwise known as asynchronous transformation. Of course, aggregation is not the only case in which you might choose to use an asynchronous transformation. Some other examples of transformation types that use asynchronous processing are the native Merge and Sort transformations. To define whether a component will be synchronous or not, you use the same Inputs And Outputs page of the Script Transformation Editor dialog box, shown in Figure 19-12. Here, for each defined output, you can find the property SynchronousInputID.
Figure 19-12 The SynchronousInputID property defines asynchronous behavior.
To define an asynchronous output, all you need to do is set the value of SynchronousInputID to None. This also means that, for the specified output, the only available output columns will be the ones that you add; the input columns won't be automatically available as they are for synchronous transformations. Tip Because the SynchronousInputID property is configured per output, a component with multiple outputs can have both behaviors.
Keep in mind that when you add a transformation Script component to your package, it automatically includes one output, which is configured to be synchronous with the input by default. If you add more outputs, the manually added outputs are automatically set to be asynchronous. Apart from these differences, the methods you use in your script for synchronous and asynchronous transformations are the same, with some additional methods typically being used for the asynchronous case (for example, ProcessInput). To create a synchronous script, you use—as we've said before—the following methods, which are made available automatically by the autogenerated code:
■■ PreExecute
■■ PostExecute
■■ Input0_ProcessInputRow
For asynchronous output, you'll also find that the autogenerated code contains the method CreateNewOutputRows. The Input0_ProcessInputRow method allows you to create output rows, and it works in the same way as it does for a transformation Script component. However, in this case, this method is not as useful, because you probably need to manipulate data that is coming from an input flow and generate output data, and the CreateNewOutputRows method gets called by the SSIS engine only one time, before the ProcessInputRow calls, not after. This is a typical scenario when you're creating a script transformation that aggregates data: for a certain number of input rows you have to produce a certain number of output rows, but the number of output rows is completely different from and independent of the input, and the output rows have a completely different structure than the input rows. As an example, consider this problem: as input rows, you have some string values; as output, you have to generate a row for every alphabetical letter used and, for each of them, report how many times it has been used—not only in one single row but in the whole data flow. To illustrate, suppose you have three input rows, each with one column and containing the values shown in Table 19-1.

Table 19-1 Sample Scenario for Aggregating Data

Row Number    Value
1             abc
2             ab
3             a
The resulting flow of aggregated values is shown in Table 19-2.

Table 19-2 Results of Sample Scenario

Letter    UsageCount
A         3
B         2
C         1
Fortunately, the methods exposed by the autogenerated code are only a subset of the available methods, and there is a method that allows you to manipulate input and output data in an arbitrary manner: the ProcessInput method. Normally, it is invoked by the SSIS engine each time a new data buffer from the input flow needs to be processed. Internally, it calls the ProcessInputRow method for each row present in the buffer. So all you need to do is override that method and write your own implementation, which lets you decide what to do with the data available in the buffer.
By overriding the ProcessInput method, you're handling data as it comes into a buffer. You might also want to define what should happen after the data is flushed out of each buffer or after the entire flow has been processed. You do this by checking the buffer object's EndOfRowset method, which tells you whether there are still rows to be processed in the incoming data. Overriding ProcessInput allows you to create a very specific transformation, without any limitations in terms of the functionality you might want to implement. All this power, as usual, comes at the price of complexity, so it's important to have a clear understanding of the difference between ProcessInput and ProcessInputRow. The former allows you to deal with entire buffers of data, while the latter gives you the simplicity of working row by row when writing the code that will transform the data. As you can see, the data flow process can be complex. At this point, we'll take a look at all of the methods you can use when developing a data flow script, as well as when the SSIS engine invokes each of them during data flow processing. The AcquireConnections method is called as you begin to execute your package. As soon as the data flow starts to process your script, the first method associated with script processing that gets executed is the PreExecute method. This method is executed only one time per package execution. Then the CreateNewOutputRows method is run. This method is also executed only one time per package execution. After that, the Input0_ProcessInput method is executed, one time for each available buffer. This method internally calls the ProcessInputRow method, which is executed for each available row. Finally, the PostExecute method gets called (one time per execution). After the script process completes, the ReleaseConnections method is called. So if you have a data flow that has split the incoming data into two buffers that hold three rows each, here is the sequence in which the previously described methods will be called:
■■ PreExecute
■■ CreateNewOutputRows
■■ Input0_ProcessInput
■■ ProcessInputRow
■■ ProcessInputRow
■■ ProcessInputRow
■■ Input0_ProcessInput
■■ ProcessInputRow
■■ ProcessInputRow
■■ ProcessInputRow
■■ PostExecute
This is only an example. Remember that you can specify only how much memory and the maximum number of rows per buffer to use; the number of buffers is calculated automatically by the system. Now let's suppose that you want to count how many times alphabet letters are used in some text that comes in as input. In the ProcessInputRow method, you create the logic to extract and count letter usage. In the ProcessInput method, you make sure that all rows will be processed and that at the end, and only at the end, you produce the output with all gathered data.

public override void Input0_ProcessInput(Input0Buffer Buffer)
{
    // Process all the rows available in the buffer
    while (Buffer.NextRow())
    {
        Input0_ProcessInputRow(Buffer);
    }

    // If no more buffers are available we processed all the data
    // and we can set up the output
    if (Buffer.EndOfRowset())
    {
        foreach (KeyValuePair<char, int> kv in _letterUsage)
        {
            LettersUsageBuffer.AddRow();
            LettersUsageBuffer.Letter = kv.Key.ToString();
            LettersUsageBuffer.Usage = kv.Value;
        }
    }
}
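The snippet above assumes a class-level dictionary and a row-level counting routine that are not shown. A hedged sketch of those missing pieces, assuming a single string input column named Line and the System.Collections.Generic namespace, might look like this:

private Dictionary<char, int> _letterUsage = new Dictionary<char, int>();

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Tally every alphabetic character found in the incoming column value.
    foreach (char c in Row.Line.ToUpper())
    {
        if (!char.IsLetter(c))
        {
            continue;
        }
        if (_letterUsage.ContainsKey(c))
        {
            _letterUsage[c]++;
        }
        else
        {
            _letterUsage[c] = 1;
        }
    }
}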
Asynchronous outputs give you a lot of flexibility, but this comes at a price. Depending on your business requirements and on the code you write, an asynchronous output might have to process a large amount of input data before it can output even a single row. During all that time, none of the components connected to that output flow will receive any data. This is known as blocking, and it can significantly degrade the performance of your package. So you need to base such a design on business requirements and test it using production levels of data during the development phase of your project.
Destination
When the Script component is configured to work as a destination, it will have only one input and no output unless you configure an error output. Basically, the way in which you use a destination Script component is similar to that of a transformation, except that you aren't required to produce any output rows. You put all the code you need in the ProcessInputRow method, using the now familiar Row object to access the data from the rows.
You can also use the PreExecute and PostExecute methods: in PreExecute you can, for example, open and configure the physical location (such as a file stream) where you want to save your data, and in PostExecute you can close it cleanly after you have finished processing. The scenario we just described is quite common because a destination Script component is typically used to save data in one or more custom file formats. You can open a stream in the PreExecute method and write the file header if you need one. Then, in the ProcessInputRow method, you can output data to the opened stream, and finally, you can close and dispose of the stream in the PostExecute call.
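A minimal hedged sketch of that pattern follows; the _filePath field is assumed to come from a file connection manager, and the Line column and header text are placeholders for whatever custom format you need.

private System.IO.StreamWriter _writer;

public override void PreExecute()
{
    base.PreExecute();
    // Open the destination file once, before any rows arrive, and write a header.
    _writer = new System.IO.StreamWriter(_filePath, false);
    _writer.WriteLine("-- export start --");
}

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Write one output line per incoming row.
    _writer.WriteLine(Row.Line);
}

public override void PostExecute()
{
    base.PostExecute();
    // Flush and release the stream after the last buffer has been processed.
    _writer.Close();
}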
Debugging Script Components
Unfortunately, debugging Script components with breakpoints is not possible because this technique is not supported in SSIS. This also means that stepping into the Script component while it's running and checking runtime values through watches is not possible. Tip If you want to debug, you have to use old-fashioned debugging tricks, such as printing messages to the screen. At least you can use the logging and event-firing capabilities of SSIS. If you really need it, you can also use the MessageBox.Show method in the System.Windows.Forms namespace to show a pop-up message, which also blocks script execution until you click the OK button. However, this technique should be used only in development and removed before the package is put into production because it can cause issues when automating package execution. In general, using the ComponentMetaData.FireInformation method is the best approach to adding debugging information to your Script component.
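As an illustration of that last point, the following hedged fragment shows FireInformation being called from a row-processing method; the subcomponent string, the message text, and the CustomerID column are placeholders of our own.

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    bool fireAgain = true;
    // Surfaces a message through the SSIS logging and eventing infrastructure
    // without blocking package execution.
    this.ComponentMetaData.FireInformation(0, "MyScriptComponent",
        "Processing row for customer " + Row.CustomerID.ToString(),
        string.Empty, 0, ref fireAgain);
}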
Overview of Custom SSIS Task and Component Development
Although SSIS scripting is quick and easy, it's not the best solution for implementing business logic in every BI solution. As with any type of scripting, SSIS package-level scripting is isolated to the particular package where the script was developed and can be reused in another package only by copying and pasting the original script. If your business requirements include complex custom logic that will be reused frequently across multiple SSIS packages, you'll want to create your own SSIS object rather than copy the same Script task from package to package. A sample of such a custom object is a task that compresses one or more files (typically, logs) and then moves them into an archive folder. This is a common operation that can be encapsulated in a custom task and used in any package your company develops in which such a
feature is needed. Authoring an SSIS component requires that you be proficient in a .NET language. You can, of course, use any .NET language—that is, C# or Visual Basic .NET. You also need a good knowledge of the SSIS object model. Note As you can imagine, creating a custom object is not a trivial job, and it could fill an entire book by itself. For this reason, we'll describe only the main topics here, focusing on the ones that need the most attention. We'll also show code fragments that give you an idea of what should be in a custom object. The best way to read this section of the chapter is with the related example at hand, where you can find the entire code, run and test it, and correlate it with the problems being discussed.
The great power of a custom SSIS object is that once you have finished its development, you can install it in the SSIS Toolbox in BIDS and reuse it across as many SSIS packages as needed. Also, you can share this object with other developers working on your project. We believe there is also potential for reusable objects to be sold commercially, even though we haven't seen much activity in this development space yet. Before you create a custom object, you might want to do a quick search to see whether your particular business problem has already been solved by a commercial custom object from another vendor. We favor buying over building because of the complexity of appropriately coding custom SSIS objects. Thanks to the pervasive use of .NET, SSIS is such an extensible product that you might think of it as a framework or development platform for creating ETL solutions. Another way to understand this is to keep in mind that you can extend nearly every object in SSIS, which provides you with many opportunities for customizing all aspects of SSIS. These customizations can include the following:
■■ Custom tasks Create custom tasks to implement some specific business logic.
■■ Custom connection managers Connect to natively unsupported external data sources.
■■ Custom log providers Log package events defining custom formats.
■■ Custom enumerators Support iteration over a custom set of object or value formats.
■■ Custom data flow components Create custom data flow components that can be configured as sources, transformations, or destinations.
To develop custom objects, you can use anything that allows you to create .NET (that is, C# or Visual Basic .NET) class libraries. Normal practice is to use a full version of Visual Studio 2008 to create .NET class libraries. Each SSIS custom object needs to provide the logic that will be available at runtime—when the package is running—and at design time—when the package that uses your component is being developed.
For this latter purpose, all objects should also provide a user interface where the package developer can configure the object's properties. Providing a user interface is not mandatory, but it is strongly advised because an easily configurable object helps you avoid bugs and lowers package development costs. Before diving into the specific code for a custom object, you need to know a few things about deploying it. After you've written all of the code for your SSIS object and are ready to compile it into an assembly, you need to follow a couple of steps to integrate the compiled assembly with the SSIS engine and BIDS. These steps apply to any custom component you develop, so keep them in mind; they are the basic knowledge you need to deploy and distribute your work. The first step is to sign the assembly, which means that you have to create a public/private key pair. In Visual Studio 2008, go to the project's properties window and click Signing in the left pane to access the Signing page. This is shown in Figure 19-13.
Figure 19-13 The Signing page of the SSIS project assembly
You can then choose to use an existing key file or create a new one. After you’ve completed this step, you can proceed with building the solution as you would with any other .NET assembly. Click Build on the shortcut menu in Solution Explorer for your SSIS component project. This is shown in Figure 19-14.
Figure 19-14 The Build option compiles your project into an assembly.
After your assembly has been successfully built—that is, it has no design-time errors—you can deploy the assembly so that it can be used by BIDS. The assembly file has a .dll extension; you need to register it in the global assembly cache (GAC). There are several methods for performing the registration. For this example, you'll use the gacutil.exe tool, which can be found in the directory where the .NET Framework Software Development Kit (SDK) or Microsoft Windows SDK has been installed. On our sample server, which has SQL Server 2008 and Visual Studio 2008 installed, we find gacutil.exe in the C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin\ folder. To register an assembly in the GAC, you just have to execute the following line from the Visual Studio command prompt (which you can find under Visual Studio Tools on the Windows Start menu), passing the path to your compiled assembly:

gacutil.exe -iF <path to your assembly .dll>

After you register the assembly in the GAC, you have to deploy it to a specific folder inside the SQL Server 2008 installation directory. Here, in the DTS folder (by default, located at C:\Program Files\Microsoft SQL Server\100\DTS), there are specific directories where, depending on the type of custom object you're deploying, you have to copy your assembly. Table 19-3 shows a list of object types and the corresponding directories to which the assemblies should be copied.

Table 19-3 SSIS Custom Objects and Target Directories

Custom Object Type      Target Directory
Task                    Tasks
Connection manager      Connections
Log provider            LogProviders
Data flow component     PipelineComponents
Foreach enumerator      ForEachEnumerators
Using Visual Studio 2008, you can make this process automatic by using post-build events. All you have to do is add the following commands:
"C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin\gacutil.exe" -iF "$(TargetPath)"
copy "$(TargetPath)" "C:\Program Files\Microsoft SQL Server\100\DTS\<Target Directory>\$(TargetFileName)"

In place of <Target Directory>, specify the appropriate directory as indicated in Table 19-3. After you successfully complete all of the steps just described, your component will be nearly ready to be used from within the BIDS Toolbox. However, for custom task objects or data flow components, you have to perform an additional step: making your object available in the appropriate Toolbox window. To add an object to the Toolbox, right-click the Toolbox and then click Choose Items. The Choose Toolbox Items dialog box opens, where you can choose an object to add. Note that in our example, shown in Figure 19-15, SSIS data flow items and SSIS control flow items are shown on two separate tabs.
Figure 19-15 The Choose Toolbox Items dialog box includes tabs for data flow components and control flow components.
After selecting your component, you can finally start to use it in your SSIS packages. We’ll look next at an example of a business problem that relates to BI and how you might solve it by writing a custom-coded task. The business problem is the need to compress data. Because of the massive volumes of data that clients often need to work with during their BI projects, we’ve found data compression via SSIS customization to be quite useful.
Control Flow Tasks
To build a custom task that can be used in the control flow, you use the base class Task. This class can be found in the namespace Microsoft.SqlServer.Dts.Runtime in the assembly
Microsoft.SqlServer.ManagedDTS. This is shown in the Add Reference dialog box in Figure 19-16.
Figure 19-16 Add Reference dialog box
If you want to supply a user interface for your task (and you normally will), you also have to add a reference to the assembly Microsoft.SqlServer.Dts.Design. Before starting to write the code of your task, you have to be sure that it can be correctly integrated with the SSIS environment. For that reason, you have to decorate the class that derives from Task with DtsTaskAttribute, which specifies design-time information such as the task name, task user interface, and so on:

[DtsTask(
    DisplayName = "CompressFile",
    IconResource = "DM.SSIS.ControlFlow.Tasks.Compress.ico",
    UITypeName = "DM.SSIS.ControlFlow.Tasks.CompressFileUI," +
        "DM.SSIS.ControlFlow.Tasks.CompressFile," +
        "Version=1.0.0.0," +
        "Culture=Neutral," +
        "PublicKeyToken=c0d3c622a17dee92"
)]
public class CompressFileTask : Task
Next you implement the methods Validate and Execute. These two methods will be called by the SSIS engine. The Validate method is called when the engine starts the validation phase. Here, you can check whether the objects on which your task will work are ready to be used. The Execute method is called in the execution phase, as the name suggests. In this phase, you should write the code that performs the actions you desire. For this example, because you’re developing a task that compresses files, the Validate method will check that a target file has been specified while the Execute method will do the compression.
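To make the idea concrete, here is a simplified, hedged sketch of what those two overrides might look like for the compression task. The TargetFile property, the GZip-based compression, and the error message are our own assumptions standing in for the real implementation; adapt them to your requirements.

public string TargetFile { get; set; }

public override DTSExecResult Validate(Connections connections,
    VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
    IDTSLogging log)
{
    // Fail validation early if no valid target file has been configured.
    if (string.IsNullOrEmpty(TargetFile) || !System.IO.File.Exists(TargetFile))
    {
        componentEvents.FireError(0, "CompressFileTask",
            "A valid target file must be specified.", string.Empty, 0);
        return DTSExecResult.Failure;
    }
    return DTSExecResult.Success;
}

public override DTSExecResult Execute(Connections connections,
    VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
    IDTSLogging log, object transaction)
{
    // Compress the target file with GZip, writing the result next to the original.
    using (System.IO.FileStream source = System.IO.File.OpenRead(TargetFile))
    using (System.IO.FileStream target = System.IO.File.Create(TargetFile + ".gz"))
    using (System.IO.Compression.GZipStream gzip =
        new System.IO.Compression.GZipStream(target,
            System.IO.Compression.CompressionMode.Compress))
    {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
        {
            gzip.Write(buffer, 0, read);
        }
    }
    return DTSExecResult.Success;
}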
In our example, we’ve chosen also to implement a user interface to make the task more user friendly. For a custom object, you must implement the user interface through the use of a standard Windows Form. This user interface will be shown to whoever uses the custom task in their own packages. During package development, the interface needs to be created in such a way that BIDS can open it internally and make it communicate with the SSIS development infrastructure. To do this, create a separate class that implements the interface IDtsTaskUI. The interface has four methods, and the one you use to create the user interface is called GetView. Create the Windows Form that you defined as your user interface, and return it to the caller (which is BIDS): public System.Windows.Forms.ContainerControl GetView() { return new CompressFileForm(_taskHost); }
The _taskHost variable holds a reference to the task that will use this GUI—in this case, the CompressFileTask created earlier. It allows the GUI to access and manipulate the properties made available by the task. You might be wondering at this point how you bind the interface you've developed so far to the class that derives from Task and contains the run-time core of your custom component. This is actually not difficult; just specify the fully qualified name of the user-interface class. This name is a comma-separated string of the following values: type name, assembly name, file version, culture, and public key token. Then place this information into the DtsTask attribute's UITypeName property:

UITypeName = "DM.SSIS.ControlFlow.Tasks.CompressFileUI," +
    "DM.SSIS.ControlFlow.Tasks.CompressFile," +
    "Version=1.0.0.0," +
    "Culture=Neutral," +
    "PublicKeyToken=c0d3c622a17dee92"
You can obtain the PublicKeyToken through the GAC or by using the sn.exe tool distributed with the .NET SDK. Using the -T parameter (which displays the public key token for an assembly) and specifying the signed assembly from which you want to read the public key token accomplishes this task:

sn.exe -T <path to signed assembly .dll>
Data Flow Components
Developing custom data flow components is by far the most complex type of SSIS custom object development. We prefer to buy rather than build, particularly in this area, because of the complexity of custom development. Other than for commercial
use, we haven’t deployed any custom data flow components into any BI solution that we’ve implemented to date. The complexity of development occurs in part because SSIS uses a large amount of metadata information to design and run data flows. Thus, your component also needs to deal with all that metadata, providing and consuming it. For this reason, development of a custom data flow component is not described here. The topic would require a chapter of its own and is beyond the scope of this book. SQL Server Books Online includes a tutorial on this topic and a sample. So, if custom development of data flow components is part of your project, we recommend that you start with SQL Server Books Online.
Other Components
In addition to creating custom control flow tasks and data flow components, you can also create custom connection managers, custom log providers, and custom foreach enumerators. This last option is one of the most interesting because it allows you to extend the supported sets over which the Foreach Loop container can iterate. To create a custom enumerator, you just create a class that derives from the ForEachEnumerator base class. As with other types of custom object development, if you want to supply a user interface, you have to decorate that class with the DtsForEachEnumerator attribute, specifying at least the DisplayName and UITypeName properties. When creating a foreach enumerator, the most important methods to implement are Validate and GetEnumerator. Validate is the usual method for checking that all the objects on which your custom object will work are defined correctly and ready to be used. GetEnumerator is the method that provides the set of values over which the Foreach Loop will iterate. The output of that method must be an object that supports enumeration, such as a List or an ArrayList. The user interface in this case is not created using a Windows Form. You have to inherit from ForEachEnumeratorUI, which in turn derives from System.Windows.Forms.UserControl. This is needed because the control that you create is displayed in the Enumerator configuration area of the Collection page of the Foreach Loop Editor dialog box. For example, suppose that you're developing a foreach enumerator that allows package developers to iterate over two dates. In Visual Studio, your user interface will look like the one shown in Figure 19-17.
Figure 19-17 Sample custom enumerator interface
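Behind that user interface, the run-time side of the enumerator can stay quite small. The following rough sketch shows a GetEnumerator override that returns every date between two boundaries; the StartDate and EndDate properties are assumptions of ours, and the code relies on System.Collections.ArrayList.

public DateTime StartDate { get; set; }
public DateTime EndDate { get; set; }

public override object GetEnumerator()
{
    // Build the collection of values the Foreach Loop will iterate over.
    ArrayList dates = new ArrayList();
    for (DateTime d = StartDate; d <= EndDate; d = d.AddDays(1))
    {
        dates.Add(d);
    }
    return dates;
}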
To use your new object, you first have to deploy the project in the usual way, with the only difference being that now you don’t have to add anything to the Toolbox. Rather, you just drop a ForEach Loop on your package and edit its properties. You’ll find your newly created extension in the Foreach Loop Editor dialog box, as shown in Figure 19-18.
Figure 19-18 Custom enumerator in BIDS
You can now start to use it in the same way as you use other built-in enumerators. Creating custom SSIS objects is not an extremely complex task, yet it is not trivial either. If you want to learn more about custom component development, a good place to start is SQL Server Books Online. Under the Integration Services Development topic, there are a lot of samples, a tutorial, and of course, a good explanation of everything you need to begin your development process.
Overview of SSIS Integration in Custom Applications
So far, you've learned how you can extend SSIS through custom object development. But sometimes you might want to extend your custom application by integrating SSIS functionality directly into it, rather than extending SSIS itself. In this section, we'll answer the following questions:
■■ How can you integrate SSIS package functionality into custom applications so that you can execute packages from inside those applications?
■■ How do you enable your application to consume the data that SSIS produces as a result of a data flow task?
Before we explain how you can do these powerful things, we should emphasize two vital concepts:
■■ A package runs on the same computer as the program that launches it. Even when a program loads a package that is stored remotely on another server, the package runs on the local computer.
■■ You can run an SSIS package only on a computer that has SSIS installed on it. Also, be aware that SQL Server Integration Services is now a server component and is not redistributable to client computers in the same manner that the components required for SQL Server 2000 Data Transformation Services (DTS) were redistributable. So, even if you want to install only Integration Services on a client computer, you need a full server license!
To load and execute a package in your application, add a reference to the Microsoft.SqlServer.ManagedDTS assembly. In this assembly, you'll find the namespace Microsoft.SqlServer.Dts.Runtime, which contains all the classes you need. The first class you have to use is the Application class. All you have to do is load a package with the appropriate Load method, depending on where the package is stored, and then you can run it by using the Execute method:

Microsoft.SqlServer.Dts.Runtime.Application app =
    new Microsoft.SqlServer.Dts.Runtime.Application();
Package pkg = app.LoadPackage(@"Sample Package 10 - Custom Enum.dtsx", null);
pkg.Execute();
If you also need to intercept and consume the events fired from packages, you have to create a custom class that derives from the base class DefaultEvents and handles the events you care about. Then you override the event handlers that you're interested in. If you need to intercept Information and Error events, your code will look like the following sample.
public class CustomEvents : DefaultEvents
{
    public override bool OnError(DtsObject source, int errorCode, string subComponent,
        string description, string helpFile, int helpContext, string idofInterfaceWithError)
    {
        Console.WriteLine(description);
        return false;
    }

    public override void OnInformation(DtsObject source, int informationCode,
        string subComponent, string description, string helpFile, int helpContext,
        string idofInterfaceWithError, ref bool fireAgain)
    {
        Console.WriteLine(description);
    }
}
Next, the LoadPackage and Execute calls should be changed to pass in a reference to the custom event listener so that the runtime can route fired events to your handlers:

CustomEvents ce = new CustomEvents();
Package pkg = app.LoadPackage(@"Sample Package 10 - Custom Enum.dtsx", ce);
pkg.Execute(null, null, ce, null, null);
It’s really quite simple to run a package from a custom application. We expect that you’ll find creative ways to use this powerful and flexible capability of SSIS. Next we’ll look at how to execute a package that is considered to be a data source so that you can consume its results with a DataSet or DataReader class. This gives you the ability to create applications that can display the result of a data flow elaboration on screen or to visualize on a grid all the rows processed with errors so that end users can immediately correct them. To do either of these things, you have to reference the assembly Microsoft.SqlServer.Dts. DtsClient.dll. It can be found in %ProgramFiles%\Microsoft SQL Server\100\DTS\Binn. This assembly contains specific implementations of the IDbConnection, IDbCommand, and IDbDataParameter interfaces for SSIS, which allows you to interact with SSIS as a standard ADO.NET data source. You can use the standard .NET classes and methods to access data from the package. Thanks to these classes, you can invoke a package execution using the DtsConnection class. This class can be used to execute a DtsCommand that provides an ExecuteReader method so that you can use it to have a .NET DataReader that can be populated to a grid, for example. In the next example, you start by creating and initializing a DtsConnection class. Just like any other connection classes, your custom connection needs a connection string. In this case, the connection string is the same string you’ll use as an argument for the dtexec.exe tool to run the desired package.
DtsConnection conn = new DtsConnection();
conn.ConnectionString =
    @"-f C:\Work\SQL2008BI\Chapter15\Packages\SQL2008BI\Sample Package 11—DataReaderDest.dtsx";
conn.Open();
Next, create and initialize the command you want to execute to get the data:

DtsCommand cmd = new DtsCommand(conn);
cmd.CommandText = "DataReaderDest";
Here the word command is used somewhat inappropriately because there are no commands to run. The command really points to a special destination that a package executed in that way needs to use. This special destination is the DataReader destination. The CommandText property needs to point to the name of the DataReader destination in the package that you are running. In this example, the data flow in the package looks like the sample shown in Figure 19-19.
Figure 19-19 Data flow for the current example
Because DtsCommand implements IDbCommand, you can use the usual ExecuteReader method to get a DataReader object that can be used to populate a DataGrid. The result is a simple application that shows in a grid the rows that the package has sent to the DataReaderDest destination. An example using a Windows Forms interface is shown in Figure 19-20.
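Continuing from the connection and command created above, a hedged sketch of the remaining plumbing might look like the following; resultsGrid stands in for a DataGridView on the form, the rows come from the package's DataReaderDest destination, and references to System.Data and the DtsClient assembly are assumed.

using (System.Data.IDataReader reader =
    cmd.ExecuteReader(System.Data.CommandBehavior.Default))
{
    // Pull the rows streamed by the package into a DataTable and bind it to the grid.
    System.Data.DataTable results = new System.Data.DataTable();
    results.Load(reader);
    resultsGrid.DataSource = results;
}
conn.Close();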
We are excited by the possibilities of integration into many different types of UI environments, and we see this as a big growth area for SSIS developers. We are not done with this story quite yet, however.
Figure 19-20 Data flow output application in a Windows Forms interface
By using the same Microsoft.SqlServer.ManagedDTS assembly and the Microsoft.SqlServer.Dts.Runtime namespace, you can manage SSIS packages in a fashion that is similar to the functionality available in SQL Server Management Studio (SSMS) or through the dtutil.exe tool. For example, you can enumerate all the available packages in a specific storage location (SQL Server or the SSIS Package Store), or you can manage storage locations by creating, removing, or renaming folders inside them. Importing and exporting packages into a storage location is also possible. Any of these actions can be accomplished by using the Application class, which exposes explicit methods for them:
■■ LoadFromSqlServer
■■ RemoveFromSqlServer
■■ CreateFolderOnSqlServer
■■ RemoveFolderOnSqlServer
Of course, the list is much longer. This is just a sample to show you the methods you can use to load, create, or remove packages and folders in a SQL Server storage location. Because of the completely exposed object model, you can create any kind of application to manage and run SSIS packages on your own, which enables you to meet particular business requirements when creating a custom administrative environment. Two good samples of that
kind of application are available on CodePlex, where they are freely downloadable along with source code:
■■ DTLoggedExec (http://www.codeplex.com/DTLoggedExec) This is a tool that allows you to run an SSIS package and produce full and detailed logging information about execution status and package runtime data.
■■ SSIS Package Manager—PacMan (http://www.codeplex.com/pacman) This is a utility designed to permit batch operations on arbitrary sets of SSIS packages. Users can select a single package, a Visual Studio project or solution, or a file system folder tree and then validate or update all selected packages in one operation.
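To tie these management capabilities back to the Application class methods listed earlier, here is a minimal hedged sketch of working against a SQL Server storage location; the server name, folder name, and package path are placeholders of ours, and passing null for the user name and password uses Windows authentication.

Microsoft.SqlServer.Dts.Runtime.Application app =
    new Microsoft.SqlServer.Dts.Runtime.Application();

// Create a folder under the root of the SQL Server (msdb) package store.
app.CreateFolderOnSqlServer("\\", "Archive", "localhost", null, null);

// Load an existing package from the server and then remove the stored copy.
Package pkg = app.LoadFromSqlServer("\\MyPackage", "localhost", null, null, null);
app.RemoveFromSqlServer("\\MyPackage", "localhost", null, null);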
Summary
In this chapter, we showed you how to extend SSIS via scripting in the Script task and Script component. We also covered the mechanics of creating custom objects. Finally, we showed you how to embed SSIS packages in custom applications. Because SSIS is a generic extract, transform, and load tool, it has many generic features that are of great help in a variety of cases. However, for some specific business problems, you might find that there are no ready-made solutions. Here lies the great power of SSIS. Thanks to its exceptional extensibility, you can programmatically implement the features you need, either by using scripting or by creating reusable objects.
Part IV
Microsoft SQL Server Reporting Services and Other Client Interfaces for Business Intelligence
Chapter 20
Creating Reports in SQL Server 2008 Reporting Services
In this chapter, we look at the parts and pieces that make up an installation of Microsoft SQL Server 2008 Reporting Services (SSRS). We start by reviewing the installation and configuration processes you'll use to set up Reporting Services. We then explore the report development environment in Business Intelligence Development Studio (BIDS). While doing this, we'll walk through the steps you'll use to build, view, and deploy basic reports of various types. This is the first of three chapters in which we'll focus on understanding concepts and implementation details for using SSRS. In the next two chapters, we'll look more specifically at using SSRS as a client environment for SQL Server Analysis Services (SSAS) cubes and data mining models. Then we'll discuss advanced SSRS concepts, such as custom client creation, working directly with the SSRS APIs, and more. Also, we'll tackle integration between SSRS and Microsoft Office SharePoint Server 2007 in Chapter 25, "SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007."
Understanding the Architecture of Reporting Services
SQL Server Reporting Services was introduced in SQL Server 2005 (with compatibility for SQL Server 2000). In SQL Server 2008, Reporting Services includes significant enhancements that make it more versatile and much easier for you to create reports using SQL Server 2008 Analysis Services OLAP objects (that is, OLAP cubes or data mining models) as source data. Reporting Services is designed to be a flexible enterprise-capable reporting solution for all types of data sources (that is, relational, multidimensional, text, and so on). The architecture of SSRS is built around the three types of activities that accompany reporting. These activity groups are report creation, report hosting, and report viewing. Each of these activity groups contains one or more components that can be used to support the activity. Before we drill into the core components in more detail, here's a list of them and a description of their primary functions:
■■ A Microsoft Windows service called SQL Server Reporting Services This is where the core report processing is done in SSRS. This component is required. Here the core input is processed and results are sent to a hosting environment and rendered as a report in one of the many available output formats (that is, HTML, Excel, CSV, XML, Image, PDF, Word, or custom).
■■ A Web service called ReportServer This exposes the core functionality of SSRS via Web service calls. Although it's not strictly a requirement to use this service, we've found that all of our clients have chosen to use it. A significant improvement in SSRS 2008 is that Internet Information Services (IIS) is no longer required as a host for this service. It can be hosted using the http.sys listener. We'll go into more detail on this latter point later in this chapter.
■■ A Web site called Reports This is the default end-user interface for viewing the reports that SSRS produces. End users access this ASP.NET 3.5 Web application, which is also called Report Manager, by navigating to the (default) URL: http://localhost/reports. This Web site also contains an administrative interface called Report Manager. Authorized administrators can perform configuration tasks via this interface. This component is optional. Some of our clients have chosen to use it, while others have preferred to use alternate client hosting environments, such as Office SharePoint Server 2007, custom Web sites, and so on.
■■ Command-line utilities, such as rsconfig.exe and others SSRS ships with several command-line utilities that facilitate scripting of common administrative tasks. Also, some administrative tasks can be completed by connecting to an SSRS instance using SSMS.
■■ Report development environments SSRS includes a couple of templates in BIDS that enable quick report development. Microsoft also plans to release an upgraded version of the stand-alone visual report creation tool named Report Builder. As of this writing, the announced plan is to release the upgraded version of Report Builder in late 2008 after RTM. Also, if you're using Visual Studio 2008, SSRS adds an embeddable component, called the Report Viewer, to the Toolbox for Windows Forms and Web Forms development projects.
■■ Metadata repository SSRS uses 31 SQL Server tables to store metadata for SSRS itself and for its configured reports. These tables can be either stored in a dedicated SQL Server database or integrated with Office SharePoint Server metadata (which is also stored in a SQL Server database).
■■ Integrated hosting and viewing (optional) Depending on what other Microsoft products are installed, such as Office SharePoint Server 2007, you might have access to prebuilt and configurable SSRS hosting applications. These are usually Web sites. In the case of Office SharePoint Server 2007, there is a set of templates, called Report Center, that ships as part of Office SharePoint Server 2007. We'll take a closer look at this and at the integration of SSRS and Office SharePoint Server 2007 metadata in Chapter 25. Also new to SQL Server 2008 SSRS is the ability to render SSRS reports in Microsoft Office Word (new) or in Microsoft Office Excel (same as the previous version for workbooks and enhanced for Excel chart rendering).
Figure 20-1 (from SQL Server Books Online) shows the core components of Reporting Services as well as a reference to third-party tools. One particularly compelling aspect of SSRS in general is that Microsoft has exposed a great deal of its functionality via Web
services, making SSRS quite extensible for all three phases of report work—that is, design, administration, and hosting.

Figure 20-1 SSRS architecture (from SQL Server Books Online). The diagram shows the report authoring and administration tools (Web browser, Report Builder, Report Designer, model designer, third-party tools, Reporting Services configuration tools, and Report Manager) sitting above the Report Server components (programmatic interfaces, report processor, scheduling and delivery processor, and the delivery, authentication, report processing, rendering, and data processing extensions), the report server database, and the data sources.
In Chapter 4, “Physical Architecture in Business Intelligence Solutions,” we introduced SSRS installation considerations. There we examined the security context for the SSRS service itself. We also examined backup and restore strategies. In the next section, we’ll expand on our initial discussion about SSRS installation and setup.
Installing and Configuring Reporting Services
As mentioned previously in this chapter, a major change to SSRS in SQL Server 2008 is the removal of the dependency on IIS. This change was driven by customer demand, and it makes SSRS more attractive in environments where the installation of IIS would have posed an unacceptable security risk (or administrative burden). In fact, there is a SQL Server Books Online entry describing how you need to configure IIS to prevent conflicts if you choose to install both SSRS and IIS side by side on the same server: "Deploying Reporting Services and Internet Information Services Side-by-Side." The first consideration when installing SSRS is, of course, on which physical server (or servers) to install its components. Although it's physically possible to install SSRS on a physical machine where either SSAS or SQL Server Integration Services (SSIS) has been installed (or both have been), we don't find this to be a common case in production environments. More commonly, SSRS is installed on at least one dedicated physical machine, sometimes more, depending on scalability or availability requirements. In Chapter 22, "Advanced SQL Server 2008 Reporting Services," we'll revisit multiple-machine installs, but as we get started, we'll just consider a single, dedicated machine as the installation target. Another important consideration when planning your SSRS installation is which edition of SSRS you should use. There are significant feature differences between the Enterprise and Standard editions of SSRS—most of which have to do with scalability. We suggest you review the feature comparison chart and base your edition decision on business requirements—particularly the number of end users that you expect to access the SSRS instance. For a complete list of feature differences by edition, go to http://download.microsoft.com/download/2/d/f/2df66c0c-fff2-4f2e-b739-bf4581cee533/SQLServer%202008CompareEnterpriseStandard.pdf. After determining what hardware to use for your SSRS installation, your next consideration is component installation and service account configuration. To install SSRS, you use SQL Server 2008's installer, which presents you with a series of dialog boxes where you enter the configuration information. This information is stored in one of two types of locations—either in XML configuration files (named RSReportServer.config or ReportServerServices.exe.config) or in SQL tables. There are two possible locations for these SQL Server metadata tables. They can either reside in a dedicated metadata database on a selected SQL Server 2008 instance (called native mode) or be part of a SharePoint metadata SQL Server database (called SharePoint integrated mode). Native mode is the default installation type. By default, native mode creates two metadata databases—named ReportServer and ReportServerTempDB—in the installed SQL Server 2008 instance. You can choose the service account that the SSRS Windows service runs under during the installation. This account should be selected based on your project's particular security requirements.
Tip There is a new, third type of installation called a files-only installation. This type generates configuration files as a result of the actions you take when using the Setup Wizard. This type of installation is particularly useful for moving SSRS work from a development environment to a production environment. Microsoft provides both command-line tools and a graphical user interface for managing this metadata after installation. The GUI tool is called the Reporting Services Configuration Manager and is shown in Figure 20-2. You can see from this tool that you have the following configuration options: Service Account, Web Service URL, Database, Report Manager URL, E-Mail Settings, Execution Account, Encryption Keys, and Scale-Out Deployment.
Figure 20-2 Reporting Services Configuration Manager
You might be wondering exactly how the stripped-down HTTP listener works and provides the functionality that formerly required IIS. We’ll take a closer look at that next. After that, we’ll discuss some of the other core components of SSRS in more detail as well. First we’ll include another component diagram of SSRS from SQL Server Books Online in Figure 20-3. As we take a closer look at the architecture of SSRS, we’ll drill into the functionality provided by the core components shown in this diagram. Note that the diagram indicates core and optional (external) components of SSRS. As was the case in SQL Server 2005 SSRS, in 2008 both the Report Manager and the Web service components are built on the ASP.NET page framework. This allows experienced .NET developers to easily extend most of the core functionality programmatically if requirements necessitate such actions.
Figure 20-3 SSRS component architecture (from SQL Server Books Online). The diagram shows the HTTP listener, RPC, and WMI (Reporting Services WMI provider) entry points; the authentication layer; the Report Manager, Web service, and background processing areas with their report processing, model processing, scheduling, subscription and delivery, and database maintenance features; the UI, authentication, data, rendering, and report processing extensions; and the service platform (ASP.NET, application domain management, and memory management). The key distinguishes external, internal, and feature components.
New in SQL Server 2008 is the ability to interact with SSRS using Windows Management Instrumentation (WMI) queries. This is a welcome addition that makes administrative control more flexible.
HTTP Listener
New to SSRS 2008 is the use of the HTTP listener (also called by its file name, which is http.sys). The HTTP listener monitors incoming requests on a specific port on the system using http.sys. The host name and port are specified on a URL reservation when you configure the server. Depending on the operating system you're using, the port you specify can be shared with other applications. As we mentioned, this approach effectively removes the need for an IIS instance. This is an important improvement to SSRS and one that allows us to select SSRS as a business intelligence (BI) client for a greater number of customers who had previously objected to the IIS dependency.
The HTTP listener implements the HTTP 1.1 protocol. It uses the hosting capabilities that are built into the operating system—for example, http.sys itself. For this reason, SSRS requires operating systems that include http.sys as an internal component, such as Windows XP Professional, Windows Vista Business, Windows Server 2003, or Windows Server 2008. When the HTTP listener processes a request, it forwards it to the authentication layer to verify the user identity. The Report Server Web service is called after the request is authenticated. The most common way to configure the http.sys interface is by using the Reporting Services Configuration Manager shown earlier in Figure 20-2.
Report Manager
Report Manager is an administrative client application (Web site) that provides access to the Report Server Web service via pages in the included reporting Web site. It's the standard tool for viewing and managing Report Server content and operations when SSRS is configured in native mode. Report Manager can be used either locally or remotely to manage instances of Reporting Services, and it runs in a browser on the client computer. Session state is preserved as long as the browser window is open. User-specific settings are saved to the Report Server database and reused whenever the user connects to Report Manager. In addition to using Report Manager, you can also use the command-line tool rs.exe with scripts to automate administrative processes associated with reporting. Some of these include scheduled execution of reports, caching options, and more. For more information about using rs.exe and to see some sample scripts, see the SQL Server Books Online topics "rs Utility" and "Script Samples (Reporting Services)." If you configure Report Server to run in SharePoint integrated mode, Report Manager is turned off and will not function. The functionality normally provided by Report Manager is included in a SharePoint report library. SSRS 2008 no longer allows you to manage SSRS content from SSMS, so you must manage it through Report Manager or, if you're running in SharePoint integrated mode, through Office SharePoint Server 2007. In addition, new pages have been added to Report Manager for generating models, setting model item security, and associating click-through reports to entities in a model.
Report Server Web Service
The Report Server Web service is the core engine for all on-demand report and model processing requests that are initiated by a user or application in real time, including most requests that are directed to and from Report Manager. It includes more than 70 public methods for you to access SSRS functionality programmatically. The Report Manager Web site accesses these Web services to provide report rendering and other functionality. Also,
other integrated applications, such as the Report Center in Office SharePoint Server 2007, call SSRS Web services to serve up deployed reports to authorized end users. The Report Server Web service performs end-to-end processing for reports that run on demand. To support interactive processing, the Web service authenticates the user and checks the authorization rules prior to handling a request. The Web service supports the default Windows security extension and custom authentication extensions. The Web service is also the primary programmatic interface for custom applications that integrate with Report Server, although its use is not required. If you plan to develop a custom interface for your reports, rather than using the provided Web site or some other integrated application (such as Office SharePoint Server 2007), you'll want to explore the SQL Server Books Online topic "Reporting Services Web Services Class Library." There you can examine specific Web methods. In Chapter 22, we'll provide some examples of working directly with this API. For most of our BI solutions, we find that our clients prefer custom application development to the canned Web site included with SSRS.
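To give a flavor of what calling this API looks like from .NET code, the following hedged sketch lists the contents of the report server catalog through a proxy class generated from the ReportService2005.asmx management endpoint; the proxy class name, server URL, and namespace wiring depend on how you generate the Web reference in your own project.

// ReportingService2005 is a proxy class generated by adding a Web reference to
// http://yourserver/reportserver/ReportService2005.asmx (server name is a placeholder).
ReportingService2005 rs = new ReportingService2005();
rs.Credentials = System.Net.CredentialCache.DefaultCredentials;

// List every catalog item under the root folder, recursively.
CatalogItem[] items = rs.ListChildren("/", true);
foreach (CatalogItem item in items)
{
    Console.WriteLine(item.Path + " : " + item.Type);
}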
Authentication All users or automated processes that request access to Report Server must be authenticated before access is allowed. Reporting Services provides default authentication based on Windows integrated security and assumes trusted relationships where client and network resources are in the same domain or a trusted domain. You can change the authentication settings to narrow the range of accepted requests to specific security packages for Windows integrated security, use Basic authentication, or use a custom forms-based authentication extension that you provide. To change the authentication type to a method other than the default, you must deploy a custom authentication extension. Previous versions of SSRS relied on IIS to perform all types of authentication. Because SSRS 2008 no longer depends on IIS, there is a new authentication subsystem that supports this. The Windows authentication extension supports multiple authentication types so that you can precisely control which HTTP requests a report server will accept. If you’re not familiar with the various Windows authentication methods—NTLM, Kerberos, and so on—see http://www.microsoft.com/windowsserver2003/technologies/ security/kerberos/default.mspx for Kerberos and http://msdn.microsoft.com/en-us/library/ aa378749.aspx for NTLM for more information. Included authentication types include the following: ■■
■■ RSWindowsNegotiate Directs the report server to handle authentication requests that specify Negotiate. Negotiate attempts Kerberos authentication first and falls back to NTLM only if Active Directory cannot grant a ticket for the client request to the report server. If the first attempt results in an error rather than a missing ticket, the report server does not make a second attempt.
■■ RSWindowsKerberos Reads permissions on the security token of the user who issued the request. If delegation is enabled in the domain, the token of the user who is requesting a report can also be used on an additional connection to the external data sources that provide data to reports.
■■ RSWindowsNTLM Authenticates a user through an exchange of private data described as challenge-response. If the authentication succeeds, all requests that require authentication will be allowed for the duration of the connection. NTLM is used instead of Kerberos under the following conditions:
❏ The request is sent to a local report server.
❏ The request is sent to an IP address of the report server computer rather than a host header or server name.
❏ Firewall software blocks ports used for Kerberos authentication.
❏ The operating system of a particular server does not have Kerberos enabled.
❏ The domain includes older versions of Windows client and server operating systems that do not support the Kerberos authentication feature built into newer versions of the operating system.
■■ RSWindowsBasic Passes credentials in the HTTP request in clear text. If you use Basic authentication, use Secure Sockets Layer (SSL) to encrypt user account information before it's sent across the network. SSL provides an encrypted channel for sending a connection request from the client to the report server over an HTTP TCP/IP connection.
By default, only RSWindowsNegotiate and RSWindowsNTLM are enabled. Each of these authentication types can be turned on or off as necessary, and you can enable more than one type if you want the report server to accept multiple kinds of authentication requests. You enable other types of Windows authentication by editing the RSReportServer.config file. For specifics, see the SQL Server Books Online topic "How to: Configure Windows Authentication in Reporting Services." As mentioned, to use non-Windows authentication, you must also deploy a custom authentication provider. We're pleased to see the improved flexibility in configuring authentication mechanisms for SQL Server 2008 SSRS; it has often been a business requirement to implement some type of custom authentication in our production BI projects. Note In SQL Server 2008, Reporting Services does not support anonymous or single sign-on authentication unless you write and deploy a custom authentication provider.
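For example, the AuthenticationTypes element in RSReportServer.config controls which of the built-in types the report server accepts. The fragment below reflects the default configuration described above, with RSWindowsBasic added; treat it as an illustrative sketch rather than a complete copy of the file.

    <Authentication>
      <AuthenticationTypes>
        <RSWindowsNegotiate/>
        <RSWindowsNTLM/>
        <!-- Add RSWindowsBasic only if the connection is protected with SSL. -->
        <RSWindowsBasic/>
      </AuthenticationTypes>
    </Authentication>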
Background Processing (Job Manager)
Report Server also contains a job manager that enables background processing. Background processing refers to operations that run in the background and are initiated by Report Server itself. Most background processing consists of scheduled report processing and subscription delivery, but it also includes Report Server database maintenance tasks. Background processing for scheduling, subscription, and delivery is configurable and can be turned off through the ScheduleEventsAndReportDeliveryEnabled property of the Surface Area Configuration for the Reporting Services facet in Policy-Based Management. For more information, see the topic "How to: Turn Reporting Services Features On or Off" in SQL Server Books Online. If you turn those operations off, scheduled report or model processing will not be available until they're re-enabled. The Database Maintenance task is the only task that cannot be turned off because it provides core database maintenance functionality. Background processing operations depend on a front-end application or the Web service for their definitions. Specifically, schedules and subscriptions are created in the application pages of Report Manager, or on a SharePoint site if the report server is configured for SharePoint integration, and then they're forwarded to the Web service, which creates and stores the definitions in the report server database. All of these components work together to provide reporting functionality to administrators, developers, and end users, and together they make SSRS a viable, enterprise-capable reporting platform. After you've installed, configured, and verified your SSRS instance, you'll want to move on to the work of developing reports. In the next section, we'll take a look at using BIDS (which is just one of several ways to author reports for SSRS) to develop, preview, and deploy reports to SSRS Report Server.
Creating Reports with BIDS
To get started developing our first report, we'll use BIDS. As mentioned previously, we'll start by building reports from OLTP data sources so that we can first focus on learning how to use the report designer in BIDS. In the next chapter, we'll look at how BIDS is used to design reports for cubes and mining models. To get started, in the New Project dialog box, select the Business Intelligence Projects project type and then choose Report Server Project in the Templates area, as shown in Figure 20-4. This gives you a blank structure of two folders in Solution Explorer: one folder for shared data sources, and one folder for reports.
Figure 20-4 BIDS contains two SSRS development templates.
Developing a report for SSRS consists of two basic steps: defining the data source, and defining the report layout. Data sources can be either shared by all reports in a project or local to a specific report. Shared data sources are preferred to private (or report-specific) data sources, primarily because it's typical to develop reports on one set of servers with the expectation of deploying those reports to a different set of production servers. Rather than having to change connection string information in each report in a project, using shared data sources allows developers (or administrators) to update configuration information once for each group of reports in that project when those reports are deployed to production servers. To define a new shared data source in your project, simply right-click on the Shared Data Sources folder in Solution Explorer, and then click Add New Data Source on the shortcut menu. This opens the Shared Data Source Properties dialog box shown in Figure 20-5.
Figure 20-5 The Shared Data Source Properties dialog box
Note that data sources can be of any type that is supported by Reporting Services. By default, the available data source types are Microsoft SQL Server, OLE DB, Microsoft SQL Server Analysis Services, Oracle, ODBC, XML, Report Server Model, SAP NetWeaver BI, Hyperion Essbase, and TERADATA. For our example data source, click Microsoft SQL Server and then click the Edit button to bring up the standard connection dialog box. There you
enter the server name, database name, and connection credentials. We’ll use the sample relational database AdventureWorksDW2008 for this first sample report. After you complete the configuration, a new shared data source is added to the Report Server project. To add a new report, right-click the Reports folder and select Add, New Item, and then Report from the shortcut menu. This opens the report designer interface shown in Figure 20-6. Note that the report designer surface contains two tabs—Design and Preview. Also, near the bottom of the designer surface, there are two sections named Row Groups and Column Groups. These have been added to the SSRS 2008 BIDS report designer to make report design more intuitive.
Figure 20-6 The SSRS report designer
To build a report, you need data to display. To obtain the data, you need to provide a query in a language that is understood by the data source—that is, Transact-SQL for OLTP, MDX for OLAP, and so on. Query results are called datasets in SSRS. For the next step, open the Report Data window, which should be on the left side of the BIDS window. Select New and then Data Source from the toolbar at the top of the Report Data window. This opens the Data Source Properties dialog box shown in Figure 20-7.
Figure 20-7 The Data Source Properties dialog box
In this dialog box, you first reference a connection string or data source. You can either create a new (private) connection string using this dialog box or reference any shared data source that you've defined for this project. In this case, you should select the shared data source created earlier. Note also that you can optionally select the Use Single Transaction When Processing The Queries check box. Selecting this option causes all queries associated with the data source to execute as a single transaction. After you've configured your data source, you need to configure the login credentials for this particular data source. Click the Credentials link in the left pane to view the properties for the data source credentials, as shown in Figure 20-8. When using a shared data source, the controls here will be disabled because security information is defined in the shared data source. It's important to understand how these credentials will be used. The default is to use the credentials of the user requesting the report via Windows Integrated Authentication. Other choices are to specify a user name and password to be used every time the report is processed, to prompt the user to enter credentials, or to use no credentials. When you configure this dialog box, you're setting the design-time credentials. Be aware that, if your business requirements call for it, authorized administrators can change these settings at run time by using the SSRS administrative Web site to update the values associated with the connection string (data source).
Figure 20-8 The Data Source Properties dialog box for specifying credentials
Click OK to finish the data source configuration, and then right-click on the data source in the Report Data window and choose New Dataset to create a new dataset. Click the Query Designer button in the resulting dialog box. SSRS includes multiple types of query designers, and we'll detail those shortly. Because the shared data source you created earlier is relational, in this case the generic query designer (shown in Figure 20-9) opens. The type of query designer that opens depends on the type of data source—a SQL Server source opens the Transact-SQL designer, an SSAS source opens the MDX designer, and so on. Reporting Services provides various query design tools that can be used to create queries in the report designer. The kind of data that you're working with determines which query designer is available. In addition, some query designers provide alternate modes so that you can choose whether to work in visual mode or directly in the query language. Visual mode allows you to create queries using drag and drop or guided designers, rather than by typing in the query code ad hoc. There are five types of query designers, depending on the type of data that you're working with:
■■ Generic query designer The default query-building tool for most supported relational data sources.
■■ Graphical query designer Used in several Microsoft products and in other SQL Server components. It provides a visual design environment for selecting tables and columns.
It builds joins and the (relational) SQL statements for you automatically when you select which columns to use.
■■ Report model query designer Used to create or modify queries that run against a report model that has been published to a report server. Reports that run against models support click-through data exploration by authorized end users. The idea is to provide end users with a subset of source data against which they can click and drag to create reports based on further filtered subsets of the original data subset. The query that the end user creates by clicking and dragging objects (called entities, attributes, and so on) determines the path of data exploration at run time.
■■ MDX query designer Used to create queries that run against an Analysis Services or other multidimensional data source. This query designer becomes available when you create a dataset in the report designer that uses an Analysis Services, SAP NetWeaver BI, or Hyperion data source.
■■ DMX query designer Used to retrieve data from a data mining model. To use this query designer, you must have an Analysis Services data source that includes a data mining model. After you select the model, you can create data mining prediction queries that provide data to a report.
Figure 20-9 The generic query designer
An alternative to launching the query designer as just described is to open the dataset designer from the Report Data window menu. After you do that, you’ll see the Dataset Properties dialog box shown in Figure 20-10. Here you can define not only the source query, but you can also configure query parameters, report fields, report options, and report filters. We’ll look at the last few items in a bit more detail later in this chapter.
Figure 20-10 The Dataset Properties dialog box
Tip If you’re building reports using OLTP data sources, such as SQL Server, it’s best practice to first define the source query in the RDBMS as a stored procedure. Your reports will perform better if you use stored procedures rather than ad hoc Transact-SQL queries because of the built-in optimization characteristics associated with stored procedures. Also, limiting RDBMS access to stored procedures, rather than open SELECT statements, provides much tighter security and is usually preferred to granting full SELECT permission on RDBMS objects (tables, views, and so on). To continue the example, you next enter a simple query, such as SELECT * FROM DimCustomer. Click OK to leave the query designer, and click OK again to close the Dataset Properties dialog box. After the query is built and executed, you need to place items from the query onto the report. To do this, you need to view the Report Data window, which can be accessed either by selecting Report Data from the View menu or pressing Ctrl+Alt+D. You first need to select a type of report data container on the report designer surface. There are a couple of different types of containers into which you can place your data. The default is a table. In SSRS 2008, there a couple of new container types—notably the Tablix and Gauge containers. We’ll take a closer look at both of those later in this chapter and in the next one.
First we’ll build a basic tabular report. To create a simple tabular report from the dataset created earlier, right-click on the designer surface, select Insert, and then select Table. Doing this opens a blank tabular report layout. Using our dataset defined in the previous paragraphs, you can drag the Customer Key, Birth Date, Marital Status, and Gender fields from the Report Data window to the table layout area. Next you can apply formatting to the header, detail, and table name sections using the formatting toolbar or by selecting the cells of interest, right-clicking them, and then clicking Format. An example of a simple table is shown in Figure 20-11.
Figure 20-11 The tabular report designer
Also, you’ll see the results of your query displayed in the Report Data window on the left side of BIDS. It contains an entry for each dataset defined for this report. Each dataset is shown in a hierarchal tree view, with each displayable column name listed in the tree. From this designer, you can drag columns from the report dataset and place them in the desired location on the designer surface. You can also add more columns to create the basic report layout as shown in Figure 20-11. Each item on the report surface—that is, row, column, and individual text box—can be configured in a number of ways. The first way is by dragging and dropping one or more dataset values directly onto the designer surface. SSRS is a smart designer in that it automatically creates the type of display based on the location you drag the item to on the report surface. For example, on a tabular report, if you drop a dataset field onto a column in the table, it will show individual values in the detail rows and add the column name to the header area. This smart design also applies to automatic creation of totals and subtotals.
You can also manipulate items by configuring properties in the Properties window, which by default appears on the bottom right in BIDS. In addition, you can use the toolbars, which include a Word-like formatting toolbar. And finally, you can also right-click on the designer surface to open a shortcut menu, which also includes an option to configure properties. The report designer surface is straightforward to use and is flexible in terms of formatting options. New to SSRS 2008 are sections at the bottom of the report designer surface named Row Groups and Column Groups. You use these areas to quickly see (and change if desired) the grouping levels defined in your report. After the report is created to your satisfaction, you can click the Preview tab on the report designer surface to see a rendered version of the report, as shown in Figure 20-12. Of course, because we've applied very little formatting to this report, its appearance is rather plain. You'll certainly want to use the built-in capabilities to decorate your production reports with appropriate fonts, formatting, text annotation, and such to make them more appealing to their end-user audiences.
Figure 20-12 The report in preview mode
After you click the Preview tab, you might notice that the Output window opens in BIDS along with the rendered report. If there are no design errors in your report, it renders and is displayed on the Preview tab. If there are errors, they're listed in the Errors window in BIDS. And, as with other types of development, if you click on any particular error in the error list, BIDS takes you to the error location so that you can make whatever correction is needed.
To fix errors, you need to understand what exactly is happening when a report is built in BIDS. Unlike traditional development, building a report does not compile the Report Definition Language (RDL). Rather, the RDL is validated against the internal XML schema. If there are no invalid entries, the report is built successfully, as seen in Figure 20-12. As with traditional coding, fatal errors are shown using red squiggly lines and warning errors are shown using blue squiggly lines. If you've made some type of error when you created your report, you see a brief error description on the Preview tab (rather than the rendered report), and you can open the Errors window in BIDS (from the View menu) to see more detail about the particular error or errors. The Errors window lists all errors, ranked first by type—for example, fatal (or red)—and then includes a description of each error, the file where the error is located, and sometimes the line and column. Although you could open the RDL associated with the report by right-clicking the file name in Solution Explorer and then clicking View Code, you'll more commonly read the error description and then navigate to the GUI location in BIDS where you can fix the error. After you resolve any error or errors, you simply save your changes, click the Preview tab again, and the report will render. Note When using BIDS, in addition to being able to create new reports, you can also import report definition files that you've created using other tools. To import existing files, you right-click on the Reports folder in Solution Explorer, click Add, and then click Existing Files. You can import files in the *.rdl or *.rdlc format. To import Microsoft Office Access reports, you must have Access 2002 or later installed. Access reports can be imported by selecting Project, Import Reports, and then Microsoft Access. It's possible that some aspects of Access reports will not import correctly and might require manual correction.
Other Types of Reports
As mentioned, you're not constrained to using only a tabular format for reports. The Toolbox contains several types of containers and is shown in Figure 20-13. Note that you can select from Table, Matrix, List, Subreport, Chart, or Gauge.
Figure 20-13 The Toolbox in BIDS
The Gauge type is new to SQL Server 2008 SSRS. We most often use Table, Matrix, or Chart. List is used when you want to apply some type of custom formatting, such as a multicolumn display. List is a blank container and lets you apply your own layout. It's very flexible and a good choice if your business requirements call for some type of custom formatted output. You should also note that you can nest other container types—such as tables and charts—inside of a list. You might also have heard something about a new type of container, called a Tablix. We'll cover the Tablix container in more detail in Chapter 21, "Building Reports for SQL Server 2008 Reporting Services." At this point, we'll mention that when you select a table, matrix, or list, you're actually getting a Tablix data container. For our next example, we'll build a report using the new Gauge control. It's interesting to see more visually rich controls, such as the gauge, being added to the SSRS development environment. Given the challenge of building reports that appropriately present BI data to various end-user communities, we think this is a positive direction for SSRS in general. As with creating any report, to use the Gauge control you must, of course, first define a data source and a dataset. Do this following the steps described earlier in this chapter. We'll use a simple relational dataset from the sample database AdventureWorksDW2008, using the query SELECT * FROM DimCustomer to get a dataset for display. In the next step, we dragged the customer last name field onto the gauge display surface, and SSRS automatically applied the COUNT aggregate to this field. Figure 20-14 shows the output. In addition to displaying an aggregate value on the gauge itself, we've also chosen to show an SSRS system variable (the date and time of report execution), and we chose to include a report parameter, Co Name, in our particular report. In addition to the standard output, you can configure the many properties of this rich control by simply clicking on the item you want to configure—for example, pointer, scale, and so on—and then accessing its properties in the designer by using the Properties window or by using the shortcut menu to open a control-specific properties dialog box.
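The aggregate that SSRS applies when you drop the last-name field on the gauge is an ordinary report expression. Assuming the field is named LastName, as it is in DimCustomer, the gauge pointer value likely looks something like this:

    =Count(Fields!LastName.Value)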
Sample Reports
You might want to take a look at some sample reports available on CodePlex (see http://www.codeplex.com/MSFTRSProdSamples/Release/ProjectReleases.aspx?ReleaseId=16045) so that you can gain a better understanding of the various design possibilities. We think it will be valuable for you to see the various formatting and layout options as you consider to what extent you want to use SSRS as a client in your BI project. As with the other types of samples mentioned in this book—such as SSAS, SSIS, and so on—to work with the sample SSRS reports, you must download the samples and then install them according to the instructions on the CodePlex site.
Figure 20-14 The Gauge control rendered
Deploying Reports
After the report is designed to your satisfaction, it must be deployed to an SSRS server instance to be made available to applications and users. As we learned in the error-handling discussion earlier, a report consists of a single RDL file. If you choose to use shared data sources, report deployment also includes those RDS files. Deployment in SSRS simply means copying the RDL and RDS files from BIDS to your defined deployment location. To deploy a report project, you must first configure the development environment. To configure the SSRS development environment for report deployment, right-click on the report server project and then click Properties. This opens the Report Project Property Pages dialog box shown in Figure 20-15.
Figure 20-15 Report deployment properties
This dialog box is where you specify the location on the SSRS server for any reports that are contained within the project. There are four properties:
■■ Overwrite Data Sources Allows you to choose to always overwrite all existing data sources on the server or ignore them if they already exist. This is an important option to configure according to your requirements, especially if your data sources change during report development.
■■ TargetDataSourceFolder Allows you to choose the destination deployment folder on the SSRS server for data connection files.
■■ TargetReportFolder Allows you to choose the destination deployment folder on the SSRS server for report files.
■■ TargetServerURL Allows you to choose the destination deployment server.
Note If you’re upgrading from SQL Server 2005 SSRS, you should be aware that Microsoft has made substantial changes to the RDL that SSRS uses. To that end, reports created using BIDS or Report Builder from SQL Server 2005 must be upgraded to be able to run in SQL Server 2008 SSRS. If these reports were in an upgraded Report Server database or are uploaded to an SSRS 2008 instance, they will be automatically upgraded the first time they’re processed. Reports that can’t be converted automatically are processed using backward-compatible components. You can also convert older reports using BIDS 2008. If you choose not to convert older reports, you can run those reports by installing SQL Server 2005 SSRS side by side with SQL Server 2008.
After you have configured this information, right-click on the project name in Solution Explorer in BIDS and then click Deploy. As deployment proceeds, BIDS first attempts to build (that is, validate) all RDL and RDS files. After successful validation, BIDS copies those files to the configured deployment location. If you're using the default SSRS Web site to host your reports, you can then click on the URL for that Web site to see the reports in a browser. The default URL is http://<%servername%>/Reports/Pages/Folder.aspx. Of course, you might be using other hosting environments, such as Office SharePoint Server 2007, a custom Windows Forms application, and so on. We'll take a closer look at such alternatives in Chapters 22 and 23.
Summary
In this chapter, we discussed the architecture of SSRS. To that end, we took a closer look at the included components that support authoring, hosting, and viewing reports. These components include the Windows service; the Web service; the Web site; configuration and development tools; and metadata storage. We discussed best practices for installation and configuration. Next we walked through the process of developing a couple of reports using various container types in BIDS. Finally, we configured the properties needed to deploy completed reports. In the next chapter, we'll look more specifically at how best to use SSRS as a client for SSAS objects. There we'll look at the included visual MDX and DMX query designers. We'll also take a look at the use of container objects that lend themselves to SSAS object display—such as the new Tablix container.
Chapter 21
Building Reports for SQL Server 2008 Reporting Services
In this chapter, we take a look at the mechanics of creating reports based on SQL Server 2008 Analysis Services (SSAS) objects: OLAP cubes and data mining models. To that end, we examine using the included MDX Query Designer. Then we take a look at parameter configuration. We conclude this chapter by looking at the redesigned Report Builder report creation tool that Microsoft released in October 2008. We start by introducing best practices related to using the MDX Query Designer that is included with the SQL Server 2008 Reporting Services (SSRS) developer interface in Business Intelligence Development Studio (BIDS). This is a good starting point because, of course, implementing an effective query is the key to producing a report that performs well. Later in this chapter, we also look at the included DMX Query Designer.
Using the Query Designers for Analysis Services
After you open BIDS to create a new project of type Report Server Project and create a data source, you'll see the Reporting Services designer and Toolbox shown in Figure 21-1. As you begin to create your first report, you must decide whether you prefer to configure one connection specific to each report—that is, an embedded connection—or to define project-specific connections. We favor the latter approach for simpler management. To define project-specific connections, you create a shared data source by right-clicking on the Shared Data Sources folder in Solution Explorer. Then select Microsoft SQL Server Analysis Services from the drop-down list of data sources and provide a connection string. For the examples in this chapter, we'll continue using the Adventure Works DW 2008 OLAP database. Configure the connection credentials—that is, Windows, custom authentication, and so on—as you've done in previous connection configurations in BIDS. In this section, we cover two of the query designers available for Analysis Services: the MDX Query Designer, with its visual and manual modes, and the DMX Query Designer, which you use when you want to base your report on data mining results. Along the way, we also provide information on how to set parameters in a query designer.
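For reference, the connection string for such a shared SSAS data source is typically just a server and catalog reference. The following is a sketch only; the server name is a placeholder, and the catalog name assumes the sample database used in this chapter.

    Data Source=localhost;Initial Catalog=Adventure Works DW 2008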
Figure 21-1 Reporting Services designer surface in BIDS
MDX Query Designer
When you make (or reference an existing) connection to an SSAS data source by opening a new report and adding a dataset, the query designer opens by default in visual MDX query generation mode. You can switch to manual MDX or to DMX mode by clicking buttons on the query toolbar; we'll cover those scenarios later in this section. After you've successfully created a connection, you define a query for the report you're working on by opening the Report Data window. From there, choose New, Data Source and select a shared data source based on SSAS. Finally, choose New, Dataset, select the data source created in the previous step, and click the Query Designer button. This opens the MDX Query Designer. By default, the first cube in alphabetical order by name appears in the list of cubes in the upper-right corner of the query designer. If you want to change that value, click the Build (…) button to the right of the selected cube to open a dialog box that allows you to select any of the cubes contained in the OLAP database. After you've verified that you're working with the desired cube (Adventure Works for this example), you need to create the query. Tip Just as with other built-in metadata browsers, the browser included in the SSRS Query Designer includes a Measure Group filter. Because we like to design MDX queries using drag and drop to save time, we also frequently use the Measure Group filter to limit the viewable measures and dimensions to a subset of cube attributes.
As mentioned, there are two modes for working with the query designer when querying OLAP cubes. The default is a visual (or drag-and-drop) mode, as shown in Figure 21-2; SQL Server Books Online calls this design mode. We recommend using this mode because it greatly reduces the amount of manual query writing you need to do to build a reporting solution for SSAS and lets you generate reports much more quickly. The other mode is manual MDX entry, which we take a closer look at later in this section; SQL Server Books Online calls this query mode. New in SQL Server 2008 is the ability to import an existing MDX query into the query designer using the Import button on the toolbar.
Figure 21-2 SSAS MDX Query Designer in BIDS
Figure 21-3 shows a drag-and-drop query. To create this query, you first filter the metadata view to show only measures and dimensions associated with the Internet Sales group. Then drag the Internet Sales Amount measure to the designer surface, expand the Date dimension, and drag the Date.Calendar Year level from the Calendar folder onto the designer surface. Next drag the Sales Reason level from the Sales Reason dimension to the designer surface. Finally, configure the slicer (at the top of the designer) to use the Product dimension and Category hierarchy, and set the filter to the values Bikes and Accessories.
Figure 21-3 MDX SSAS Query Designer in BIDS showing query results
If you want to view or edit the native MDX query that you visually created, click the last button on the right on the toolbar (the Design Mode button). This switches the visual query designer to a native query designer (called query mode). You can then view the native MDX code and, optionally, also re-execute the query to view the results. For our example, the MDX query looks like that in Figure 21-4.
Figure 21-4 Manual MDX SSAS query in BIDS
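A simplified version of the kind of statement the designer generates for this example is shown below. Treat it as an approximation rather than a verbatim copy of Figure 21-4: the member keys for Bikes and Accessories are illustrative, and the DIMENSION PROPERTIES and CELL PROPERTIES clauses the designer also emits are omitted.

    SELECT
      NON EMPTY { [Measures].[Internet Sales Amount] } ON COLUMNS,
      NON EMPTY { [Date].[Calendar Year].[Calendar Year].ALLMEMBERS *
                  [Sales Reason].[Sales Reason].[Sales Reason].ALLMEMBERS } ON ROWS
    FROM ( SELECT ( { [Product].[Category].&[1], [Product].[Category].&[4] } ) ON COLUMNS
           FROM [Adventure Works] )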
By examining the generated query, you can see that the query designer automatically added the MDX NON EMPTY keyword to your query. Also, you can see that the dimension levels were included by name in the query (Sales Reasons), whereas specific dimension members were referenced by their key (Product Category). You might recall from Chapter 10, “Introduction to MDX,” and Chapter 11, “Advanced MDX,” that these naming properties vary depending on the type of object you’re using in the query. If you’re thinking, “Wow, that’s a long MDX query statement!” you’re not alone. We’ve said it before and we’ll say it again here: For improved productivity, make maximum use of the drag-and-drop MDX query designers in all locations in BIDS—in this case, in the SSRS Query
Designer. In addition to viewing the MDX query statement in query mode, you can also edit it. One word of caution: If you switch from design mode to query mode and then make manual changes to the query, you won’t be able to switch to design mode without losing any manual changes that you made while in query mode. Before you switch back, BIDS will generate a warning dialog box, as shown in Figure 21-5.
Figure 21-5 Query designer mode switch warning dialog box in BIDS
When you’re working in design mode, you can create local calculated MDX members. To do this, right-click on the lower left of the designer surface (the Calculated Members area) and select New Calculated Member from the shortcut menu. You are presented with a dialog box where you can write the MDX query for the calculated member. As with other MDX query writing, we recommend that you drag and drop metadata and MDX functions into the Expression area of the dialog box rather than manually typing the query. The interface for creating these calculated members is very similar to the one that you used when creating global calculated members using BIDS for SSAS. We’ll elaborate a bit on the concept of local versus global objects. Local means the objects are visible only to this one particular report. This differs from creating calculated members as objects for a particular cube using the cube designer Calculations tab in BIDS. Calculated members created inside the cube can be considered global, rather than calculated members that you might choose to create using the query designer in BIDS for SSRS. Calculated members are local to the specific report where they have been defined. We prefer to create global calculated members (in the OLAP cube definition that is using BIDS for SSAS) rather than local (specific to a report) members because the former are more visible, more reusable, and easier to maintain. The only time we use local (or report-specific) calculated members is when we have a very specific requirement for a very specific subset of users.
Setting Parameters in Your Query
We briefly interrupt the discussion of query designers to stress the importance of setting the right parameters in your query. You can enable one or more parameters in your query by selecting the Parameter option in the filter section at the top right of the query design work area. These parameters can be presented in the user interface as a blank text box or a drop-down list (showing a list you provide or one that is generated based on another query), and they can show a default value. You can also allow for the entry or selection of multiple values.
Figure 21-6 shows the same query that was shown in Figure 21-4, but with the Parameter option selected for the Product filter, now rendered in the query builder so that the MDX statement is visible (query mode). By examining the generated MDX after you select the Parameter option in the filter section, you can see that the MDX produced now includes a parameter value designated by the @ProductCategory value. In addition, an IIf function was added to return either the currently selected value or, by default, the currently displayed member value.
Figure 21-6 Manual MDX SSAS query in BIDS, which includes parameters
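The most important change the designer makes when you select the Parameter option is in the FROM clause, where the hard-coded member set is replaced with a reference to the parameter. A simplified sketch of that portion of the generated query follows; the full statement also defines caption and value members for the parameter dataset, which are omitted here.

    FROM ( SELECT ( STRTOSET(@ProductCategory, CONSTRAINED) ) ON COLUMNS
           FROM [Adventure Works] )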
After you enable parameters, the query designer adds several important lines of MDX to your query without you having to write them. Notice that when you view the generated query text (query mode), you can see two other buttons on the toolbar: the Query Parameters and Prepare Query buttons. These are the fifth and fourth buttons, respectively, from the right side of the toolbar, and they are available only when you're working in query mode. When you click the Query Parameters button, a dialog box appears that allows you to visually configure the available query parameters. Here you can specify parameter names; associate those names with dimensions, attributes, or hierarchies (using the Hierarchy section); allow multiple values; and set a default value. This dialog box is shown in Figure 21-7.
Figure 21-7 Query Parameters dialog box in the query designer for SSRS
The Prepare Query button acts much like the blue check button in SQL Server Management Studio (SSMS) Transact-SQL query mode—that is, when you click it, the query syntax is checked and any errors are returned to you via a pop-up message. One feature lacking in this view, as in the Transact-SQL query interface, is IntelliSense. If you need to author most of your MDX queries manually, we recommend that you obtain a query-writing interface that gives you more feedback to write your query properly, and then copy and paste the completed query into the SSRS Query Builder dialog box. Tip Although MDX IntelliSense is not included with BIDS, you can download tools from Mosha Pasumansky that help with MDX query writing and include MDX IntelliSense at this URL: http://sqlblog.com/blogs/mosha/archive/2008/08/22/intellisense-in-mdx-studio.aspx.
We remind you that although you should use the visual tools as much as possible in the SSRS MDX Query Designer to improve your query-writing productivity, there is no substitute for solid cube design and effective MDX query writing. No tool can overcome inefficient data structure design and inefficient query writing. Also, aggregation and storage design will factor into your query execution efficiency. Before we go on to look at report layout for OLAP cubes, we'll take a quick look at how the query designer works for data mining models.
DMX Query Designer
As mentioned, when you create a new dataset against an SSAS data source, the default query designer that opens in BIDS for SSRS is the MDX designer. You can switch to the DMX designer if you want to use data mining models as a basis for your report by clicking the second toolbar button (the pickaxe icon) from the left. Doing this displays a query interface that is identical to the one we reviewed when querying data mining models in BIDS. More specifically, this interface looks and functions exactly like the Mining Model Prediction tab for a particular mining model in BIDS for an SSAS database. Of course, in SSRS, you must first select the particular mining model you want to use as a basis for your report query. For our example, select the Targeted Mailing data mining structure from the Adventure Works DW 2008 sample database. From the structure, select the TM Decision Trees data mining model. Then in the Select Input Table pane in the designer, click the Select Case Table button and select the vTargetMail view from the list. As in BIDS, the interface in SSRS automatically creates joins between source and destination columns with the same names. Again, as in BIDS, if you need to modify any of the automatically detected join values, right-click on the designer surface and then click Modify Mappings to open a dialog box that allows you to view, verify, and update any connections. Also, identical to BIDS, you can right-click anywhere on the designer surface and select Singleton Query to change the input interface (which defaults to a table of values) to a singleton query input.
After you’ve configured both the source data mining model and the input table or singleton values, you use the guided query area on the bottom of the designer to complete the query. In our case, we are showing an example of a singleton query using the PredictProbability function, taking the [Bike Buyer] source column as an argument to this query function. Figure 21-8 shows this sample in the interface.
Figure 21-8 DMX Query Designer for SSRS in BIDS in design mode
As with the MDX query designer, if you want to view or edit the generated DMX, you can do so by clicking the last button on the right side of the toolbar (Design Mode). Clicking this button renders the query in native DMX. Finally, as with the MDX designer, you can use the Query Parameters button on the toolbar to quickly and easily add parameters to your DMX source query. Figure 21-9 shows what the singleton DMX query looks like in query mode in the query designer. When you run this query against the sample, you might get a value of 38 for age rather than the 37 shown. Using the Query Parameters button, we added a parameter to our query, named it YearlyIncome, and set the default value to 150000. Unlike adding a parameter to an MDX query, when you add a parameter to a DMX query, the generated DMX does not include that parameter. The parameter is visible in the Report Data window, and if you want to view or update its properties, you can do so by right-clicking on the parameter in the object tree and then clicking Properties. This lack of visibility in the DMX is because the parameter is a report parameter rather than a DMX parameter. We highlight this behavior because it differs from that of parameters in OLAP cubes (that is, MDX queries) and might be unexpected.
Figure 21-9 DMX Query Designer for SSRS in BIDS in query mode
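A simplified version of the kind of singleton DMX statement the designer produces for this example looks roughly like the following. The input values are illustrative, and the actual generated statement lists every input column of the model rather than just the two shown here.

    SELECT
      Predict([Bike Buyer]),
      PredictProbability([Bike Buyer])
    FROM [TM Decision Trees]
    NATURAL PREDICTION JOIN
    ( SELECT 35 AS [Age], 150000 AS [Yearly Income] ) AS t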
After you’ve written either your MDX or DMX query, you’re then ready to lay out the query results on a report designer surface. You saw a basic table layout in Chapter 20, “Creating Reports in SQL Server 2008 Reporting Services.” Now we’ll take a closer look at laying out reports that use OLAP cubes as a data source, which is the most common approach in business intelligence (BI) projects. In the next section, we’ll do just that, using both of the sample MDX and DMX queries that we just built as a basis for laying out report objects. We’ll start with the sample MDX query in the next section.
Working with the Report Designer in BIDS
As we explained in Chapter 20, when you work with the report designer in BIDS, you use the Design and Preview tabs in the main work area to lay out and preview your reports. Recall also from Chapter 20 that you're actually creating Report Definition Language (RDL) metadata when you're dragging and dropping items onto the designer area and configuring their properties. Because RDL is an XML dialect, keep in mind that all information is case sensitive. We find that case errors are a common source of subtle bugs in reports, so we start our discussion of report design with this reminder. In addition to working with the designer surface, you'll also use the Toolbox to quickly add data and other types of visual decoration, such as images, to your reports. In SQL Server 2008, there are two new or improved types of report items. In Chapter 20, we took a brief look at the new Gauge control. In this chapter, we explore the expanded capabilities of both the table and matrix data regions. Before we do that, however, let's examine another window you'll want to have open when designing reports—the Report Data window. To open it, go to the View menu and click the Report Data option. Figure 21-10 shows the Report Data window populated with the datasets we created using the MDX query shown in the previous section. The Report Data
window in SQL Server 2008 replaces the Data tab in the SSRS BIDS designer used in SQL Server 2005.
Figure 21-10 Report Data window in BIDS
By taking a closer look at the Report Data window, you can see that in addition to the fields defined in your MDX query—that is, Calendar_Year, Sales_Reason, and so on—you also have access to the fields defined as parameters in your query, both as parameters and as dataset values. The configuration options differ for parameters and datasets. In addition to these fields, SSRS includes a number of built-in fields that you can use in your report definitions. These include the fields shown in Figure 21-10, such as Execution Time, Page Number, and so on. Note If you do not see the second dataset for the ProductCategory parameter, you might have to right-click on DataSource1 and select the Show Hidden Datasets option, or right-click in an empty spot in the Report Data window and select Show All Hidden Datasets.
If you want to further configure any of the items displayed in the Report Data window, click the item of interest and then click Edit at the top of the window to open its Property configuration window. If the Edit button is disabled, there are no editable properties for the selected item. Configuration options vary depending on the type of object selected—that is, field, table, parameter, and so on. We’ll take a closer look first at the DataSet Properties dialog box. Select the ProductCategory dataset, and click Edit to open the dialog box. You can view or change the source query, add or remove parameters, add or remove fields, set advanced options (such as collation), and define dataset filters. This last item is particularly interesting for BI-based reports because source queries can return large, or even huge, datasets. It might be more efficient to filter (and possibly cache) the source query information on a middle-tier
SSRS server than to continually query an SSAS source server on each report render request. We’ll go into more detail about scalability considerations in Chapter 22, “Advanced SQL Server 2008 Reporting Services.” For now, we’ll just examine the mechanics of creating a filter on a dataset. To create a filter on a dataset, select the Filters option in the DataSet Properties dialog box and then click Add. After you select the expression ([ParameterValue], for this example) and operator (=), you define the value. Here you can supply a static value or use an expression to define the value. To define an expression, click the fx button, which opens the Expression dialog box shown in Figure 21-11. Note that IntelliSense is available in this dialog box, which is a welcome addition to the SSRS interface.
Figure 21-11 Expression dialog box in SSRS in BIDS
The syntax in Figure 21-11 equates to setting the value to the first value in a field (Fields!) collection, where the field collection is based on the ParameterValue field from the ProductCategory dataset. The Expression editor colors strings (which should be delimited using double quotes) brown and other syntax black. It also shows you syntax errors by adding red (fatal) or blue (warning) squiggly lines underneath the syntax error in the statement. Note If you use expressions to define values anywhere in an SSRS report, the syntax must be correct. If not, the report will not render and an error will be displayed in the Errors window in BIDS.
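For reference, the expression shown in Figure 21-11 likely resembles the following; the field and dataset names match the parameter dataset created by the MDX query designer in this example.

    =First(Fields!ParameterValue.Value, "ProductCategory")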
Before we leave the Report Data window, let’s look at the configuration options for defined parameters. Select the ProductCategory parameter in the Parameters folder, and click Edit to see the Report Parameter Properties dialog box. This dialog box includes general properties, such as setting the prompt value and data type, as well as options for specifying the source for the parameter values, such as the label value, and the value to use as a key. Using this Properties dialog box, you can also specify the refresh rate for parameter values. The default is set to Automatically Determine. Other options are Automatically Refresh and Never Refresh. If your report includes long parameter lists, manually configuring the refresh rate can affect report performance. After you’re satisfied with the data fields available for your report, you’ll want to select data regions and other visual elements and begin to lay out your report.
Understanding Report Items
The next step in report creation is to select the items and data you want to display on your report. The quickest way to add layout items to your report is to drag them from the Toolbox onto the designer surface. The Toolbox is shown in Figure 21-12. Some of the report items can display data; these are known as data regions in SSRS. The SSRS 2008 Toolbox includes the Table, Matrix, Textbox, List, Image, Subreport, Chart, and Gauge controls. Of the data-bindable controls, some (such as Table, Matrix, and so on) contain intelligent formatting and display capabilities; others (such as List) contain little or no formatting. List allows you to define specialized layouts, such as repeating multicolumn tables.
Figure 21-12 Toolbox for SSRS in BIDS
Most frequently, we use the Table, Matrix, Chart, and Gauge data region types for BI reports. Figure 21-13 shows the Chart and Gauge controls on the designer surface in their initial configurations. Note that you can drag as many controls as you want onto the report designer surface. Also, as mentioned previously, you can nest certain types of controls inside other controls. The Gauge control is new to SQL Server 2008 and brings richer native data visualization to SSRS. In many of our past projects, we chose to purchase third-party report controls to enhance the look and feel of end-user reports, so we see the inclusion of richer visual controls as a very positive change for SSRS.
Figure 21-13 Common data regions for SSRS in BIDS
List and Rectangle Report Items
The list and rectangle report items have some similarities, the most notable of which is that you can nest other data-bindable data regions inside of either one—that is, place a table inside a list or have multiple matrices inside of a rectangle to more precisely control the data display and layout. However, there are some differences between these two report item types. A rectangle is a simple control with no particular format or dataset associated with it. You use it when you want to control layout. A list functions similarly, but it's a type of Tablix container and is associated with a dataset. We explain exactly what that means in the next section.
Tablix Data Region
New to SQL Server 2008 is the Tablix data region. Probably the first consideration you'll have is how to implement this type of data region in SSRS because it does not appear in the Toolbox. As we mentioned in Chapter 20, when you drag data regions of type Table, Matrix, or List onto the designer surface, you're actually working with an instance of a Tablix data region. Each of these variations on a Tablix data region is simply formatted in a different starting default configuration. In other words, if you drag a Table data region onto the designer surface, you get a Tablix data region that is presented in a (starting) table configuration; if you drag a Matrix data region, you get a Tablix data region that is formatted to look like a matrix; and so on.
MSDN explains Tablix data regions as follows: Tablix enables developers to generate reports that combine fixed and dynamic rows. Previously, layouts of this kind had to be developed by using multiple matrix data regions and shrinking row headers. Support for Tablix data regions simplifies the inclusion of combined static and dynamic data in reports, and extends the formatting and layout capabilities of Reporting Services significantly. You can read more about Tablix data regions at http://msdn.microsoft.com/en-us/library/bb934258(SQL.100).aspx. What does this really mean for you? It means that you can easily evolve one type of structure—for example, a table with rows and columns—into a matrix with rollup totals in both the columns and rows areas. This flexibility is terrific for reports based on OLAP cubes because it's quite common for tabular reports to evolve into more matrix-like structures. Now let's build a report using a Tablix data region. We start with a Table because that's the type of control we most often start with when displaying OLAP data. We'll continue to work with the parameterized query that we showed earlier in this chapter. After dragging the Table control onto the designer surface, populate the fields with some data values (Calendar_Year, Sales_Reason, and Internet_Sales_Amount) by dragging them from the Report Data window. In this sample, we've included some built-in fields, such as UserID, ReportName, and ExecutionTime. These values are populated at run time, and we frequently include them in production reports. This is shown in Figure 21-14.
Figure 21-14 Designing a basic OLAP report in BIDS
Of course you’ll probably spend much more time formatting your report output. You have many different ways to add formatting to the items included in your report. We often use the Properties dialog box associated with each item—that is, Tablix, Textbox, Image, and
so on—or the Formatting toolbar. We won't cover report object formatting in any more detail here because it's pretty straightforward. You'll also want to understand how to use the Tablix features. Using our sample, you can right-click on the [Internet_Sales_Amount] cell, click Add Group, and then click Parent Group in the Column Group section of the menu. Select [Internet_Sales_Amount] as the Group By field in the resulting dialog box, and select the Add Group Header check box. This adds a column grouping level on the [Internet_Sales_Amount] field. If you were to select the same options in the Row Group section of the menu, you would create a row grouping on the [Internet_Sales_Amount] field. Figure 21-15 shows only the single field we've mentioned and the redesigned shortcut menus containing the available options for adding grouping rows or columns to your report.
Figure 21-15 Shortcut menus showing Tablix options in BIDS
After you select a text box containing an aggregate, the data region identifies the groups in scope for that aggregate by adding an orange bar to the designer surface. Figure 21-16 shows the design output after you add the grouping level previously described.
Figure 21-16 Adding a grouping level
The flexibility available in Tablix data regions will improve your productivity when you create reports based on OLAP data sources because you'll frequently encounter changes in requirements as the project progresses. That is, your customers will frequently ask questions such as, "Can you now summarize by x factor and y factor?" In addition to using the Tablix functionality, you can show or hide grouping levels based on other factors by configuring the grouping-level properties. You can do this in a number of ways. One way is to access the Visibility property of the particular group (for example, row or column) and then configure Visibility to be one of the following values: Show, Hide, or Show Or Hide Based On An Expression. You can also select the Display Can Be Toggled By This Report Item option if you want to allow users to expand the amount of detail on the report. This is a common requirement because report summary levels are often determined by the level of detail that a particular level of management wants to view. For example, it's common for senior managers to want to view summary data, whereas middle managers often prefer to drill down to the level or levels that they are most closely associated with. You can also edit the displayed information in the Row Groups and Column Groups sections, new in SSRS 2008, which appear at the bottom of the report designer. These sections give you another place to add grouping levels to your report, edit them, or delete them. We encourage you to explore the newly added shortcut menus in the Row Groups and Column Groups sections. Menu options include adding, editing, and deleting groupings on rows and on columns. Additionally, if you decide to add a new grouping to your report via these shortcut menus, you can also select where you'd like to add it. The options are Parent Group, Child Group, Adjacent Before, and Adjacent After. In general,
we really like the usability enhancements included in the SSRS report designer in BIDS and believe that if you master the UI you'll improve your report-writing productivity. Because we are discussing changes and improvements to SSRS controls, we'll mention that in addition to the new Gauge control, another significant improvement in SSRS 2008 is the overhaul of the Chart control. You can now choose from a greater variety of chart types, and you have more control over chart properties, which gives you more options for presenting data visually. Along with standard chart types—that is, column charts and pie charts—there are many new chart types you can use in your reports. Here is a partial list of the new chart types in SSRS 2008: stepped line, range, exploded pie, polar, radar, range column/bar, funnel, pyramid, and boxplot.
Using Report Builder Report Builder is a simplified report designer application. It was introduced in SQL Server 2005; however, it has been completely redesigned in SQL Server 2008. Report Builder was released separately from SQL Server 2008, in October 2008. This text is based on Report Builder version 2.0 RC1, which is available for download from http://www.microsoft.com/downloads/details.aspx?FamilyID=9f783224-9871-4eea-b1d5-f3140a253db6&displaylang=en. The released version's features might differ slightly from the following discussion. The first thing you'll notice is that the design of the interface is quite similar to the design of the report work area in BIDS. Among other UI changes, Microsoft has now included a ribbon-like menu interface. You'll also notice that the Data window is nearly identical to the Report Data window in the report designer in BIDS. Also, at the bottom of the designer surface there are sections for quick configuration of row groups and column groups, just like those that have been added to the SSRS designer in BIDS. In addition to what is immediately visible on the report designer surface, you'll also find that the properties dialog boxes have been redesigned from the previous version so that they now match those found in the SSRS designer in BIDS. All of these UI changes result in better productivity for report authors, whether they use the report designer in BIDS or Report Builder. Figure 21-17 shows the opening UI of the redesigned Report Builder.
Figure 21-17 Report Builder user interface
Report Builder includes the MDX Query Designer that we saw earlier in BIDS when you connect to an Analysis Services data source. For simplicity, we’ll create a nonparameterized report using the Adventure Works sample cube to walk you through the steps of creating a sample report. Click the Table Or Matrix icon on the designer surface to open the New Table Or Matrix Wizard. The first step is to create a connection (data source). Then define a query (dataset) to provide the data that you want to display on your report. Figure 21-18 shows the visual MDX Query Designer. As with the report designer in BIDS, you can either design MDX queries visually or click the last button on the right side of the toolbar to change to manual query mode. After you’ve written your query using the Design A Query page of the New Table Or Matrix Wizard, on the Arrange Fields page of the wizard you lay out your results in the Tablix data region. The wizard interface is well designed and allows you to intuitively lay out fields onto the rows or columns axis. In addition, you can change the default measure aggregation from SUM to any of 14 of the most common aggregate functions—such as MIN, MAX, and so on—by clicking on the downward pointing triangle next to the measure value. This interface is shown in Figure 21-19.
Figure 21-18 Report Builder visual MDX Query Designer
Figure 21-19 The Arrange Fields page of the New Table Or Matrix Wizard
On the Choose The Layout page of the wizard, you configure selections related to displaying subtotals or groups (with optional drilldown). Lastly, you can apply a predefined formatting style to your report. As an example, we’ve applied a bit more formatting, such as selecting text and marking it as bold, and we show you the results on the designer surface shown in Figure 21-20.
If you don’t want to use the default layout, you can format any or all sections of the Tablix control display by right-clicking on the section of interest and then configuring the associated properties. This behavior is nearly identical, by design, to the various formatting options that we’ve already looked at when you use BIDS to format reports.
Figure 21-20 Report Builder with a chart-type report on the designer surface
Summary In this chapter, we described how to use SSRS as a report interface for BI projects. We investigated the MDX visual and manual query interfaces built into BIDS. Then we examined the redesigned development environment, including the new Report Data window. After that, we built a couple of reports using BI data and the new or improved data region controls. We talked about the Tablix data region and populated it with the results of an MDX OLAP cube query. We continued by talking about the future of the Report Builder client. In the next chapter, we’ll look at advanced topics in SSRS related to BI projects, including implementing custom .NET code and using the new Microsoft Office Word and Excel 2007 report viewing and exporting capabilities. We’ll then describe how to embed the report viewer controls in a custom Windows Forms application. We’ll wrap up our look at SSRS by reviewing a sample of coding directly against the SSRS API and looking at best practices related to performance and scalability of SSRS.
Chapter 22
Advanced SQL Server 2008 Reporting Services In this chapter, we take a look at some advanced concepts related to using SQL Server Reporting Services (SSRS) in the Business Intelligence Development Studio (BIDS) environment. This includes integrating custom code modules and property configurations. We also examine the new functionality for viewing SSRS reports in Microsoft Office Word and Excel 2007. Then we take a look at integrating SSRS into custom hosting environments, including Windows Forms and Web Forms. We also look at URL access, embedding the report viewer controls, and directly working with the SSRS Simple Object Access Protocol (SOAP) API. Coverage of these topics is followed by a discussion of deployment, which includes scalability concerns. We conclude the chapter with a look at some of the changes to memory architecture and the Windows Management Instrumentation (WMI) API in SSRS.
Adding Custom Code to SSRS Reports We open our chapter with an advanced topic—using custom .NET code in SSRS reports. An example of a business scenario for which you might choose to write custom code is one where you need complex, custom processing of input data, such as parsing XML documents used as input for SSRS reports. There are two approaches to doing this. Using the first approach, you simply type (or copy) your code into the particular report of interest (in the Report Properties dialog box) and run the .NET code as a script—that is, the code is interpreted each time the method or methods are called. The second approach, and the one we prefer, is to write (and debug) code in Microsoft Visual Studio and then deploy the module to the global assembly cache (GAC) on the report server. We use our preferred .NET language (C#) to write this custom code; however, you can use any .NET language as long as it is compiled into a DLL. You are simply creating business logic that can be reused by SSRS, so you will normally use the class file template in Visual Studio to get started. You then write, debug, and optimize your class file or files. After you've written and built your class file, you then edit the SSRS configuration file (rssrvpolicy.config) to add a reference to that DLL. Then reference the particular assembly from the Report Properties dialog box mentioned earlier (and shown in Figure 22-1). Because you're now working with a compiled assembly, you gain all the advantages of working with compiled code—that is, performance, security, and scalability. To implement the business logic
in the report, you then invoke one or more of the methods included in your class file from your SSRS report logic.
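To make this concrete, here is a minimal sketch of the kind of class you might compile into such an assembly. The namespace, class, and method names are hypothetical; the only assumptions are that the assembly is strongly named (required for GAC deployment) and that the members you call from report expressions are publicly accessible.

using System;

namespace ReportUtilities
{
    // Hypothetical helper class. Static members are the easiest to call from an
    // SSRS report expression because no instance needs to be created.
    public static class Formatting
    {
        // Maps a calendar date to a fiscal-quarter label, assuming a fiscal year
        // that starts on July 1 (adjust the offset for your organization).
        public static string ToFiscalQuarter(DateTime date)
        {
            int fiscalMonth = ((date.Month + 5) % 12) + 1;   // July -> 1, June -> 12
            int quarter = ((fiscalMonth - 1) / 3) + 1;
            return "FY Q" + quarter;
        }
    }
}

After the assembly is referenced in the Report Properties dialog box, a report expression such as =ReportUtilities.Formatting.ToFiscalQuarter(Fields!OrderDate.Value) invokes the method; the field name is, again, only illustrative.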
Figure 22-1 SSRS Report Properties dialog box
To implement the functionality defined in the assembly, you access its members via a report expression. Most of the time, you create your assembly's members as static—that is, belonging to the class or type, rather than being associated with a particular instance of that class. This makes them simpler to call from SSRS because they are global and need not be instantiated on each use. The syntax for calling a static member is =Namespace.Class.Method. If you are calling a method on a particular instance, the syntax in SSRS is =Code.InstanceName.Method. Code is a keyword in SSRS. If you need to perform custom initialization of the object, you override the OnInit method of the Code object for your report. In this method, you create an instance of the class using a constructor, or you call an initialization method on the class to set any specific values. You do this if you need to set specific values to be used on all subsequent calls to the object, such as setting one or more members to default values for particular conditions. (For example, if the report is being executed on a Saturday or Sunday, set the Weekday property to False.) The SSRS samples available on CodePlex demonstrate several possible uses of this type of code extension to SSRS. These include custom extensions for rendering (such as printing), authentication, and more. These samples can be downloaded from the following location: http://www.codeplex.com/MSFTRSProdSamples. Note that if you're working with a custom assembly, you must reference both the namespace and the class or instance name in your report. Another consideration when deploying custom
code as an assembly is the appropriate use of code access security (CAS). This is because assemblies can perform operations outside of the application boundary, such as requesting data from an external source, which could be a database, the file system, and so on. CAS is a policy-based set of code execution permissions that is used in SSRS. CAS default permissions are set in the multiple *.config files used by SSRS, such as rssrvpolicy.config. For more information, see the SQL Server Books Online topic "Understanding Code Access Security in Reporting Services." Custom code modules are typically used to provide the following types of functionality: custom security implementation, complex rendering, and data processing. The most common use of this type of functionality that we've provided for our clients revolves around custom data processing (which occurs prior to report rendering). Specifically, we've used custom code modules in SSRS to implement complex if…then…else or case statements that transform data prior to rendering. Another business scenario that has caused us to implement custom report extensions is one in which the customer wants to create an "export to printer" mode of report rendering. This new rendering extension, after being properly coded and associated with your SSRS instance, appears as an additional choice for rendering in all client drop-down list boxes, such as in the SSRS Report Manager Web site. CodePlex has a well-documented sample of this functionality to get you started if you have a similar requirement. You can find this sample at http://www.codeplex.com/MSFTRSProdSamples. After you've completed report development, your next consideration is where you'd like to host and display your reports so that end users can access them. As we've mentioned, the default hosting environment is the Report Manager Web site provided with SSRS. Although a couple of our clients have elected to use this option, the majority have preferred an alternate hosting environment. We've found that the primary drivers of host environment selection are security model, richness of output, and sophistication of end-user groups. Another consideration is whether other Microsoft products that can host business intelligence information have already been deployed, such as Excel or Office SharePoint Server 2007. In Chapter 23, "Using Microsoft Office Excel 2007 as an OLAP Cube Client," we'll cover Excel hosting in detail. In Chapter 25, "SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007," we'll discuss Office SharePoint Server 2007 (Report Center) hosting. In the next few sections of this chapter, we'll take a look at some hosting alternatives to those we've just listed. We'll start by exploring direct Office 2007 viewing.
Viewing Reports in Word or Excel 2007 The ability to render SSRS reports in Office file formats has been significantly enhanced in SSRS 2008. In SSRS 2005, Excel was the only Office format you could render to. In SSRS 2008, you can also render to a Word (2000 or later) format. Also, the Excel rendering has been enhanced. The
simplest way to use the new Word rendering is to view a report in the default Web site, select the Word format from the list of included rendering options, and then click the Export link to save the report in a .doc format. This is shown in Figure 22-2.
Figure 22-2 Render formats in SSRS 2008 now include Word.
In addition to the inclusion of Word rendering, SSRS 2008 has improved existing renderers for Excel and CSV. Excel rendering with SSRS 2008 now supports nested data regions such as subreports. In SSRS 2005, a matrix rendered to a CSV file produced a cluttered format that was difficult to use. The output for this format has been cleaned up and simplified in 2008, and it is now easier to import into applications that support CSV files. If you want to further customize the SSRS output that is rendered to either a Word or an Excel format, you can use the Visual Studio Tools for Office (VSTO) template in Visual Studio 2008 to programmatically extend BI reports rendered in either of the supported formats. For more information (including code samples) about custom VSTO development, see the following link: http://code.msdn.microsoft.com/VSTO3MSI/Release/ProjectReleases.aspx?ReleaseId=729. VSTO is discussed in more depth in Chapter 23. You can choose to create your own Web site to host BI reports. As mentioned, there are a couple of approaches to take here. The simplest is to use URLs to access the reports. We'll talk about that next.
URL Access Because of the simplicity of implementation, many of our customers choose to host BI reports using SSRS and link each report's unique URL to an existing Web site. Microsoft has made a couple of useful enhancements to the arguments you can pass on the URL. Before we cover that, let's take a look at a sample URL so that you can understand the URL syntax used to fetch and render a report: http://servername/reportserver?/SampleReports/Employee Sales Summary&rs:Command=Render&rs:format=HTML4.0 This example fetches the information about the Employee Sales Summary report and renders it to an HTML 4.0 output type. For a complete description of the URL syntax, see the SQL Server Books Online topic "URL Access" at http://msdn.microsoft.com/en-us/library/ms153586.aspx. Although it's typical to use URL access for SSRS BI reports from Web applications, it's possible to use this method from custom Windows Forms applications. The latter approach still requires the use of a Web browser (usually embedded as a WebBrowser control on the form). The SQL Server Books Online topic "Using URL Access in a Windows Application," which is available at http://msdn.microsoft.com/en-us/library/ms154537.aspx, details exactly how to do this. New to SQL Server 2008 is the ability to work with estimated report total page counts through URL access. This functionality has been added because page rendering has changed in this version of SSRS and it's not necessarily possible to know the full range of page numbers as the report is processed. By using URL access, you can provide the argument &rs:PageCountMode=Estimate to use an estimated page count or &rs:PageCountMode=Actual to use the actual page count. Of course, more overhead is associated with the Actual value because all pages must be rendered in order to obtain a final page count. The SSRS interface reflects this change in the UI by adding a question mark after the page count, as shown in Figure 22-3. If the page count is an actual count, the question mark will not appear in the SSRS interface. The URL syntax needed if you want to use an estimated page count is as follows: http://servername/reportserver?/Adventure Works Sales/Sales Person Directory&rs:PageCountMode=Estimate.
Figure 22-3 Estimated pages in SSRS interface
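If your application generates these URLs, simple string composition is all that is required. The following sketch builds a URL that requests rendering to PDF with an estimated page count; the server name and report path are placeholders, and encoding the path guards against spaces and other special characters.

using System;
using System.Web;   // add a reference to System.Web.dll for HttpUtility

class UrlAccessSample
{
    static string BuildReportUrl()
    {
        string server = "http://servername/reportserver";                    // placeholder
        string reportPath = "/Adventure Works Sales/Sales Person Directory"; // placeholder

        // The report path and the rs: arguments are passed on the query string.
        return server + "?" + HttpUtility.UrlEncode(reportPath) +
               "&rs:Command=Render" +
               "&rs:Format=PDF" +
               "&rs:PageCountMode=Estimate";
    }
}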
Another consideration when implementing URL access is how you choose to handle security credentials. You have a couple of options to address this. Most of our clients have chosen to implement Windows authentication for their intranet scenarios and to configure the included SSRS top-level URL as part of the intranet security zone so that the credentials of the logged-on user will be automatically passed to SSRS. If you're implementing URL access in an Internet scenario (or a scenario where the user needs to provide logon credentials), you should set the SSRS data source to prompt for credentials. You can then include the credentials as part of the URL by including the prefix:datasourcename=value parameter string, where prefix is either dsu (for user name) or dsp (for password) and datasourcename is the name of the data source for which to supply credentials. For example, if you want to connect to the AdventureWorks2008 data source with a user name of testuser and a password of password, you include the following values in the URL: dsu:AdventureWorks2008=testuser&dsp:AdventureWorks2008=password. Of course, you should be using Secure Sockets Layer (SSL) encryption in these scenarios so that the transmission is encrypted. The user name and password should have the minimum privileges required to get the data for the report, such as read-only access to the database. Alternatively, you could supply the credentials via other methods, such as programmatically, if the security risk of passing credentials on the URL is unacceptable. Another way to implement URL access is to use the Microsoft ReportViewer controls in either a Web Forms or Windows Forms application to display a BI report hosted in SSRS. We'll detail the process to do this in the next section.
Embedding Custom ReportViewer Controls Microsoft provides two controls in Visual Studio 2008 that allow you to embed SSRS reports (or link to an existing SSRS report hosted on an SSRS instance) in your custom Windows Forms or Web Forms applications. Alternatively, you can also design some types of reports from within Visual Studio and then host them in your custom applications. In Visual Studio 2005, Microsoft provided two different tools in the Toolbox to represent the types of this control; in Visual Studio 2008, the different modes of report access have been incorporated into a single control in the Toolbox for each type of client application. These controls appear in the Reporting section of the Toolbox in Visual Studio 2008 when you use either the ASP.NET Web Application or Windows Forms Application project templates. Note If you plan to use the SSRS ReportViewer control, you can install both Visual Studio 2008 and SQL Server 2008 on the same physical machine. A full version of Visual Studio (not the Express edition) is required, and if you install both you need to install Service Pack 1 (SP1) for Visual Studio 2008 to ensure compatibility between BIDS and Visual Studio 2008.
The two report processing modes that this control supports are remote processing mode and local processing mode. Remote processing mode allows you to include a reference to a report that has already been deployed to a report server instance. In remote processing mode, the ReportViewer control encapsulates the URL access method we covered in the previous section. It uses the SSRS Web service to communicate with the report server. Referencing deployed reports is preferred for BI solutions because the overhead of rendering and processing the often large BI reports is handled by the SSRS server instance or instances. Also, you can choose to scale report hosting to multiple SSRS servers if scaling is needed for your solution. Another advantage to this mode is that all installed rendering and data extensions are available to be used by the referenced report. Local processing mode allows you to run a report from a computer that does not have SSRS installed on it. Local reports are defined differently within Visual Studio itself, using a visual design interface that looks much like the one in BIDS for SSRS. The output file is in a slightly different format for these reports if they're created locally in Visual Studio. It's an *.rdlc file rather than an *.rdl file, which is created when using a Report Server Project template in BIDS. The *.rdlc file is defined as an embedded resource in the Visual Studio project. When displaying *.rdlc files to a user, data retrieval and processing is handled by the hosting application, and the report rendering (translating it to an output format such as HTML or PDF) is handled by the ReportViewer control. No server-based instance of SSRS is involved, which makes it very useful when you need to deploy reports to users who are only occasionally connected to the network and thus wouldn't have regular access to the SSRS server. Only PDF, Excel, and image-rendering extensions are supported in local processing mode. If you use local processing mode with some relational data as your data source, a new report design area opens up. As mentioned, the metadata file generated has the *.rdlc extension. When working in local processing mode in Visual Studio 2008, you're limited to working with the old-style data containers—that is, table, matrix, or list. The new combined-style Tablix container is not available in this report design mode in Visual Studio 2008. Both versions of this control include a smart tag that helps you configure the associated required properties for each of the usage modes. Also, the ReportViewer control is freely redistributable, which is useful if you're considering using either version as part of a commercial application. There are several other new ReportViewer control features in Visual Studio 2008. These include new features for both the design-time and run-time environments. Design-time features include the new Reports Application project template type. This is a project template that starts the Report Wizard (the same one used in BIDS to create *.rdlc files) after you first open the project. The wizard steps you through selecting a data source, choosing a report type (tabular or matrix), defining a layout, and formatting the report. Also, the SSRS expression editor (with IntelliSense) is included. Local reports created in Visual Studio 2008 can include expressions written in Visual Basic .NET only. At runtime,
PDF compression is added, so exporting to PDF format automatically compresses the report. Using the ReportViewer control in a custom application adds two namespace references to your project: Microsoft.ReportViewer.Common and Microsoft.ReportViewer.WinForms (or Microsoft.ReportViewer.WebForms for Web applications). Because you use the ReportViewer control in local mode with a Windows Forms application in scenarios where you want to design the report at the same time you're creating the form, we see the ReportViewer control in local mode being used more often in OLTP reporting than in BI reporting. We believe that most BI developers will create their BI reports first in BIDS and then use the ReportViewer control in a custom application to provide access (using remote processing mode) to that report. For this example, you'll create a simple Windows Forms application to display a sample SSAS cube report. After you open Visual Studio 2008 and start to design a Windows Forms application by clicking File, New Project, C# (or Visual Basic .NET), and then Windows Forms Application, drag an instance of the MicrosoftReportViewer control from the Reporting section of the Toolbox onto the form designer surface. The Toolbox, ReportViewer control, and some of the control's properties are shown in Figure 22-4.
Figure 22-4 ReportViewer control for a Windows Forms application
After you drag the ReportViewer onto the form’s designer surface, you’ll see that a smart tag pops up at the upper right side of the control. This smart tag allows you to enter the URL to an existing report (remote processing mode) or to design a new report (local processing mode) by clicking the Design A New Report link on the smart tag. As mentioned, if you use local processing mode, no license for SQL Server Reporting Services is needed and all
processing is done locally (that is, on the client). Unfortunately, this mode does not support SSAS cubes as a data source. It does, of course, support using SQL Server data (and other types of relational data) as data sources. There are other significant limitations when using the local processing mode, including the following:
■■ Report parameters defined in the *.rdlc file do not map to query parameters automatically. You must write code to associate the two.
■■ *.rdlc files do not include embedded data source (connection) or query information. You must write that code.
■■ Browser-based printing via the RSClientPrint ActiveX control is not part of client-run reports.
Tip You can connect the ReportViewer control to an Object data source programmatically. In this way, you can connect to any BI object. A well-written example of this technique, including code samples, is found in Darren Herbold's blog at http://pragmaticworks.com/community/blogs/darrenherbold/archive/2007/10/21/usingthewinformreportviewercontrolwithanobjectdatasource.aspx. Because you'll most likely use only the remote processing mode to display reports built on your SSAS cubes and mining structures, your considerations when using the Windows Forms ReportViewer control will be the following settings (configured using the smart tag associated with the ReportViewer control, named ReportViewer Tasks, as shown in Figure 22-5 and described here):
■■ Choose Report Here you either select <Server Report> for remote processing mode or leave this value blank for local processing mode. Selecting <Server Report> changes the values in the smart tag to those in the remainder of this list. If you leave this value blank and then click the Design A New Report link, the included report designer in Visual Studio opens a blank *.rdlc designer surface.
■■ Report Server Url This string is configured in the smart tag on the ReportViewer control and is in the form of http://localhost/reportserver.
■■ Report Path This string is configured in the smart tag on the ReportViewer control and is in the form of /report folder/report name—for example, /AdventureWorks Sample Reports/Company Sales. Be sure to start the path string with a forward slash character (/). Also, you cannot include report parameters in this string.
■■ Dock In Parent Container This is an optional switch available via a linked string in the smart tag for the control. It causes the ReportViewer control to expand to fill its current container (the form in this example).
Figure 22-5 ReportViewer Tasks settings
Figure 22-6 shows the ReportViewer control used in a Windows Forms application, displaying a simple sample report. We’ve chosen only to return the Calendar Year labels and Internet Sales Amount totals from the Adventure Works DW OLTP sample database in this sample report. By setting the optional Dock In Parent Container option to True, the report surface fills the entire Windows Forms display area.
Figure 22-6 Rendered report using the ReportViewer control
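The same settings that the smart tag exposes can also be set in code, which is handy when the report server URL or report path is determined at run time. The following is a minimal sketch for remote processing mode; it assumes a form that already contains a ReportViewer control named reportViewer1, and the server URL and report path are placeholders.

using System;
using System.Windows.Forms;
using Microsoft.Reporting.WinForms;

public partial class ReportForm : Form
{
    // Assumes reportViewer1 was added to the form in the designer.
    private void ConfigureViewer()
    {
        reportViewer1.ProcessingMode = ProcessingMode.Remote;
        reportViewer1.ServerReport.ReportServerUrl = new Uri("http://localhost/reportserver");
        reportViewer1.ServerReport.ReportPath = "/AdventureWorks Sample Reports/Company Sales";
        reportViewer1.Dock = DockStyle.Fill;   // code equivalent of Dock In Parent Container
        reportViewer1.RefreshReport();
    }
}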
About Report Parameters If you’re connecting to a report in remote processing mode and the report expects parameter values, the ReportViewer control header area can provide a UI automatically for entering or selecting the particular parameter value. If the ShowParameterPrompts property is set to True, the prompt is displayed in the top area of the control. You have the option of setting
the ShowParameterPrompts property to False and handling the parameter entry yourself. To do this, you must provide a parameter input area in the form or Web page. That is, you must add some type of control—such as a text box, drop-down list, and so on—to display the parameter input values and allow the end users to select them. You can then pass the value from the control as a parameter to the report by using the SetParameters method that the ReportViewer control exposes through the ServerReport class. You can also use this technique to set SSRS parameters that are marked as hidden (in the Report Parameters dialog box). You can see a code sample at the following blog entry: http://pragmaticworks.com/community/blogs/darrenherbold/archive/2007/11/03/usingthereportviewercontrolinawebformwithparameters.aspx. The only way to supply parameters when using local processing mode is programmatically. You use the same method just described: add the appropriate controls to the form to allow the user to select the parameter values. The difference is that you use the SetParameters method of the LocalReport class to apply them to the report. Tip As an alternative to linking to a report hosted on the SSRS server for which you've added parameters at design time, you can programmatically populate parameters. The following blog entry has a well-written description of this technique: http://blogs.msdn.com/azazr/archive/2008/08/15/parameterizetheolapreportsinclientapplications.aspx.
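As a sketch of the remote (server report) case, the following assumes a ReportViewer control named reportViewer1 and a report that defines a parameter named CalendarYear; both the parameter name and the value passed in are illustrative.

using Microsoft.Reporting.WinForms;

public partial class ReportForm : System.Windows.Forms.Form
{
    private void ApplyYearParameter(string year)
    {
        // Hide the built-in prompt area and supply the value from your own UI control.
        reportViewer1.ShowParameterPrompts = false;
        reportViewer1.ServerReport.SetParameters(
            new ReportParameter[] { new ReportParameter("CalendarYear", year) });
        reportViewer1.RefreshReport();
    }
}

For local processing mode, the call pattern is the same except that you use the SetParameters method on reportViewer1.LocalReport instead.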
About Security Credentials The report credentials being passed through the ReportViewer control vary depending on how the application developer has configured the application and the type of application (that is, Windows Forms or Web Forms). For all types of credentials, remember to verify that the credentials being used in the application have access to the desired report on the SQL report server and have appropriate permissions in the database the information is being retrieved from, whether it's a relational source or an Analysis Services database. If you're working with Web Forms, the default authentication type is Windows Integrated Security. Remember from Chapter 5, "Logical OLAP Design Concepts for Architects," that there are also limits on how many times a user security token can be passed to other computers. If you need to support a custom authentication provider other than Windows, you'll need to support those requirements programmatically. Remember to consider both authentication and authorization strategies related to the type of authentication provider, such as providing a logon page for the user to enter credentials, and so on. Note that the ReportViewer control does not provide pages for prompted credentials. If your application connects to a report server that uses custom authentication (that is, one that is forms based), you must, as mentioned, create the logon page for your application.
If you’re implementing the ReportViewer control for an ASP.NET application in remote processing mode, you might want to configure connection information in the web.config file for the application. Specifically, you can optionally configure a ReportViewerServerConnection key value to store connection information in the event you’ve implemented your Web application with session state storage turned off. For more information, see the SQL Server Books Online topic “Web.config Settings for ReportViewer” at http://msdn.microsoft.com/enus/ library/ms251661.aspx. Note Projects using the Visual Studio 2005 ReportViewer control are not automatically upgraded. You must manually change references in your project to use the new control.
We have encountered business requirements that necessitate that we code directly against the SSRS API. We’ll talk about the why’s and how’s of that scenario next.
About the SOAP API If you're a developer, you might be saying to yourself, "Finally, I get to write some code here." As with most other aspects of BI solutions, there is a good reason that we've placed this section near the end of the chapter. We've found many overengineered solutions among our clients. We again caution that writing code should have a solid business justification because it adds time, cost, and complexity to your reporting solution. That being said, what are some of the most compelling business scenarios in which to do this? We'll use some of our real-life examples to explain. Although URL access or the ReportViewer control has worked well for our small-to-midsized clients, we've had enterprise clients for which these solutions proved to be feature deficient. Specifically, we've implemented direct calls to the SSRS Web service API for the following reasons:
■■ Custom security implementation Most often, this situation includes very specific access logging requirements.
■■ A large number of custom subscriptions Sometimes this situation includes advanced property configurations, such as snapshot schedules, caching, and so on.
■■ A large number of reports being used in the environment We've achieved more efficient administration at large scales by coding directly and implementing the client's specific requirements—for example, report execution schedules.
■■ Complex custom rendering scenarios This last situation requires quite a bit of custom coding such as overriding the Render method. Although this solution is powerful because it gives you complete control, it's also labor intensive because you lose all the built-in viewing functionality, such as the report toolbar.
If you’re planning to work directly with the API, you’ll be working with one of two categories (management or execution) of Web service endpoints. The management functionality is exposed through the ReportService2005 and ReportService2006 endpoints. The ReportService2005 endpoint is used for managing a report server that is configured in native mode, and the ReportService2006 endpoint is used for managing a report server that is configured for SharePoint integrated mode. (We’ll discuss SharePoint integrated mode in Chapter 25.) The execution functionality is exposed through the ReportExecution2005 endpoint, and it’s used when the report server is configured in native or SharePoint integrated mode. As with all Web service development, you must know how to access the service, what operations the service supports, what parameters the service expects, and what the service returns. SSRS provides you with a Web Service Description Language (WSDL) file, which provides this information in an XML format. If you prefer to consult the documentation first, you can read about the publicly exposed methods in the SQL Server Books Online topic “Report Server Web Service Methods” at http://msdn.microsoft.com/enus/library/ms155071.aspx. As with other custom clients, you must consider security requirements when implementing your solution. The API supports Windows or basic credentials by default. The syntax for passing Windows credentials via an SSRS proxy is as follows: ReportingService rs = new ReportingService(); rs.Credentials = System.Net.CredentialCache.DefaultCredentials;
To pass basic credentials, you use this syntax:
ReportingService rs = new ReportingService();
rs.Credentials = new System.Net.NetworkCredential("username", "password", "domain");
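Once credentials are assigned, calls against the management endpoint are ordinary Web service calls. The short sketch below assumes you've added a Web reference to the ReportService2005 endpoint (so the generated proxy class is named ReportingService2005) and simply lists the items in a folder; the folder path is illustrative.

using System;

class ListCatalogItems
{
    static void Main()
    {
        // Proxy class generated from http://<server>/reportserver/ReportService2005.asmx
        ReportingService2005 rs = new ReportingService2005();
        rs.Credentials = System.Net.CredentialCache.DefaultCredentials;

        // List everything under a folder; true = recurse into subfolders.
        CatalogItem[] items = rs.ListChildren("/AdventureWorks Sample Reports", true);
        foreach (CatalogItem item in items)
        {
            Console.WriteLine("{0} ({1})", item.Path, item.Type);
        }
    }
}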
You could, of course, also implement custom security by writing a custom authentication extension for SSRS. For an example, see the SQL Server Books Online topic "Implementing a Security Extension" at http://msdn.microsoft.com/en-us/library/ms155029.aspx. An additional security consideration is the ability to require SSL connections for selected Web methods. You can configure these requirements by setting the appropriate value (0 through 3) in the SecureConnectionLevel setting in the RSReportServer.config file. As you increase the value of this setting, more Web methods are required to be called over a secure connection. For example, you might set the value to 0, like this:
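<!-- Representative RSReportServer.config entry; the exact element form may vary slightly between SSRS versions. -->
<Add Key="SecureConnectionLevel" Value="0"/>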
This means that all Web service methods can be called from non-SSL connections. On the other hand, setting SecureConnectionLevel to 3 means that all Web service methods must be invoked over an SSL connection. There is an interesting new item in the SQL Server 2008 SSRS ReportExecution2005 Web service that corresponds to the estimated page count that we saw earlier in the URL access
section. In SSRS 2005, there was a class named ExecutionInfo that contained a NumPages property, which could be used to retrieve the actual page count. In SSRS 2008, the ExecutionInfo class has been extended as ExecutionInfo2. The new class has an additional property, PageCountMode, which can be set to Actual or Estimate to control the estimating behavior. The Web service now includes several extended classes and methods to support the new functionality in SSRS 2008, including ExecutionInfo2, LoadReport2, ResetExecution2, Render2, and more.
What Happened to Report Models? SSRS 2005 introduced the ability to create a semantic layer between your database query and report. This was expressed as the report model. You can create report models by using a template in BIDS (for relational sources), or you can create them automatically by clicking a button on the data source configuration Web page in the default Web site (for OLAP sources). Although this functionality is still included in SSRS 2008, we’ll be using report models less often in SSRS 2008. This is because the original reason to create these semantic models was that they were required to be used as sources for designing reports using the Report Builder tool. In the RTM release of Report Builder 2.0 for SSRS 2008, these semantic models were no longer required as a source for report creation. Report Builder 2.0 allows you to use report models or a direct connection to your data (whether relational or OLAP) as a source for reports. Also, using Report Builder 2.0 you can now create .rdl-based reports using a drag-and-drop interface that is similar to the report design interface found in BIDS for SSRS. Figure 22-7 shows a report model built from the OLTP AdventureWorksLT database.
Figure 22-7 Report Model tab in BIDS
After you deploy a report model to the report server, you can configure associated properties in the default Web site. New to SSRS 2008 is the ability to configure clickthrough reports, item-level security, or both in the default Web site. This is shown in Figure 22-8.
Figure 22-8 Report model clickthrough permissions in BIDS
After you deploy your semantic model, you can use it as a basis for building a report using the SSRS old-style Report Builder interface, shown in Figure 22-9, or you can use it as a source in the new Report Builder 2.0.
Figure 22-9 Report Builder 1.0 interface using a report model as a data source
To better understand how Report Designer, Report Builder, and report models work together in SSRS 2008, we suggest you read Brian Welcker's blog at http://blogs.msdn.com/bwelcker/archive/2007/12/11/transmissionsfromthesatelliteheartwhatsupwithreportbuilder.aspx. Figure 22-10 provides a conceptual view of the SSRS report-creation tools.
Figure 22-10 Conceptual view of SSRS report-creation tools
The key piece of information for BI is that semantic models are no longer required to build reports using an SSAS database as a data source with the Report Builder 2008 tool. Report models continue to be supported, but they’re not required.
Deployment—Scalability and Security For some BI scenarios, you have enough information at this point to proceed with your implementation. However, SSRS is designed for scalability, and we want to include some information on those capabilities because we've worked with clients whose business requirements led us to implement some of the included scalability features. The scale-out feature (Web farms) that we present here requires the Enterprise edition of SSRS. For a complete list of features by edition, go to http://msdn.microsoft.com/en-us/library/cc645993.aspx (reporting section).
Performance and Scalability As the load increases on your SSRS instance because of larger reports or more end users accessing those reports, you might want to take advantage of one or more of the scaling features available in SSRS. These include caching, using snapshots, and scaling out the SSRS server itself. The simplest way to configure caching or snapshots is on a per-report basis using the management interface in the Report Manager Web site. You might also want to use some of the included Windows Performance counters for SSRS during the pilot phase of your project to test your server instance using production levels of load. These counters are detailed in the SQL Server Books Online topics "Performance Counters for the MSRS 2008 Web Service Performance Object" (http://technet.microsoft.com/en-us/library/ms159650.aspx) and "Performance Counters for the MSRS 2008 Windows Service Performance Object" (http://technet.microsoft.com/en-us/library/ms157314.aspx). You can easily configure snapshot and timeout settings globally using the SSRS Site Settings page (General section) as shown in Figure 22-11.
Figure 22-11 SSRS Site Settings dialog box
You can use the Report Manager Web site to configure caching or snapshots for an individual report. Select the report you want to configure, and then click the Properties tab and choose the Execution section. Here you can also configure an execution timeout value for potentially
long-running reports as shown in Figure 22-12. When you choose to schedule report execution to reduce load, it’s a best practice to include the built-in field value ExecutionTime on the report so that your end users can be aware of potential data latency.
Figure 22-12 SSRS report execution settings
We do caution that if you find yourself needing to use the execution time-out value for a report using an SSAS database as a data source, you might want to reevaluate the quality of the MDX query. Be sure to include only the data that is actually needed in the result set. Tracing with Profiler can help you to evaluate the query itself. Another optimization technique is to add aggregations to the cube at the particular dimension levels that are referenced in the query. This is a scenario where you can choose to use the Advanced view in the aggregation designer to create aggregations at specific intersections in the cube. We discussed this technique in Chapter 9, “Processing Cubes and Dimensions.” New in SQL Server 2008 SSRS is the ability to configure a memory threshold for report processing. In previous releases, the report server used all available memory. In this release, you can configure a maximum limit on memory as well as interim thresholds that determine how the report server responds to changes in memory pressure. You do this by making changes to the default settings in the RSReportServer.config file. To help you understand these settings, we’ll first cover some background and then detail the process to make changes.
Advanced Memory Management Because of some significant changes in SSRS architecture, you now have the ability to implement more fine-grained control over memory usage in SSRS. To show you how this works, we'll first review how memory was allocated in SSRS 2005 and then compare that with the changes in the SSRS 2008 memory architecture. Also, we caution that making any changes to the default configuration should be documented and tested with production levels of load. To start, we'll review some challenges that occurred in SSRS 2005 because of the memory-management architecture. Generally, reports were memory bound, which limited scalability, and large reports run interactively sometimes caused memory exceptions. Larger reports could also starve smaller reports of resources. Also, page-to-page navigation could result in increasing response times as the number of pages increased. The object model computed calculations before storing reports in an intermediate format. This sometimes resulted in problems such as inconsistent rendering results when paging forward and backward, or inconsistent pagination across rendering outputs. The problem was that all data had to be processed before rendering could start. Each rendering extension read from a report object model and did its own pagination. The output of the rendering extensions was consumed either as HTML by the WebForms control or as a serialized or image format by the WinForms control or the Print control. In SSRS 2008, several architectural changes were implemented to resolve these issues. The first change is that the grouping of data has been pulled out of the data regions, which results in consistent grouping. As mentioned, all data region types have been replaced with the Tablix type. The chart data region has been kept separate because of the additional properties required for visualization. The second change is that the results of processing the data regions are stored in intermediate format before the calculations are executed. This means those calculations can be done on the fly because the raw data is always available. The third change is that the rendering object model is invoked by a rendering extension for a specific page. In other words, expressions have to be calculated only for the particular requested page. This is an iterative process based on the grouping or groupings in scope on the page being viewed. The report object model is abstracted into three module types:
■■ Soft page layout Interactive rendering in which there is no concept of a page
■■ Data Outputs data directly
■■ Hard page layout PDF and image files always have the same pagination
The fourth change is that some rendering to the client (HTML or image) is offloaded. This was changed because in previous versions where the rendering was done on the server, that rendering was at the resolution of the server rather than of the client. If the resolution was
higher on the client, inconsistent results were produced. This change also improves performance by offloading work from the SSRS server. Memory management allows larger reports to be successfully (but more slowly) processed—in previous versions, those reports would sometimes consume all the available memory on the computer and fail. The goal of manually tuning memory is to reduce out-of-memory exceptions. To tune memory manually, you make changes to the SSRS RSReportServer.config configuration file. Microsoft recommends changing the configuration settings only if reports are timing out before processing completes. Note that, by default, only MemorySafetyMargin and MemoryThreshold are configured. You must manually add the other settings to the configuration file. Table 22-1 summarizes the configuration values and the areas of SSRS memory that they affect.
Table 22-1 SSRS Memory Configuration Settings
WorkingSetMaximum: This value (expressed in kilobytes) controls the maximum amount of memory the report server can use. By default, it is set to the amount of available memory on the computer.
MemoryThreshold: This value (expressed as a percentage of WorkingSetMaximum) defines the boundary between a medium and high memory pressure scenario. By default, it's set to 90.
MemorySafetyMargin: This value (expressed as a percentage of WorkingSetMaximum) defines the boundary between a low and medium memory pressure scenario. By default, it's set to 80.
WorkingSetMinimum: This value (expressed in kilobytes) controls the minimum amount of memory the report server keeps reserved. By default, it is set to 60 percent of WorkingSetMaximum.
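To make the relationship between these settings concrete, the fragment below shows what a manually tuned set of entries in RSReportServer.config might look like. The values are purely illustrative (roughly a 4-GB working set), and remember that only MemorySafetyMargin and MemoryThreshold are present in a default installation.

<!-- Illustrative values only. WorkingSetMaximum and WorkingSetMinimum must be
     added manually and are expressed in kilobytes. -->
<MemorySafetyMargin>80</MemorySafetyMargin>
<MemoryThreshold>90</MemoryThreshold>
<WorkingSetMaximum>4194304</WorkingSetMaximum>
<WorkingSetMinimum>2516582</WorkingSetMinimum>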
There are also several new performance counters available for monitoring service activity. ASP.NET performance counters no longer detect report server events or activity. In addition to tuning the SSRS instance as described in this section so far, you might also elect to run SSRS on more than one physical machine. This is called scaling out, and we’ll talk about this technique in the next section.
Scaling Out The Enterprise edition of SSRS supports scaling out—that is, using more than one physical machine to support the particular SSRS solution that runs from a common database. To implement a scaled-out solution, you use the Reporting Services Configuration Manager tool (Scale-Out Deployment section). This is also called a Web farm. SSRS is not a cluster-aware application; this means that you can use network load balancing (NLB) as part of your
scale-out deployment. For more information, see the SQL Server Books Online topic "How to: Configure a Report Server Scale-Out Deployment (Reporting Services Configuration)" at http://msdn.microsoft.com/en-us/library/ms159114.aspx. You must also manage the encryption key across instances by backing up the generated encryption key for each instance using either the Reporting Services Configuration Manager or the rskeymgmt.exe command-line utility included for scriptable key management. Figure 22-13 shows the Scale-Out Deployment interface.
Figure 22-13 SSRS Scale-out deployment
A typical scaled-out SSRS implementation includes multiple physical servers. Some of these servers distribute the front-end report rendering via a network load balancing type of scenario. You can also add more physical servers to perform snapshots or caching in enterprise-sized implementations. For more implementation strategy details on scale-out deployments for SSRS, see the following post: http://sqlcat.com/technicalnotes/archive/2008/10/21/reportingservicesscaleoutdeploymentbestpractices.aspx.
Administrative Scripting SSRS includes a scripting interface (rs.exe) that allows administrators to execute scripts written in Visual Basic .NET from the command line as an alternative to using the management pages in the Reports Web site. This tool is not supported when you’re using SharePoint integrated mode with SSRS.
This tool includes many switches that allow you to further configure the script's execution. In addition to built-in support for accessing many of the administrative Web service methods, such as report deployment or management, there are also additional scripts available on CodePlex. These scripts provide examples of automating other routine maintenance tasks, such as managing scheduled jobs and setting report server–level properties. For example, there are some custom scripts in CodePlex at http://www.codeplex.com/MSFTRSProdSamples/Wiki/View.aspx?title=SS2008%21Script%20Samples%20%28Reporting%20Services%29&referringTitle=Home. These include sample scripts that allow you to programmatically add report item security, manage running SSRS jobs, and more. In addition to using rs.exe, you can also create and execute administrative scripts using the WMI provider for SSRS.
Using WMI The SSRS Windows Management Instrumentation (WMI) provider supports WMI operations that enable you to write scripts and code to modify settings of the report server and Report Manager. These settings are contained in XML-based configuration files. Using WMI can be a much more efficient way to make updates to these files, rather than manually editing the XML. For example, if you want to change whether integrated security is used when the report server connects to the report server database, you create an instance of the MSReportServer_ConfigurationSetting class and use the DatabaseIntegratedSecurity property of the report server instance. The classes shown in the following list represent Reporting Services components. The classes are defined in either the root\Microsoft\SqlServer\ReportServer\<InstanceName>\v10 or the root\Microsoft\SqlServer\ReportServer\<InstanceName>\v10\Admin namespace. Each of the classes supports read and write operations. Create operations are not supported.
MSReportServer_Instance class Provides basic information required for a client to connect to an installed report server.
■■
MSReportServer_ConfigurationSetting class Represents the installation and run-time parameters of a report server instance. These parameters are stored in the configuration file for the report server.
As with writing scripts and executing them using the rs.exe utility, you can also use the SSRS WMI provider to automate a number of administrative tasks, such as reviewing or modifying SSRS instance properties, listing current configuration settings, and so on. The ability to make these types of configuration changes programmatically is particularly valuable if you need to apply the same settings across a scaled-out farm of SSRS servers or to make sure that multiple environments are configured the same way. The Reporting Services Configuration Manager and rsconfig.exe utility use the WMI provider.
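As an illustration (a sketch, not code from the book), the following Visual Basic .NET console program uses the System.Management classes to connect to the SSRS WMI provider and list the current configuration settings of a report server instance. The RS_MSSQLSERVER namespace segment is typical for a default SQL Server 2008 instance; substitute the segment for your own instance, and add a reference to System.Management.dll.

Imports System.Management

Module ListReportServerConfig
    Sub Main()
        ' Connect to the SSRS WMI provider's Admin namespace for the target instance.
        Dim scope As New ManagementScope( _
            "\\.\root\Microsoft\SqlServer\ReportServer\RS_MSSQLSERVER\v10\Admin")
        Dim query As New ObjectQuery("SELECT * FROM MSReportServer_ConfigurationSetting")

        Using searcher As New ManagementObjectSearcher(scope, query)
            For Each configSetting As ManagementBaseObject In searcher.Get()
                ' Dump every property of the configuration class to the console.
                For Each prop As PropertyData In configSetting.Properties
                    Console.WriteLine("{0} = {1}", prop.Name, prop.Value)
                Next
            Next
        End Using
    End Sub
End Module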
Note In SQL Server 2008 SSRS, there are a few changes to the WMI API. These include changing the WMI namespace to \root\Microsoft\SqlServer\ReportServer\<InstanceName>\v10, adding properties that let you retrieve information about the SSRS edition and version, and removing some classes (such as MSReportManager_ConfigurationSetting and MSReportServer_ConfigurationSettingForSharePoint). For more information about the SSRS WMI provider, see the SQL Server Books Online entry at http://msdn2.microsoft.com/en-us/library/ms152836(SQL.100).aspx.
Summary

In this chapter, we looked at some advanced topics related to SSRS in SQL Server 2008. These included adding custom .NET code to an SSRS report to improve performance for computationally intensive processes. We then looked at the new Word and improved Excel rendering capabilities. Next, we examined creating custom applications using the embeddable report controls available for both Windows Forms and Web Forms applications in .NET. We concluded the chapter by discussing scalability, availability, and advanced memory management related to the SSRS implementation for BI projects.
Chapter 23

Using Microsoft Excel 2007 as an OLAP Cube Client

In this chapter, we look at the ins and outs of using Microsoft Office Excel 2007 as a user client for Microsoft SQL Server 2008 Analysis Services OLAP cubes. We'll take a look at the functionality of the updated design of the PivotTable interface in Excel 2007. Although you can use Excel 2003 as a client for OLAP cubes, we won't cover that functionality here. We'll start by reviewing the installation process.
Using the Data Connection Wizard

As we introduced in Chapter 2, "Visualizing Business Intelligence Results," you'll need a sample cube to work with to understand what you can and can't do using an Excel 2007 PivotTable as a client interface to SQL Server 2008 SSAS OLAP cubes. We'll use the Adventure Works DW 2008 sample cube as the basis for our discussion in this chapter. In Chapter 2, we detailed how and where to download and set up the sample OLAP cube. Remember that you need to retrieve the sample files from www.CodePlex.com. If you haven't done so already, set up this sample prior to continuing on in this chapter (if you'd like to follow along by using Excel 2007).

We'll use Excel 2007 as a sample client for the duration of this chapter. You can use Excel 2003 as a client for SSAS 2008 OLAP cubes, but it does have a slightly different interface and somewhat reduced functionality. For more detail on exactly what OLAP cube features are supported when using Excel 2003, see SQL Server Books Online.

Excel 2007, like the entire 2007 Microsoft Office system, includes redesigned menus, which are part of the Ribbon. To start, we'll take a look at making the connection from Excel 2007 to our sample SSAS 2008 OLAP cube. To do this we'll use the Data tab on the Ribbon (shown in Figure 23-1). Note the two groups on this tab that relate to connection management: Get External Data and Connections.
Figure 23-1 The Data tab on the Excel 2007 Ribbon
To demonstrate the functionality of the Data tab on the Ribbon, we’ll take you through an example. To make a connection to the AdventureWorks sample cube, click From Other Sources in the Get External Data group. In the drop-down list that opens, click From Analysis Services. The Data Connection Wizard opens, as shown in Figure 23-2.
Figure 23-2 The Connect To Database Server page of the Data Connection Wizard
After typing the name of the SSAS instance to which you want to connect (localhost if you are following along with the example), enter the login credentials that are appropriate for your scenario. Remember that by default, only local administrators have permission to read OLAP cube data. If you are configuring nonadministrative access, you first have to use SSMS or BIDS to configure Windows role-based security (the preferred method of connecting). Next, log in as the Windows user for whom you are creating the Excel-based connection.

On the next page of the Data Connection Wizard, shown in Figure 23-3, you are asked to select the OLAP database name and then the structure to which you want to connect. You can select only one object. Both regular cubes and Analysis Services perspectives are supported as valid connection objects. You'll recall from Chapter 8, "Refining Cubes and Dimensions," that an SSAS OLAP cube perspective is a named, defined subset of an existing OLAP cube. Perspectives are often used to provide end users with simplified views of enterprise cubes. You'll continue to use the AdventureWorks cube for this example.

On the last page of the Data Connection Wizard, shown in Figure 23-4, you can configure additional properties, such as naming the connection. Notice the optional setting, which is cleared by default, called Always Attempt To Use This File To Refresh Data. Remember that Excel 2007 does not refresh the data retrieved from the OLAP cube automatically. Later, as we take a look at the PivotTable settings, we'll review how you can change this default to perform refreshes on demand.
Figure 23-3 The Select Database And Table page of the Data Connection Wizard

Figure 23-4 The Save Data Connection Files And Finish page of the Data Connection Wizard
You need to balance the overhead of refreshing data (network traffic and query execution on Analysis Services) against your users' need for up-to-date data. Base your refresh configuration on business requirements, and document the refresh rate for other administrators and end users.
Working with the Import Data Dialog Box

After you click Finish on the last page of the Data Connection Wizard, Excel opens the Import Data dialog box (shown in Figure 23-5), which determines how and where the PivotTable will be placed in the workbook. Here you can select the view for the incoming data. Your choices are PivotTable Report, PivotChart And PivotTable Report (which includes a PivotTable), or Only Create Connection (which doesn't create a PivotTable). Also, you'll specify whether you want to put the data on the existing, open worksheet page or on a new worksheet.
Figure 23-5 The Import Data dialog box
You could just click OK at this point and begin laying out your PivotTable. Alternatively, you can click the Properties button to configure advanced connection information. If you do that, a dialog box with two tabs opens. The Usage tab (shown in Figure 23-6) allows you to configure many properties associated with this connection. As mentioned previously, an Excel PivotTable is set to never refresh data by default. To change this setting, either enable timed refresh by selecting Refresh Every and then setting a value in minutes, or select Refresh Data When Opening The File.

We call your attention to one other important setting in this dialog box: the OLAP Drill Through setting, which defaults to a maximum of 1,000 records. If your business requirements are such that you need to increase this number substantially, we recommend that you test this under production load. Drillthrough is a memory-intensive operation that requires adequate resources on both the client and the server.

After you've finished configuring any values you want to change in the Connection Properties dialog box, click OK to return to the Import Data dialog box. Click OK in that dialog box. Excel will create a couple of new items to help you as you design the PivotTable.
Figure 23-6 The Usage tab of the Connection Properties dialog box
Understanding the PivotTable Interface

The Excel Ribbon gives you quick access to the most commonly performed tasks when working with a PivotTable—we'll be exploring its functionality in more detail shortly. After you've completed your connection to SSAS, the Ribbon displays the Options and Design tabs under PivotTable Tools. You'll generally use the Options tab (shown in Figure 23-7) first, so we'll start there.
Figure 23-7 The PivotTable Tools Options tab
As we continue through the new interface, you’ll notice two additional sections. The first is a workspace that is highlighted on the Excel worksheet page itself, as shown in Figure 23-8. The point of this redesign, as we mentioned in Chapter 2, is to make the process of working with a PivotTable more intuitive for minimally trained end users.
Figure 23-8 The work area for creating a PivotTable on a worksheet
Another important component of the PivotTable workspace is the redesigned PivotTable Field List. This list now has four possible display configurations. Figure 23-9 shows the default layout (Field Section And Areas Section Stacked). The button in the upper right allows you to switch to the most natural layout for you. The first section, labeled Show Fields Related To, allows you to filter measures, dimensions, and other objects (such as KPIs) by their association to OLAP cube measure groups. The second section, below the filter, allows you to add items to the PivotTable surface by selecting them. The items are ordered as follows:

■ Measures  Shown alphabetized in order of measure groups.

■ KPIs  Shown in associated folders, then by KPI. You can select individual aspects of KPIs to be displayed—either value, status, trend, or goal—rather than the entire KPI.

■ Dimensions  Shown alphabetized. Within each dimension you can select defined hierarchies, individual members of hierarchies, or individual levels to be displayed. You can also select named sets to be displayed.
Figure 23-9 The PivotTable Field List
Click any item in the PivotTable Field List and it is added to the PivotTable workspace and to one of the areas at the bottom of the PivotTable Field List. Measures and KPIs are automatically added to the Values section of the latter. Non-measures, such as dimensional hierarchies, are added to either the Row Labels or Column Labels section of this list and are placed by default on the rows axis of the PivotTable. To pivot, or make a change to the PivotTable that you've designed, simply click and drag the item you want to move from one axis to another on the designer surface. For example, you can drag a dimensional hierarchy from the rows axis to the columns axis, or you can drag it from the Row Labels section of the PivotTable Field List to the Column Labels section. If you want to use a non-measure item as a report filter, you drag that item to the Report Filter section of the field list. A filter icon then appears next to the item in the list of items. You can also remove values from the PivotTable by dragging them out of any of the sections and dropping them back on the list of fields.
Creating a Sample PivotTable

Now that we've explored the interface, let's work with a sample in a bit more detail. To do this we've first set the PivotTable Field List filter to Internet Sales (measure group). This gives us a more manageable list of items to work with. We'll look at two measures, Internet Sales Amount and Internet Order Quantity, by selecting them. Selecting these measures adds both of them to our PivotTable display (work) area. It also adds both measures to the Values area. Next we'll put some dimensional information on both rows and columns, adding the Customer Geography hierarchy to the rows axis and the Date.Calendar hierarchy to the columns axis as shown in Figure 23-10. Of course, we can also add a filter. We'll do that next.
Figure 23-10 The PivotTable Field List configured
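Behind the scenes, Excel retrieves this data by sending MDX queries to the SSAS instance. As a rough, hand-written approximation (not the MDX Excel actually generates, which is considerably more involved), the layout we just configured corresponds to a query along these lines against the Adventure Works sample cube:

SELECT
    [Date].[Calendar].[Calendar Year].Members *
    { [Measures].[Internet Sales Amount], [Measures].[Internet Order Quantity] } ON COLUMNS,
    NON EMPTY [Customer].[Customer Geography].[Country].Members ON ROWS
FROM [Adventure Works]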
Remember that if you want to filter one or more members of either dimension that we’ve already added, you simply open the dimension listing either from the field list or from the PivotTable itself and then clear the members that you want to remove from the display. After you implement a filter, the drop-down triangle icon is replaced by a filter icon to provide a visual indicator that the particular axis has been filtered. For our example we’ve removed
the Canada member from the Customer Geography dimension by clearing the check box next to that value in the PivotTable Field List for the Customer Geography hierarchy. Now our PivotTable looks a bit more interesting, as shown in Figure 23-11. We show two measures and two dimensions, with the Customer Geography dimension being filtered. Be aware that your end users can perform permitted actions, such as drillthrough, by simply right-clicking the cell containing data of interest. Excel 2007 supports not only drillthrough, but also additional actions defined on the OLAP cube. Remember that OLAP cube actions can be of several types, including regular, reporting, and drillthrough. Regular actions target particular areas of the cube, such as dimensions, levels, and so on, and produce different types of output, such as URLs and datasets.
Figure 23-11 A simple PivotTable
Adding a filter from another dimension is as simple as selecting the dimension of interest and then dragging that item to the Report Filter section of the field list. As with other SSAS client interfaces, this filter will appear at the top left of the PivotTable, as it does in the BIDS cube browser. This filter allows you to "slice" the OLAP cube as needed. Now that we've created a sample PivotTable, you can see that more buttons are active on the PivotTable Tools Options and Design tabs. Using the Design tab (shown in Figure 23-12), you can now format the PivotTable. You can apply predefined design styles and show or hide subtotals, grand totals, empty cells, and more. Which items are enabled on the Ribbon depends on where the current focus is. For example, if a measure is selected, most of the grouping options are disabled.
Figure 23-12 The PivotTable Tools Design tab
You might want to add a PivotChart to your workbook as well. Doing so is simple. You use the PivotTable Tools Options tab to add a PivotChart. First, click any cell in the existing PivotTable and then click the PivotChart button. This opens the Insert Chart dialog box (shown in Figure 23-13) from which you select the chart type and style.
Figure 23-13 The Insert Chart dialog box
As with your PivotTable, the resultant PivotChart includes a redesigned PivotChart Filter Pane. You use this to view and manipulate the values from the cube that you’ve chosen to include in your chart. An example of both a PivotChart and the new filter pane is shown in Figure 23-14.
Figure 23-14 An Excel PivotChart based on an OLAP cube
After you’ve created a PivotChart, Excel adds a new set of tabs on the Ribbon, under PivotChart Tools. These four tabs help you work with your PivotChart: Design (shown in Figure 23-15), Layout, Format, and Analyze.
Figure 23-15 The PivotChart Tools Design tab
Offline OLAP

In all of our examples so far, we've been connected to an SSAS instance and we've retrieved data from a particular OLAP cube via query execution by Excel. In some situations, the end user might prefer to use Excel as a client for locally stored data that has been originally sourced from an OLAP cube. This is particularly useful in scenarios where the user may be at a remote location or travels extensively and thus cannot always have direct access to the Analysis Services database. Excel includes a wizard that allows authorized end users to save a local copy of the data retrieved from an OLAP cube.

To use this functionality, click the OLAP Tools button on the PivotTable Tools Options tab, and then click Offline OLAP. The Offline OLAP Settings dialog box opens and lets you choose to work online or offline. On-Line OLAP is selected by default, as shown in Figure 23-16. Click the Create Offline Data File button to open the Offline OLAP Wizard, also called Create Cube File. This wizard will guide you through the process of creating a local cube (*.cub) file.
Figure 23-16 The Offline OLAP Settings dialog box
The first page of the wizard explains the local cube creation process. On the second page of the wizard, you are presented with a list of all dimensions from the OLAP cube. Dimension members that are currently selected to be shown in the PivotTable on the workbook page are shown as selected and appear in bold in the list. The dimensions and levels you choose here control which dimensions and levels are available in the offline copy of the cube. An example is shown in Figure 23-17.
Figure 23-17 Create Cube File – Step 2 Of 4
On the third page of the wizard, you are shown a summary of the dimension members for your local cube. You can select or clear complete objects (clearing an object excludes all of its members), or you can remove individual members, levels, and so on from your selected parent items. In Figure 23-18, our selection mirrors the earlier filter that we configured in the main PivotTable. That is, we've included the Country dimension, but have filtered out the Canada member from our view.
Figure 23-18 Create Cube File – Step 3 Of 4
On the last page of this wizard, you configure the path and file name where you'd like Excel to create and store the local cube file. The default file path is C:\Users\%username%\Documents\%cubename%.cub. The file that is created is a local cube (.cub) file that you can reconnect to from Excel later.
Note You will find that while most values are selected by default, a few dimensions may not be part of your PivotTable at all (such as Destination Currency). This can cause the wizard to throw an error when trying to save the offline cube, because Destination Currency has a many-to-many relationship with Internet Sales, and all the appropriate intermediate values may not be selected by default.
Excel OLAP Functions

Excel 2007 exposes a group of new functions that allow you to work with OLAP cube information via Excel formulas. These functions are listed in Table 23-1.

Table 23-1 Excel OLAP Functions

Function                                              Description
CUBEMEMBER(connection,member)                         Returns the member defined by member_name
CUBEKPIMEMBER(connection,kpi_name,kpi_property)       Returns the KPI property defined by kpi_name
CUBEVALUE(connection,member1,member2, …)              Returns the value of a tuple from the cube
CUBESET(connection,set_expression)                    Returns the set defined by set_expression
CUBERANKEDMEMBER(connection,set_expression,rank)      Returns the nth item from a set
CUBEMEMBERPROPERTY(connection,member,property)        Returns a property of a cube member
CUBESETCOUNT(set)                                     Returns the number of items in a set
These functions are used directly in the Excel formula bar and make it simpler to retrieve values from an OLAP cube by using custom formulas. Although these new Excel functions may suffice for your particular business requirements, in some situations your requirements might call for more advanced customization. We'll address just how you can do that in the next section.
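For example, assuming a workbook that already contains a connection to the sample cube (here named AdventureWorksConn, a placeholder for the connection name listed in your own workbook), you could type formulas such as the following directly into cells. The member unique names shown assume the Adventure Works DW 2008 sample cube.

=CUBEMEMBER("AdventureWorksConn","[Measures].[Internet Sales Amount]")
=CUBEVALUE("AdventureWorksConn","[Measures].[Internet Sales Amount]","[Customer].[Customer Geography].[Country].&[United States]")
=CUBESETCOUNT(CUBESET("AdventureWorksConn","[Customer].[Customer Geography].[Country].Members","Countries"))

The first formula returns the measure as a member, the second returns the Internet Sales Amount value for the United States, and the third counts the members of the Country level.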
Extending Excel

If your business requirements call for using Excel as a base client for your BI solution, but also require you to extend Excel, you should take a look at using the Microsoft Visual Studio Tools for the Microsoft Office System (commonly called VSTO). You can use this to extend Excel to create custom output styles for PivotChart or PivotTable objects. Another reason to extend Excel by using code is to overcome the built-in limits, such as 256 page fields for a PivotTable. For more specifics on Excel data storage limits, see http://office.microsoft.com/en-us/excel/HP051992911033.aspx.
The development templates for VSTO are included with Visual Studio 2008 Professional or the Team editions of Visual Studio 2008. A free run-time download is available at http://www.microsoft.com/downloads/details.aspx?FamilyID=54eb3a5a-0e52-40f9-a2d1-eecd7a092dcb&DisplayLang=en. If you have one of the full versions of Visual Studio 2008, it contains templates that allow you to use the various 2007 Office system formats as a basis for application development. In our current context, you'll choose Excel 2007 Workbook. The Visual Studio 2008 New Project dialog box contains several templates for custom Excel programming. These include templates for Excel add-ins, workbooks, and templates. Note that the VSTO templates are version-specific, meaning that different versions of the development templates are available for Excel 2003 and for Excel 2007. For example, if you select and open the Excel 2007 Workbook template, you are presented with the familiar Excel workbook inside of Visual Studio 2008. There you can add any of the common Windows Forms controls, as well as add .NET code to Excel as your business needs require.

Office applications that you have extended programmatically are called Office Business Applications, or OBAs. A developer resource center that includes the usual code samples, training videos, and so on is available on MSDN at http://msdn.microsoft.com/en-us/office/aa905533.aspx. An Excel 2007 developer portal is also available on MSDN at http://msdn.microsoft.com/en-us/office/aa905411.aspx. Finally, you will also want to download the Excel 2007 XLL SDK from http://www.microsoft.com/downloads/details.aspx?FamilyId=5272E1D1-93AB-4BD4-AF18-CB6BB487E1C4&displaylang=en.

To see a sample project created using VSTO, visit the CodePlex site named OLAP PivotTable Extensions. You can find this project at http://www.codeplex.com/OlapPivotTableExtend. In this project the developer has created an extension to Excel 2007 that allows authorized end users to define private calculated members that are specific to their particular PivotTable session instance by using a Windows Forms user interface. This extension project also contains a custom library view for easier management of these newly added calculated members. This is a good example of an elegant, business-driven extension to Excel's core PivotTable functionality.
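As a small illustration (a sketch using assumed names, not code from the book), the following Sheet1 class from a Visual Studio 2008 Excel 2007 Workbook (VSTO) project writes one of the cube formulas from the previous section into a cell when the customized workbook opens. AdventureWorksConn is again a placeholder for a connection that already exists in the workbook.

' Sheet1.vb in an Excel 2007 Workbook (VSTO) project.
Public Class Sheet1

    Private Sub Sheet1_Startup(ByVal sender As Object, ByVal e As System.EventArgs) _
        Handles Me.Startup
        ' Write a CUBEVALUE formula into cell B2 when the worksheet starts up.
        Me.Range("B2").Formula = _
            "=CUBEVALUE(""AdventureWorksConn"",""[Measures].[Internet Sales Amount]"")"
    End Sub

    Private Sub Sheet1_Shutdown(ByVal sender As Object, ByVal e As System.EventArgs) _
        Handles Me.Shutdown
    End Sub

End Class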
Summary

In this chapter we investigated the integration between SQL Server 2008 SSAS OLAP cubes and Excel 2007. We looked at the mechanics around PivotTable and PivotChart displays. We then investigated the process for creating offline OLAP cubes. We followed this by examining the new Excel functions, which allow direct OLAP cube data retrieval from the Excel formula bar. We closed our discussion with a look at the setup required to extend Excel programmatically. In the next chapter we'll continue to look at Excel as a SQL Server 2008 BI client, but not for cubes. In that chapter we'll look at how Excel is used as a client for SSAS data mining structures and models.
Chapter 24

Microsoft Office 2007 as a Data Mining Client

In this chapter we take a look at using the 2007 Microsoft Office system as an end-user client for SSAS data mining objects. Here we'll examine the ins and outs of using Microsoft Office Excel 2007 and Microsoft Office Visio 2007 as end-user clients for SSAS data mining structures and models. We'll take a look at the functionality of the Microsoft SQL Server 2008 Data Mining Add-ins for Office 2007. As you may recall from Chapter 2, "Visualizing Business Intelligence Results," these add-ins enable both Excel 2007 and Visio 2007 to function as end-user interfaces to SSAS data mining models. We'll start by reviewing the installation process for the Data Mining Add-ins.
Installing Data Mining Add-ins

As we introduced in Chapter 2, several products in the 2007 Office suite are designed to work as SSAS data mining clients, including Excel 2007 and Visio 2007. For best performance, Microsoft recommends installing SP1 for Office 2007 prior to setting up and using either of these as data mining clients. After you've installed the Office service pack, then you must download, install, and configure the free SQL Server 2008 Data Mining Add-ins for Office 2007. You can download the add-ins at http://www.microsoft.com/downloads/details.aspx?FamilyId=896A493A-2502-4795-94AE-E00632BA6DE7&displaylang=en.

After you've downloaded and installed the add-ins, you'll see several new items related to them under the Microsoft SQL 2008 Data Mining Add-Ins item on your Start menu (see Figure 24-1), including Data Mining Visio Template, Getting Started, Help And Documentation, Sample Excel Data, and Server Configuration Utility. Using the add-ins from within Excel or Visio requires certain configuration information for an SSAS instance. So the best next step is to open the Server Configuration Utility, which is a four-step wizard that guides you through the process of configuring a connection from Excel and Visio to your particular SSAS instance.
Figure 24-1 Data Mining Add-ins for Office 2007
To use the Server Configuration Utility, open it from the Start menu. On the first page of the wizard, you're asked to supply the name of the SSAS instance and the authentication type you wish to use when connecting. The default authentication type is Windows Credentials. On the next page of the wizard, you're asked whether you wish to allow the creation of temporary (or session) mining models. If you select this option, authorized end users can create these models by invoking SSAS algorithms and populating models with data from their local Excel workbooks.

On the third page of the wizard, you are asked whether you'd like to create a new SSAS database (or add the information to an existing database) to hold information about authorized user names for the add-ins. If you select New, a new SSAS database (named DMAddinsDB by default) is created on the configured SSAS instance. A single security role with a name like ExcelAddins_Role_3_19_2008 5_59_27 PM is created, using the current date and time. Local administrators are added to this role by default. Also by default, this role has full control on all objects. As with any SSAS role, you can of course adjust role membership and permissions as necessary. On the last page of the wizard (shown in Figure 24-2), you specify whether you'll allow users of the Data Mining Add-ins to create permanent models on the SSAS instance.
Figure 24-2 The last page of the Server Configuration Utility, where you set permanent object creation permissions
Data Mining Integration with Excel 2007

After you install the Data Mining Add-ins, the data mining integration functionality is exposed in Excel and Visio as additions to the menus inside of each. In Excel, two new tabs appear on the Ribbon: Table Tools Analyze and Data Mining. To learn more about the add-ins we'll open the included sample data file called DMAddins_SampleData.xlsx, which is located (by default) at C:\Program Files\Microsoft SQL Server 2008 DM Add-Ins.

Note If you worked with the SQL Server 2005 Data Mining Add-ins for Office 2007 and are now working with SQL Server 2008, you need to download and install the version of the add-ins that is specific to SQL Server 2008. The 2008 edition of the add-ins has several additions to functionality. We review the changes to the add-ins in detail later in this chapter.
After you open Excel 2007, click the Data Mining tab on the Ribbon (Figure 24-3). The add-ins add this tab permanently to Excel 2007. Notice that the Data Mining tab has seven major groups: Data Preparation, Data Modeling, Accuracy And Validation, Model Usage, Management, Connection, and Help. Each group includes one or more large buttons that give you access to the functionality available in the group. In addition, some buttons include a tiny downward-pointing triangle that indicates additional functionality available at a click of that button.
Figure 24-3 Data Mining tab on the Excel 2007 Ribbon
As we continue through this section, we’ll work through the functionality of the majority of the buttons on the Data Mining tab of the Ribbon. However, we’ll first introduce the other point of integration that the add-ins add to Excel 2007. To see this, you must select at least one cell in an Excel table object. You’ll then see Table Tools on the Ribbon. Click the Analyze tab to see the Table Analysis Tools, Connection, and Help groups, as shown in Figure 24-4.
Figure 24-4 The Table Tools Analyze tab on the Excel 2007 Ribbon
Because these tools expose the simplest set of functionality, we’ll start our detailed look at the integration between Excel 2007 and SSAS data mining by working with the Table Tools
Analyze tab. We’ll continue our tour by looking in more detail at the functionality of the Data Mining tab.
Using the Table Analysis Tools Group

Before we dive in and work with the Table Analysis Tools group, let's take a minute to consider what types of end users these tools have been built for. An effective way to do that is to look at the help file included for the add-ins in general. Remember that the help for the add-ins is separate from SQL Server Books Online—you access it via Excel 2007. Here's an excerpt from the SQL Server Books Online introduction:

The SQL Server 2008 Data Mining Add-ins for Office 2007 provides wizards and tools that make it easier to extract meaningful information from data. For the user who is already experienced with business analytics or data mining, these add-ins also provide powerful, easy-to-use tools for working with mining models in Analysis Services.

Glancing through the included help topics, you can see that the intended user is an intermediate Excel user, particularly someone who uses Excel for analysis tasks. We think of this user as a business analyst. That is probably not surprising to you. What may be surprising, however, is the end user we see as a secondary user of Table Analysis Tools and data mining functionality. That end user is you, the technical BI developer. If you are new to BI in general or if you are new to data mining, using the integration in Excel is a fantastic (and time-efficient) way to understand the possibilities of data mining. That said, you may eventually decide that you prefer to work in BIDS. However, we find the combination of working in Excel and in BIDS to be the most effective.

One other consideration is that if you are thinking about creating a custom application that includes data mining functionality, using the add-ins has two points of merit. First, you can see an example of an effective data mining client application. Second, you can actually use what already exists in Excel as a basis for a further customized application, by customizing or extending the included functionality using .NET Framework programming and the Visual Studio Tools for Office (VSTO). For more information about VSTO, go to http://msdn.microsoft.com/en-us/office/aa905533.aspx.

To start using Table Analysis Tools, you must first configure a connection to an SSAS instance. You might be surprised that you have to do this inside of Excel because you already did a type of connection configuration during the initial setup. This second configuration allows you to set permissions more granularly (more restrictively) than you set in the "master" connection when you configured the add-ins setup. Configuring the user-specific setup is simple. Click the Connection button on the tab and then click New in the dialog box. This opens a familiar connection dialog box where you list the instance name and connection information, as shown in Figure 24-5.
Figure 24-5 Connect To Analysis Services dialog box
Now, we've told you that you need a connection, but we haven't yet shown you why. This should become obvious shortly. At this point, suffice it to say that from within Excel you'll be connecting to and using the data mining algorithms that are part of SSAS to analyze data from your local Excel workbook.

Tip As we get started, you'll notice that Excel has no profiling or tracing capability, so you can't natively see exactly what is generated on the SSAS instance. Of course, if you wanted to capture the generated information, you could turn on SQL Server Profiler, as we've described in Chapter 6, "Understanding SSAS in SSMS and SQL Server Profiler." It is interesting to note that the Data Mining tab does contain a method of profiling the queries called Trace. You may want to turn it on as we work through the capabilities of the Table Analysis Tools group.
Figure 24-6 shows the Table Analysis Tools group, which we’ll explore next. The first button in the group is Analyze Key Influencers. True to the intended end-user audience, the names of the tools in this group are expressed in nontechnical terms.
Figure 24-6 The Table Analysis Tools group expresses data mining functionality in nontechnical terms.
This language continues as you actually begin to use the functionality. For example, when you select the Table Analysis Tools Sample worksheet in the sample workbook and click the Analyze Key Influencers button, a dialog box opens that presents you with a single choice—which column is to be analyzed—and a link to more advanced functionality (Add/Remove Considered Columns). The dialog box optionally allows you to continue to analyze the results
by adding a report that shows how the influencers are discriminated (selected). The Analyze Key Influencers dialog box is shown in Figure 24-7.
Figure 24-7 The Analyze Key Influencers dialog box
The outcome of this dialog box is an easy-to-understand workbook page that shows the key influencers ranked by level of influence for the selected column and its possible states. In our example, bike buyer can be either 1 or 0 (yes or no). It’s really that simple. Figure 24-8 shows the resulting output table in Excel.
Figure 24-8 Using the Analyze Key Influencers button produces a table output in Excel.
It's important to understand what just happened here. If you use the Tracer tool, you can see that a temporary data mining model was created and was then populated (or trained) using the Excel table (spreadsheet) as source data. Closer examination of either Tracer or SQL Server Profiler output shows that Analyze Key Influencers performed the following steps:

1. Created a temporary mining structure, marking all source columns as discrete or discretized
2. Added a mining model to the structure using the Microsoft Naïve Bayes algorithm
3. Trained the model with the Excel data
4. Retrieved metadata and data from the processed model

The following detailed DMX statements were produced:

CREATE SESSION MINING STRUCTURE [Table2_572384] (
    [__RowIndex] LONG KEY, [ID] Long Discretized, [Marital Status] Text Discrete,
    [Gender] Text Discrete, [Income] Long Discretized, [Children] Long Discrete,
    [Education] Text Discrete, [Occupation] Text Discrete, [Home Owner] Text Discrete,
    [Cars] Long Discrete, [Commute Distance] Text Discrete, [Region] Text Discrete,
    [Age] Long Discretized, [Purchased Bike] Text Discrete)

ALTER MINING STRUCTURE [Table2_572384]
ADD SESSION MINING MODEL [Table2_572384_NB_234496] (
    [__RowIndex], [ID], [Marital Status], [Gender], [Income], [Children], [Education],
    [Occupation], [Home Owner], [Cars], [Commute Distance], [Region], [Age],
    [Purchased Bike] PREDICT)
USING Microsoft_Naive_Bayes(MINIMUM_DEPENDENCY_PROBABILITY=0.001)

INSERT INTO MINING STRUCTURE [Table2_572384] (
    [ID], [Marital Status], [Gender], [Income], [Children], [Education], [Occupation],
    [Home Owner], [Cars], [Commute Distance], [Region], [Age], [Purchased Bike])
@ParamTable

CALL System.GetPredictableAttributes('Table2_572384_NB_234496')
CALL System.GetAttributeValues('Table2_572384_NB_234496', '10000000c')
CALL System.GetAttributeDiscrimination('Table2_572384_NB_234496', '10000000c', '', 0, '', 2, 0.0, true)
CALL System.GetAttributeDiscrimination('Table2_572384_NB_234496', '10000000c', 'No', 1, '', 2, 0.0, true)
CALL System.GetAttributeDiscrimination('Table2_572384_NB_234496', '10000000c', 'Yes', 1, '', 2, 0.0, true)

SELECT FLATTENED
    (SELECT [SUPPORT] FROM NODE_DISTRIBUTION
     WHERE ATTRIBUTE_NAME='Purchased Bike' AND VALUETYPE=1)
FROM [Table2_572384_NB_234496].CONTENT
WHERE NODE_TYPE=26

CALL System.GetAttributeValues('Table2_572384_NB_234496', '10000000c')
Now that you understand what is happening, you should work your way through the rest of the buttons on the Table Tools Analyze tab. You’ll see that the Detect Categories button uses the Microsoft Clustering algorithm to create a temporary mining model that results in a workbook page that groups your input data (from one or more columns) into clusters or categories and shows the results both statistically (through a row count of category values) and via a stacked bar graph. You can use the Detect Categories button to quickly group data
from an Excel workbook source into buckets, or groups. Using this functionality can help you to understand your data, and the correlations between attributes in your data, more quickly. For example, you can create groups (or ranges) of income values in a source dataset that includes a large number of other attributes, such as age, number of children, home ownership, and so on.

The Fill From Example button asks you to select an example column and then produces output workbook pages showing suggested values that you can use to extend your series. You can use this functionality to quickly fill in a series on a source workbook. Such series can include time, financial results, age, and so on. A look at the output in the Tracer tool shows that this functionality uses the Microsoft Logistic Regression algorithm.

The Forecast button requires a bit more configuration in its associated dialog box, shown in Figure 24-9. To use it you select the column (or columns) that you wish to forecast. Only columns with source data of types that can be used as input to the associated algorithm appear available from the source workbook in this dialog box. The next step is to confirm or adjust the number of time units that you wish to forecast. The default value is five units. In the Options section, you can set the source column for the time stamp. Finally, you can set the periodicity of the data.

Note Your screen may say Income rather than Yearly Income, depending on which version of the Excel sample workbook for data mining (called DMAddins_SampleData.xlsx) you are using.
Figure 24-9 Using the Forecast button creates a mining model using the Microsoft Time Series algorithm.
Did you guess which algorithm was used here? It's pretty obvious, isn't it? It's the Microsoft Time Series algorithm. You may have noticed that our sample data does not contain any temporal information. We did this on purpose to illustrate a point. Although the Table Analysis Tools may be straightforward to use, you must still observe some common-sense practices, and you must still select the correct tool for the job at hand. Running Forecast against data with no time column produces a result, but that result may not be meaningful. Also, the overhead to produce this result can be pretty steep, because the algorithm has to "fake" a time column. Of course, fake data is rarely meaningful, so although the algorithm runs even if you use source data without time values, we advise against doing this. If you are wondering how this is done, again, use the Tracer tool. We did, and here's the key line:

CREATE SESSION MINING STRUCTURE [Table2_500896] (
    [Income] Long CONTINUOUS, [Age] Long CONTINUOUS, [__RowIndex] LONG KEY TIME)
Note the addition of the __RowIndex column with type LONG KEY TIME. Although understanding data mining concepts is not really needed to use the Table Analysis Tools, this idea of including a time-based source column is important when using the Forecasting functionality. As mentioned earlier, it is important to include data with a time-series value when you run the Forecast tool from the Table Analysis Tools.

The next button, Highlight Exceptions, simply asks you to select one or more of the source columns for examination. This button, like those reviewed so far, creates a mining structure and model. Of course, it also runs a DMX query. Notice that the DMX Predict, PredictVariance, and PredictCaseLikelihood functions are used here. These functions generate a result (a new workbook page in Excel) that allows you to quickly see the exception cases for your selected data. These exceptions are sometimes called outliers. Understanding this can help you to judge the quality of the source data; in other words, the greater the quantity of outliers, the poorer the quality of the data. The particular DMX prediction function invoked when you use the Highlight Exceptions functionality depends on the source data type selected for examination. The text of the generated query looks like this:

SELECT
    T.[Income],
    Predict([Income]),
    PredictVariance([Income]),
    PredictCaseLikelihood()
FROM [Table2_372960_CL_463248]
NATURAL PREDICTION JOIN @ParamTable AS T

ParamTable = Microsoft.SqlServer.DataMining.Office.Excel.ExcelDataReader
Note The input is read from Excel using an ExcelDataReader object. If you were to write a custom data mining client, this is the object and the library that you would work with to do so.

The next button, Scenario Analysis, allows you to perform goal-seeking or what-if scenarios on one column and one or more rows of data. A goal-seeking scenario generates a DMX query that uses the PredictStdDev function. A what-if scenario generates a DMX query that uses the PredictProbability function. The output from both queries is shown at the bottom of
the configuration dialog box. The output is assigned a confidence score as well. Figure 24-10 shows the output from using the What-If Scenario Analysis. In this example we’ve configured the scenario change (input) column to use Income. We’ve asked to analyze the impact on the target column named Commute Distance. For expediency, we’ve asked to perform analysis on a single row from the source table. You can see that the output of the query predicts that an increase in yearly income correlates with a commute distance of 0-1 miles. The confidence of the result is ranked from poor to very good as well as color-coded and bar-graphed to show the strength of that resulting confidence.
Figure 24-10 The What-If Scenario Analysis output includes a confidence ranking.
Because of the popularity of the Table Analysis Tools group, Microsoft has added two new buttons, Prediction Calculator and Shopping Basket Analysis, to the Table Tools Analyze tab in the SQL Server 2008 Data Mining Add-ins.

The Prediction Calculator requires a bit of configuration before it can be used, as shown in Figure 24-11. The first value you must configure is Target, which represents the column and either an exact value or a range of values from that column's data for which you want to detect prediction patterns. Because the column can contain multiple values and you want to restrict the values being predicted for, you use the Exactly or In Range option. Of course, you can select a range only for input columns whose values can be considered continuous, such as income. Two optional values, Operational Calculator and Printer-Ready Calculator, are selected by default. These outputs are produced in addition to the main report so that a user can input values and see the variance (or
cost) of the changes to the model. For our example, we set the Target column to Education and the value to Bachelors.
Figure 24-11 The Prediction Calculator includes an Operational Calculator.
This tool uses the Microsoft Logistic Regression algorithm and marks the Target value with the Predict attribute. The output produced is a dynamic worksheet that allows you to tinker with the positive or negative cost or profit. Literally, this helps you to understand the potential profit (or cost) of making a correct (or incorrect) prediction. The worksheet consists of four parts. The first section is a small calculation section where you can adjust the values associated with positive or negative cost or profit. The other three sections show the point of reaching profitability based on the input calculations. As you adjust the costs associated with the targeted value upward or downward, you can immediately see the projected profit score threshold and cumulative misclassification values change on the linked charts. Because we left the default values of Operational Calculator and Printer-Ready Calculator selected as well, we also get two additional workbook pages that can be used interactively or printed and used by end users who wish to fill out (and score) the results manually. These calculators assign scores (or weights) to each of the attribute values. Adding the scores for selected attributes produces the likelihood results for the target attribute. The results of this report are shown in Figure 24-12.
Figure 24-12 The Prediction Calculator allows you to test different values associated with costs and profits.
The Shopping Basket Analysis button uses the Microsoft Association algorithm and produces several reports that can help you to understand which items sell best together. It also suggests which items sold together would increase your profits the most. To use this button, you'll want to use data from the sample worksheet on the Associate tab. This is because this data is in the format expected by the tool, meaning that it contains an order number (key), category (optional), product name, and price. You can see from the configuration dialog box that these values are called Transaction ID, Item, and Item Value. Although Item Value is optional, the tool produces much more meaningful results if this information is available.

The Advanced (Configuration) link lets you adjust the default values for Minimum Support (set to 10) and for Minimum Probability Rule (set to 40). The former is the minimum number of transactions required to create a rule; the latter is the minimum strength of correlation required to create a rule. As you may remember from Chapter 13, "Implementing Data Mining Structures," and Chapter 14, "Architectural Components of Microsoft SQL Server 2008 Integration Services," the Microsoft Association algorithm creates rules and then shows itemsets, or groups of items that generate increasingly greater total return values. You can view the Shopping Basket Analysis tool in Figure 24-13.
Figure 24-13 The Shopping Basket Analysis tool uses the Microsoft Association algorithm.
The output from this tool is two new workbook pages, the Shopping Basket Bundled Items report and the Shopping Basket Recommendation report. The first report shows you bundled items (minimum of two items per bundle), number of items in the bundle, number of sales of this bundle, average value per bundle, and overall sales total of bundled items, as you can see in Figure 24-14. This report is sorted by overall sales total, but you can, of course, re-sort it to your liking. The second report shows you which items when bundled together result in increased sales and suggests items to cross-sell.
Figure 24-14 The Shopping Basket Bundled Items report shows which items have sold together.
The first column lists a particular item; this is followed by the top recommendation for cross-selling with that particular item. The table also shows counts, percentages, and estimated sale values of cross-selling the listed items. As with the Shopping Basket Bundled Items report, this report is also sorted by overall value (or potential profit) dollars by default.

As we complete our tour of the Table Tools Analyze tab, you should remember a couple of points. This tab's primary purpose is to expose the functionality of some of the SSAS data mining algorithms to users who wish to use Excel table data as source data. Note that we did not access, create, update, or in any other way work with permanent data mining models on the SSAS server. For tasks such as those, we'll look next at the other data mining tool available in Excel: the Data Mining tab of the Ribbon.
Using the Data Mining Tab in Excel 2007

Before we start working with the Data Mining tab, let's take a minute to understand the conceptual differences between the tools and functionality it exposes and those available on the Table Tools Analyze tab of the Ribbon. Refer again to Figure 24-3, the Data Mining tab, as we discuss its design. Take a look at the tab's groups: Data Preparation, Data Modeling, Accuracy And Validation, Model Usage, Management, Connection, and Help. It is interesting to note that these group names roughly correspond to the CRISP-DM SDLC phases that we discussed in Chapter 13.

These names might seem odd because Excel is typically an end-user client. In the case of data mining, Excel is also a lightweight development tool. This type of implementation is quite new in Microsoft's product suite—using an Office product as an administrative or developer interface to a server-based product has rarely happened. It is quite important that you understand that the Data Mining tab (interface) is designed for two types of end users: business analysts and BI administrators. Of course, in smaller organizations it is quite possible that these functions could be performed by the same person—in fact, that is something that we've commonly seen with clients who are new to data mining. Also, as we've mentioned, if you are an application developer who is completely new to data mining, using the Data Mining tools, rather than BIDS itself, to develop, query, and manage data mining models on your SSAS instance may prove to be more productive for you.

Unlike the Table Analysis Tools group, the Data Mining tab functionality can generally (but not always) interact with data that is stored either in the local Excel workbook or on the SSAS server instance. The two tabs have some functionality in common: access to the Data Mining Add-in–specific help file and access to the required Connection configuration object. As with the Table Analysis Tools group, using any of the tools exposed on the Data Mining tab requires an active connection to an SSAS instance because mining algorithms are being used to process the data in the request. We mentioned the use of the included Tracer tool when we discussed the Table Analysis Tools group. Tracer captures and displays activity
generated on the SSAS instance from the use of any of the Data Mining tools in Excel. Figure 24-15 shows some sample output from the Tracer tool. As we did when examining the Table Analysis Tools group, we’ll use Tracer when working with the tools on the Data Mining tab so that we can better understand what the tools are actually doing.
Figure 24-15 The Tracer tool shows generated activity on the SSAS server from the Data Mining Add-ins in Excel.
We'll now take a look at the remaining groups and functionality of the tab—those that are unique to this particular tab: Data Preparation, Data Modeling, Accuracy And Validation, Model Usage, and Management. We'll start by taking a look at some of the administrative capabilities built into the Data Mining tab. These capabilities are found in the Management and Model Usage groups on the tab.
Management and Model Usage

The Management group of the tab has one button, Manage Models. This button allows authorized Excel users to perform a number of functions on existing models that are stored in the configured SSAS instance. After you click this button, a dialog box opens (Figure 24-16) that lists all mining structures and their contained models on the left side. On the right side, you'll see a list of actions that can be performed on a selected model or structure. Possible actions for structures include renaming, deleting, clearing data, processing, exporting existing metadata, and importing new metadata. Possible actions for models include renaming, deleting, clearing data, processing, and exporting or importing existing metadata. If you choose to export metadata, you must specify the export file destination location. The type of output file produced is a native application format, meaning that it is not XMLA. Instead it is an SSAS backup file that can be restored to an SSAS instance using SQL Server Management Studio.
Figure 24-16 The Manage Models tool exposes administrative functionality through Excel for SSAS data mining objects.
On the bottom right of this dialog box, metadata about the selected structure or object is shown. You may be wondering whether the Data Mining tab buttons are security-trimmed. Security trimming is a feature that removes menu items or tools when the currently active user does not have permission to perform the actions associated with them. In the case of the Data Mining tab, the tools are not security-trimmed, which means that end users in Excel can see all the tools but may be connecting to an SSAS instance using credentials that are authorized to perform only a limited subset of the available options. If a user tries to perform an action for which she is not authorized, that action fails and a dialog box alerts the user that the action has failed.

The next button, Document Model, appears in the Model Usage group and is new in SQL Server 2008 SSAS. When you click this button, the Document Model Wizard opens. The first page of the wizard describes what the tool does. On the second page of the wizard, you are presented with a list of all mining structures and models on the SSAS server instance. You click the mining model that you are interested in documenting, and on the next page of the wizard you choose the amount of detail you'd like to see in the documentation (complete or summary). The wizard produces a new workbook page with the selected mining model's characteristics documented. Figure 24-17 shows a partial example of the output using the Customer Clusters model that is part of the Adventure Works DW 2008 sample. Notice that it contains the following information: model information, mining model column metadata, and algorithm parameter configuration values. You may want this information for backup and restore (general-maintenance) purposes.
Figure 24-17 The new Document Model tool provides a quick and easy way to document model settings.
The Browse button, which is closely related in functionality to the Document Model button, is located in the Model Usage group of the tab. As with the tools we've looked at so far, when you click this button you are first presented with a list of mining structures and models located on the SSAS instance. You then select the model of interest, and Excel presents you with the same model viewers that we already saw in BIDS. Remember from our earlier discussion of those mining model viewers that each of the nine included data mining algorithms is associated with one or more viewers.
Excel's Browse capability adds a useful feature to some of the viewers. At the bottom left of the dialog box that displays the viewer (for most viewers), you'll see a Copy To Excel button. The results of clicking this button vary depending on what type of viewer you are looking at. In some cases, the interactive viewer is rendered as a graphic in a new worksheet. In other cases, the information from the viewer is loaded into a new worksheet as Excel data and auto-formatted. Figure 24-18 shows a portion of the example output for the latter case. We used the Customer Clusters sample mining model and copied the Cluster Profiles viewer data to Excel. We've found integration capabilities such as this one to be very useful and powerful. Remember that when the viewer's Copy To Excel action dumps the source data into a new workbook, the end user can manipulate that data using any of Excel's functionality, such as sort, filter, format, and so on.
Figure 24-18 The Copy To Excel feature used with Cluster characteristics produces a new workbook page.
The last button in the Model Usage group of the Data Mining tab is the Query button. This button gives authorized end users the ability to build DMX prediction queries by using a guided wizard. When you click the Query button, the wizard presents you with a series of pages. These pages include tool explanation, model selection, source data (which can be from the open Excel workbook or from any data source that has been configured on SSAS), model and input column mapping, output selection, and query result destination. This tool functions similarly to the Mining Model Prediction tab in BIDS in that it is designed to help you write and execute DMX prediction queries. The key difference in the version available on the Data Mining tab of the Ribbon in Excel is that you can define input tables using data from an Excel workbook as an alternative to using SSAS source data for that purpose.
An advanced query builder is also available from within the wizard. You access this query builder by clicking the Advanced button on the lower left of the column or output mapping page. The Data Mining Advanced Query Editor is shown in Figure 24-19. Note that in addition to letting you edit the query directly, this tool also includes DMX templates, as well as quick access to the other wizard input pages, such as Choose Model. You can click the parameter values, which are indicated by angle brackets in Figure 24-19, and Excel opens a dialog box that allows you to quickly configure the needed value. For example, if you click the output placeholder shown in Figure 24-19, the Add Output dialog box opens, where you can quickly complete those values. After you click OK, you are returned to the Data Mining Advanced Query Editor to continue editing the entire query.
Figure 24-19 The Data Mining Advanced Query Editor
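Whichever way you build it, what the wizard ultimately sends to the server is a DMX prediction query (you can confirm the exact text with the Tracer tool). The following minimal sketch shows the general shape of such a query against the Adventure Works TM Decision Trees model, executed here through ADOMD.NET rather than the wizard; the connection string, the OPENQUERY data source name, and the relational input query are illustrative assumptions, not values produced by the add-in.

// Minimal sketch: executing a DMX prediction query with ADOMD.NET.
// The connection string, data source name, and input query are assumptions for illustration.
using System;
using Microsoft.AnalysisServices.AdomdClient;

class DmxPredictionSample
{
    static void Main()
    {
        using (AdomdConnection conn = new AdomdConnection(
            "Data Source=localhost;Catalog=Adventure Works DW 2008"))
        {
            conn.Open();
            AdomdCommand cmd = conn.CreateCommand();
            // Predict Bike Buyer for rows returned by an OPENQUERY against a relational
            // data source defined on the SSAS instance; dbo.NewProspects is hypothetical.
            cmd.CommandText =
                "SELECT t.[CustomerKey], Predict([TM Decision Trees].[Bike Buyer]) " +
                "FROM [TM Decision Trees] " +
                "PREDICTION JOIN " +
                "OPENQUERY([Adventure Works DW], " +
                "'SELECT CustomerKey, Age, YearlyIncome FROM dbo.NewProspects') AS t " +
                "ON [TM Decision Trees].[Age] = t.[Age] " +
                "AND [TM Decision Trees].[Yearly Income] = t.[YearlyIncome]";
            using (AdomdDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1}", reader.GetValue(0), reader.GetValue(1));
                }
            }
        }
    }
}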
The tools that we’ve covered so far are quite powerful and are designed for advanced analysts as well as BI administrators and developers. These tools are designed primarily to allow Excel users to work with models on the SSAS server, although there are exceptions. The next group of tools we’ll review has a different focus in that it is primarily designed to facilitate quick evaluation of Excel data using SSAS data mining algorithms.
Data Preparation Group
The first group on the Data Mining tab is called Data Preparation. It includes the Explore Data, Clean Data, and Sample Data buttons. Generally, these three buttons are designed to apply data mining functionality to locally stored data (meaning data in the Excel workbook). You can think of the functionality as a kind of SSIS-light look at Excel data. Clicking any of these buttons opens a descriptive wizard page. The second page of the wizard allows you to select the data you'd like to work with. For exploring and cleaning data, you are limited to working with Excel data; your selection options are Table or Range. For Sample Data, you may select Excel data or data from a data source configured in SSAS. These tools allow a quick review, simple cleaning, or sampling of source data.
To use Explore Data, simply click the button to start the wizard. After you select your data source, you select a single column for analysis. Sample output based on the Education column in the Table Analysis Tools Sample Table 2 is shown in Figure 24-20. This tool counts discrete data values in the selected column and graphs the output. A bar graph is the default output. You can toggle between charting the discrete values and looking at continuous numeric data by clicking the two small chart buttons on the bottom left of the output page. The column you are exploring must be numeric if you want to use the numeric view.
Figure 24-20 The Explore Data Wizard automatically "buckets" your data using a clustering algorithm.
The next button, Clean Data, includes two functions, Outliers and Re-label. Click Outliers, click Next on the tool description page, and then select the data you wish to clean. You then select the column you wish to analyze, and Excel presents you with output that shows the values from the selected column graphed. You can adjust the expected minimum or maximum values with the slider controls on the Specify Thresholds page. Values that fall outside of this range are considered outliers. Figure 24-21 shows a sample from the Source Data worksheet, Source Data table, and Age column. As with the Explore Data output page, you can toggle the graph type between discrete and numeric displays by clicking the small button on the bottom left corner of the Specify Thresholds page. After you define the outliers and click Next, on the next page of the wizard you specify what you would like done with the values that you've indicated are outliers. You have four choices:
■■ Change Value To Specified Limits (the default setting)
■■ Change Value To Mean
■■ Change Value To Null
■■ Delete Rows Containing Outliers
Figure 24-21 The Outliers Wizard allows you to set the acceptable range for values.
After you specify your preference for outlier handling, on the last page you specify where you'd like to put the modified data. You can add it as a new column in the current workbook, place it in a new workbook, or change the data in place.
The Re-label function allows you to specify alternate values for existing column values. The wizard presents a similar series of pages, including one where you list the new labels and one where you specify where the new output should be placed. This is a quick way to update column values to make them consistent.
The last button, Sample Data, allows you to select source data either from Excel or from a data source defined on your SSAS instance and then create a sample set of data from it. Remember that a common technique in data mining is to use one set of data for training and another set of data for validation. Remember also that a new feature of SSAS 2008 is the ability to specify partition values during model creation (supported for most algorithms) using BIDS, so we see this function in Excel being used mostly on Excel source data. After selecting your data source, on the next page of the wizard you are asked to specify the sampling method. You have two choices: random sampling (which is the default) or oversampling (which allows you to specify a particular data distribution). For example, you can use oversampling to ensure that your sample data includes equal numbers of car owners and non–car owners, even if the source data does not reflect that distribution. If you choose random sampling, on the next page of the wizard you specify the size of the sample by percentage or row count. The default is 70%. On the last page of the wizard, you specify the output location. If you choose oversampling as the sampling method, on the next page of the wizard you specify the input column, target state, target percentage, and sample size, as shown in Figure 24-22.
Figure 24-22 The Sample Data Wizard
After you have reviewed your source data, cleaned it, and created samples from it, you may want to use one or more data mining algorithms to help you better understand that data. The next group on the tab, Data Modeling, allows you to do just that.
Data Modeling Group
It is important to understand that a global configuration setting controls an important behavior of all of the tools available in the Data Modeling group on the Data Mining tab. That setting concerns mining model creation. When you set up the initial configuration and connection to the SSAS server using the Data Mining Add-ins Server Configuration Utility, you can specify whether you want to allow the creation of temporary mining models; this option is selected by default. If you've left this default setting selected, using the mining model tools from this group on the tab creates temporary (or session) mining models. These models are available only to the user who created them and only during that user's session. If you disable the creation of temporary mining models, using any of the tools in this group creates permanent mining models on your SSAS instance. The Data Modeling group on the Data Mining tab is shown in Figure 24-23.
Figure 24-23 The Data Modeling group on the Data Mining tab in Excel
Needless to say, appropriate configuration of the model creation location is quite important. Consider carefully which situation better suits your business needs when planning your implementation of data mining with Excel 2007 as a client. All of the tools in this group function similarly in that they create a data mining model (either temporary or permanent) using source data and values you specify as you work through the wizard associated with each tool. The buttons map to the original algorithms like so (a brief example of the kind of model definition this produces follows the list):
■■ Classify uses Microsoft Decision Trees.
■■ Estimate uses Microsoft Decision Trees in regression mode (auto-detects a regressor).
■■ Cluster uses Microsoft Clustering.
■■ Associate uses Microsoft Association.
■■ Forecast uses Microsoft Time Series.
■■ Advanced allows you to select any algorithm using the name it has in BIDS.
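When temporary models are allowed, the add-in creates a session mining model on the server, which is the same kind of object the DMX CREATE SESSION MINING MODEL statement produces (you can confirm the exact text that is generated with the Tracer tool). The following minimal sketch, with an invented model name and column list, shows what such a definition looks like when issued directly through ADOMD.NET; it also assumes the SSAS instance permits session mining models.

// Minimal sketch: creating a session (temporary) mining model directly in DMX.
// The connection string, model name, and columns are invented for illustration.
using Microsoft.AnalysisServices.AdomdClient;

class SessionModelSample
{
    static void Main()
    {
        using (AdomdConnection conn = new AdomdConnection(
            "Data Source=localhost;Catalog=Adventure Works DW 2008"))
        {
            conn.Open();
            AdomdCommand cmd = conn.CreateCommand();
            // The model exists only for the lifetime of this session.
            cmd.CommandText =
                "CREATE SESSION MINING MODEL [TempBikeBuyer] ( " +
                "  [Customer Key] LONG KEY, " +
                "  [Age] LONG CONTINUOUS, " +
                "  [Yearly Income] DOUBLE CONTINUOUS, " +
                "  [Bike Buyer] LONG DISCRETE PREDICT " +
                ") USING Microsoft_Decision_Trees";
            cmd.ExecuteNonQuery();
            // Train it with INSERT INTO and query it with DMX before the
            // connection closes; it is dropped automatically afterward.
        }
    }
}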
Because we’ve spent so much time in previous chapters reviewing the algorithms in detail, we won’t repeat the process here. We’ll just take you through one example so that you can get a feel for how mining model creation works in Excel. We’ll use the workbook page named Associate from the Excel sample data. As we work through the wizard, notice that required values are presented on the Association page, as shown in Figure 24-24.
Figure 24-24 The Association page exposes required configurable property values such as Transaction ID.
You may recall a similar tool on the Table Tools Analyze tab called Shopping Basket Analysis. If you look closely at the wizard page in Figure 24-24 and compare it with the one associated with the Shopping Basket Analysis tool (shown previously in Figure 24-13), you can see that the only difference is that some of the advanced parameter values, such as thresholds for support, are shown on this page. As with the Shopping Basket Analysis tool, to use the Associate tool you need to select a column to associate with the transaction ID and also one for the item. It is also interesting to note that one value that was configurable on the Shopping Basket Analysis page (item price) is missing here. You may remember that the Shopping Basket Analysis tool is new for SQL Server 2008; its configurable parameters probably represent the most common customer requests.
Although you could build your model after configuring only the parameters displayed on this page, you have access to additional parameters when using the tools on the Data Mining tab. If you click the Parameters button on the bottom left of the Association page of the wizard, an Algorithm Parameters dialog box opens, as shown in Figure 24-25. As we've discussed, these advanced parameters vary widely by source algorithm. Although some parameters are fairly self-explanatory, such as MAXIMUM_ITEMSET_COUNT in the following example, we consult SQL Server Books Online when we work with less common manually configured parameters.
Figure 24-25 The Algorithm Parameters dialog box presents advanced parameters that are specific to each algorithm.
After you complete your configuration of all exposed parameters, the Associate Wizard presents you with a page containing metadata about the mining structure that you are about to create. A sample output is shown in Figure 24-26. You can see that in addition to suggested structure and model names and descriptions, you have a couple of options to configure on this last page.
Figure 24-26 The final page of the Associate Wizard presents details of what will be created for the mining structures.
The default option, Browse Model, is shown in Figure 24-26. You can also choose whether to create a temporary model rather than a permanent one, and whether to enable drillthrough. Remember from our previous discussion of drillthrough that not all algorithms support it. Remember also that a new feature of SQL Server 2008 allows drillthrough to all columns included in the mining structure, even if the columns of interest are not part of the mining model.
If you leave the Browse Model check box selected on the Finish page, Excel displays your model using the viewers associated with the particular algorithm that you used to build it. As we discussed when reviewing the functionality of the Browse button on the Ribbon, what is unique about the display of the viewers in Excel, rather than BIDS, is that the viewer includes the ability to copy to Excel. The copy is executed in one of two ways. (The method used varies by algorithm viewer.) The first method is to embed an image file of the viewer in a new workbook page. The second method is to dump the data (and often to apply Excel's autoformatting to that data) into a new workbook page.
Now that you have an understanding of what you can do using the Data Modeling group on the Data Mining tab, you might be wondering in what business situations you should use BIDS to create models rather than Excel. The answer is practical: use the tool that seems the most natural to you. You also want to consider which users, if any, you'll grant model creation permission to. If you do grant that permission, you need to consider whether you'll allow these users to create only temporary (session) models, only server-based models, or some combination of the two.
We generally advocate giving power data miners (usually business analysts) access to SSAS data mining through Excel because most of them are already Excel power users. They can quickly and easily become productive with a familiar tool. We have also had some success getting technical professionals (whether DBAs or developers) quickly up to speed with data mining capabilities by using the Excel add-ins rather than BIDS. As with model building in BIDS, after you create a model in Excel, a best practice is to validate it. To that end, we'll next look at the group on the Data Mining tab of the Excel Ribbon where those tools are found: Accuracy And Validation.
The Accuracy And Validation Group
As with the previous group, in the Accuracy And Validation group you find tools that expose functionality that is already familiar to you because you've seen it in BIDS. The Mining Accuracy Chart tab in BIDS contains nearly identical functionality to the Accuracy Chart, Classification Matrix, Profit Chart, and (new for SQL Server 2008) Cross-Validation tools. As in the previous section, we'll take a look at just one tool to give you an idea of using this functionality in Excel rather than in BIDS. We'll use the Cross-Validation tool for our discussion.
When you click Cross-Validation, a wizard opens where you select your model of interest (remembering that some algorithms are not supported for cross-validation). We'll use the Targeted Mailing structure from the Adventure Works DW 2008 sample and then select the TM Decision Trees model from it. On the Specify Cross-Validation Parameters page, shown in Figure 24-27, we'll leave Fold Count set to the default of 10. We'll also leave Maximum Rows set to the default of 0. We'll change Target Attribute to Bike Buyer, and we'll set Target State to 1.
We find that cross-validation is quite resource-intensive. As we mentioned in Chapter 13, it is not the type of validation that you will choose to use for all of your models. Instead, you may want to use it when you haven't partitioned any source data. Cross-validation, of course, doesn't require a test dataset because it creates multiple training sets dynamically during the validation process. Cross-validation produces a report in a new workbook page that contains information similar to that produced when using cross-validation in BIDS. A portion of the output produced is shown in Figure 24-28.
We are nearly, but not yet, done with our exploration of data mining integration with Office 2007. One additional point of integration remains to be reviewed: integration between SSAS data mining and Visio 2007. This functionality is included as part of the SQL Server 2008 Data Mining Add-ins for Office 2007.
Figure 24-27 The Specify Cross-Validation Parameters page
Figure 24-28 The Cross-Validation output in Excel is similar to that produced in BIDS.
Data Mining Integration in Visio 2007
To understand the integration between SSAS 2008 data mining and Visio 2007, click Data Mining Visio Template under the SQL Server 2008 Data Mining Add-ins option on the Start menu. This opens the template, which is designed to allow you to create custom visualizations of data mining results using the three included data mining algorithm views. Note that the add-ins add this template and new menu items to Visio 2007. The options available on the new Data Mining menu are Manage Connections, Insert Decision Tree, Insert Dependency Net, Insert Cluster, Trace, and Help. You can click the Trace item to open the Tracer tool (which is identical to the one available in Excel) so that you can see the query text that Visio generates and sends to SSAS when you use any of the integration capabilities.
To start working with Visio's integration, click Manage Connections on the Data Mining menu and configure a connection to the Adventure Works DW 2008 sample. The next step is to either select one or more mining model views from the menu for insertion on your working diagram, or drag one or more shapes from the Microsoft Data Mining Shapes stencil (which contains the same three model views). After you perform either of these actions, a wizard opens. The first page of the wizard describes the purpose of the particular data mining algorithm view in nontechnical terms. The second page of the wizard asks you to select the connection to SSAS to use to retrieve the model information. The third page of the wizard (shown in Figure 24-29) lists the available mining structures and models on the SSAS instance to which you've connected. Available here means mining models that were constructed using algorithms that support the particular view type you've selected. In our example, this is the Dependency Network view.
Figure 24-29 The Visio data mining integration includes wizards to help you select the appropriate source models.
The next page of the wizard allows you to configure specific items for the particular view type selected. In our example using the Dependency Network view, this page asks you to specify the number of nodes fetched (the default is 5) and, optionally, to filter displayed nodes by using a Name Contains query. You'll also see the Advanced button, which lets you format the output for each displayed node. These options are shown in Figure 24-30.
Figure 24-30 The Dependency Net Options allow you to format the node output.
The last page of the wizard lists the tasks that will be completed (fetching the information, formatting the information, and so on) and shows you a status value as each step is performed. The output is then displayed on the Visio workspace. In addition to using the template or menu to add additional items, you can, of course, use any of Visio's other notation capabilities to further document the output. You can also use the Data Mining toolbar. Figure 24-31 shows sample output.
An interesting option available on the Data Mining toolbar is Add Items. To use this option, select any node on the diagram and then click the Add Items button. A dialog box opens that queries the mining model metadata and allows you to select additional related nodes for display on the working diagram. Particular to the Dependency Network is the Strength Of Association slider that we've seen displayed to the left in the BIDS viewer. This slider functions in the same way, allowing you to add or remove nodes based on strength of association; however, the slider is displayed to the right of the working area in Visio. One limit to the data mining visualizations in Visio is that you can include only one visualization per Visio page.
Figure 24-31 The Dependency Net view includes the Strength Of Association slider.
Next we’ll take a look at specifics associated with the Decision Tree view. To do that, create a new, blank page in Visio and then drag the Decision Tree shape from the stencil to that page. A wizard opens where you again confirm your connection. On the next page of the wizard you can choose the mining model you want to use. Next, you are presented with formatting choices that are specific to this algorithm view, as shown in Figure 24-32. Here you select the particular tree from the model. (Remember that Decision Trees models can house more than one tree in their output.) Next you select the maximum rendering depth (the default is 3) and the values and colors for rendering. As with the previous algorithm view, if you click the Advanced button on this page, you are presented with the ability to further customize the node formatting for this algorithm. On the last page, as with the previous model, you are presented with the list of tasks to be performed and their status as the wizard executes them. After all steps are completed, the model is rendered onto the blank Visio page. The output displayed shows the decision tree selected, and like any Visio output can be further formatted to your liking.
Figure 24-32 The Decision Tree view requires that you select a particular tree from a Decision Tree source mining model.
The last type of included view is Cluster. To see how this works, you create a third new page and then drag the Cluster shape onto that page. This opens the Cluster Wizard. Confirm your connection to SSAS, and on the next page, choose an appropriate mining model. On the fourth wizard page (shown in Figure 24-33), you are presented with a set of display options specific to this view. The default is to display the cluster shapes only. As an alternative to the default display you can choose to show the cluster characteristics or the discrimination chart. Just as with the other wizards, you can click the Advanced button to further define the format of the objects Visio displays.
Figure 24-33 The Cluster Wizard allows you to specify the output view type.
The last page of this wizard indicates the steps and completion status. After processing completes, the output is displayed on the Visio page. For our example, we chose Show Clusters With Characteristics Chart. The output is shown in Figure 24-34.
Figure 24-34 The Cluster view in Visio
As we end our brief tour of the included data mining shapes in Visio, we remind you that not only can you produce highly customized visualizations of the results of your processed data mining models, but you can also use VSTO to further customize the results from Visio.
Client Visualization
As we complete this chapter on using Office 2007 as a client interface for SSAS 2008 data mining, we'd like to mention a few more current and future possibilities for effectively visualizing the results of data mining models. We do this to inspire you to think creatively about how you can solve the important problem of visualization for your particular BI project. We also hope to get application developers in general thinking more about this challenge. As better visualization technologies are released, such as WPF and Silverlight from Microsoft as well as products from other vendors, we believe that more creative, effective, and elegant solutions will be in high demand.
In the short term, remember that the data viewers included with BIDS, SSMS, and Excel are available for application developers to embed in custom applications. These embeddable controls are downloadable from http://www.sqlserverdatamining.com/ssdm/Home/Downloads/tabid/60/Default.aspx. Advanced control developers can access the source code from these controls and extend it, or they can create their own controls from scratch.
It is also interesting to note that new types of development environments are being created to support new types of visualization. Microsoft's Expression suite is a set of tools aimed at visual authoring for WPF and Silverlight applications. Microsoft's Popfly is a visual programming environment available online that is itself created using Silverlight. Figure 24-35 shows a Popfly mashup (application). We can see possibilities for combining the results from data mining models with advanced (and new) types of visualization controls in both traditional applications and mashups. Access to Popfly (http://www.popfly.com) is free, and mashups can be embedded into applications.
Figure 24-35 Popfly is a Web-based integrated development environment that allows developers to create "mashed-up" visualizations.
In addition to these options, we think it's quite exciting to consider the future of data mining visualization controls. A great place to look to understand what this future holds is Microsoft Research (MSR), which has an entire division devoted to the problem of effective data visualization. MSR has many interesting projects available to review, which you can find on its main Web site at http://research.microsoft.com/vibe/. The FacetMap is one example of an interesting visualization; take a look at http://research.microsoft.com/vibe/projects/FacetMap.aspx. Many of the enhancements in SQL Server 2008 data mining originated directly from work done at MSR. You can download some data visualization controls that have been developed by MSR from http://research.microsoft.com/research/downloads/Details/dda33e92f0e8-4961-baaa-98160a006c27/Details.aspx.
Data Mining in the Cloud
During the writing of this book, Microsoft previewed Internet-hosted data mining. In August 2008, Microsoft first showed an online sample showcase. The sample includes a subset of the Table Analysis Tools included on the Excel 2007 Ribbon, as shown in Figure 24-36.
Figure 24-36 Data mining in the cloud is now available online!
The sample is available at http://www.sqlserverdatamining.com/cloud/. To test this sample, click the Try It Out In Your Browser button, and then click the Load Data button. You can either use data from the Adventure Works DW sample database or upload a .csv file to analyze. The loaded data is then displayed on the Data tab of the online application. Next you select the type of analysis to be performed by clicking one of the toolbar buttons, such as Analyze Key Influencers. You then configure any parameters required by the algorithms. Following our example using Analyze Key Influencers, you select the target column. The output is displayed on the Analysis Results tab. As of this writing, cloud-hosted data mining is in the early preview stages only. Not all features available in the Excel data mining client are supported in this online preview, and pricing has not yet been announced.
Summary
In this chapter we investigated the integration between SQL Server 2008 data mining and Office 2007 using the free Data Mining Add-ins. Specifically, we addressed initial configuration and then went on to look in depth at the Excel 2007 Table Tools Analyze and Data Mining tabs. We then looked at the Visio 2007 data mining template. We concluded by taking a brief look at other client tools and a peek at the future of data visualization. We hope we've conveyed our excitement about the possibilities that the power and usability of the end-user tools included in Office 2007 bring to using data mining in your current BI projects.
Chapter 25
SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007
The release of Microsoft Office SharePoint Server 2007 has meant many things to many people. To some, it was a development platform that embraced the proven techniques of ASP.NET 2.0; to others, it was a workflow and business process automation engine; and to still others, it was a content management system to manage and surface important content within an organization. An integral piece of the Office SharePoint Server 2007 pie was the business intelligence (BI) integration slice. Although Microsoft has included rich BI capabilities in its SQL Server line of products for the past several releases of SQL Server, the integration with SharePoint Server prior to the 2007 release was very rudimentary. With Office SharePoint Server 2007, you can now surface BI capabilities for information workers using a familiar interface.
We look specifically at the integration between SQL Server Analysis Services (SSAS) and Office SharePoint Server 2007 in this chapter. This discussion builds on our coverage in Chapters 20 through 24 of SQL Server Reporting Services (SSRS) and Microsoft Office Excel 2007 integration. In this chapter, we focus on two specific BI capabilities that Office SharePoint Server 2007 brings to the table. The first feature is Excel Services, which refers to the ability of Office SharePoint Server 2007 to enable business users to apply, in a SharePoint Web-based portal, the Excel skills they have developed over the years. The second set of features is related to SQL Server Reporting Services integration with Office SharePoint Server 2007. These features allow solution providers to surface automated SQL Server reporting information from within an Office SharePoint Server portal where business users do their work.
Excel Services We’ve emphasized that you should select a client tool that’s easy for you to use, and that this choice is critical to the success of your BI project. As you saw in Chapter 24, “Microsoft Office 2007 as a Data Mining Client,” end users often prefer to start the BI process using data stored in Excel. For many years, business users have been using Excel as their own personal data repository, and they’ll probably continue to do so. 723
Although maintaining local data might seem contrary to the goals of an enterprise BI project, we've found a close correlation between Excel use prior to a BI project and the adoption rate for Excel or Excel Services as a client to BI data stored in OLAP cubes and data mining structures after the implementation of BI projects. As capable as Excel is, it's still a thick-client, desktop-limited application. The workbooks that you or your end users author live on their desktops. When someone asks for a new version, you e-mail your workbook to them. As you might have guessed (or experienced!), this ends up creating major versioning problems. These versioning problems span both data and logic, and they can include other challenges, such as inappropriate sharing of embedded information like SQL Server connection strings (which might include passwords) or logic that is part of Visual Studio Tools for Office (VSTO) dynamic-link libraries (DLLs) within your Excel sheet. Thus, you've probably found it difficult over the years to share all the valuable insight that you have built into your Excel sheets. Excel Services addresses these problems.
Excel Services is part of the Microsoft Office SharePoint Server technology stack (Enterprise edition only). Excel Services makes it simple to use, share, secure, and manage Excel 2007 workbooks (.xlsx and .xlsb file formats only) as interactive reports that can be delivered to authorized end users either through the Web browser or by using Web services in a consistent manner throughout the enterprise. Be sure to keep in mind that only Microsoft Office SharePoint Server 2007 Enterprise edition can be used to leverage Excel Services as a BI client for your solution. To summarize, here are the important points to remember:
■■ Excel Services is available only with Excel 2007 and the new Excel 2007 formats.
■■ Excel Services will not support Excel sheets with embedded logic, such as macros, ActiveX controls, or VSTO DLLs. Only .xlsx and .xlsb files are rendered in Excel Services.
■■ Excel Services does not give you a server-side version of Excel. Instead, it gives you the capability of sharing an Excel sheet as an interactive report.
■■ This interactive report, along with all its logic, is exposed to the enterprise either through the Web browser or Web services.
■■ Charts are rendered as static images, so interactivity for SSAS purposes is limited unless you use a PivotTable.
Basic Architecture of Excel Services
The diagram shown in Figure 25-1 illustrates the major components of Excel Services. From Figure 25-1, you can tell that at the heart of Excel Services is Excel Calculation Services. Excel Calculation Services is the part of Excel Services that runs on the application server (or the server running Office SharePoint Server 2007), and it has the responsibility of loading workbooks, calculating workbooks, calling custom code as user-defined functions, and refreshing external data. In addition, Excel Calculation Services is responsible for maintaining session information for the same workbook and the same caller.
Figure 25-1 Excel Services architecture. The diagram shows the SharePoint Web front end (Excel Web Access and Excel Web Services), the SharePoint application server running Excel Calculation Services and user-defined functions, the SharePoint content database holding the Excel workbooks, and external data sources such as SQL Server.
Excel Calculation Services is also responsible for caching open Excel workbooks, calculation states, and external data query results. Because of this caching responsibility, when either of the front-end pieces (the Web service or the Web Parts) is asked to render an Excel sheet, it queries Excel Calculation Services to provide the information it needs to render. Excel Calculation Services then either presents a cached version of the Excel sheet or loads it as necessary. When Excel Calculation Services needs to load an Excel sheet, it queries the metadata for the selected sheet from the Office SharePoint Server 2007 content database and then starts processing it. The loaded workbook might have external data connections, in which case Excel Calculation Services queries the external data source and refreshes the Excel sheet data accordingly. Or the loaded Excel sheet might extend its calculation abilities by using user-defined functions (UDFs) written in .NET. Excel Calculation Services will then look for such UDFs and load and call them as necessary. When Excel Calculation Services is done processing the Excel workbook, it hands the requested information over to the Web service or Web Parts, as the case may be.
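To make the user-defined function idea concrete, the following is a minimal sketch of a managed UDF of the kind Excel Calculation Services can load; it assumes a reference to Microsoft.Office.Excel.Server.Udf.dll, and the class name, method name, and rate value are invented for illustration. The assembly must be registered in the SSP's Excel Services user-defined function assembly settings, and the workbook's trusted file location must allow UDFs, before a workbook can call it.

// Minimal sketch of an Excel Services user-defined function (UDF).
// The class name, method name, and rate below are invented for illustration.
using Microsoft.Office.Excel.Server.Udf;

[UdfClass]
public class ExpenseUdfs
{
    // Once the assembly is registered and trusted, a workbook cell can call
    // this method by name, for example =MileageCost(120).
    [UdfMethod]
    public double MileageCost(double miles)
    {
        const double ratePerMile = 0.55; // illustrative reimbursement rate
        return miles * ratePerMile;
    }
}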
Immutability of Excel Sheets
You need to understand this very important fact: Excel Calculation Services does not give you the ability to edit an Excel sheet in this release. It merely lets you view the Excel sheet in its current form through either a Web service or a Web browser. As you'll see later in this chapter, you do have the ability to change the values of certain predefined cells as parameters during your session. That ability allows you to view a new snapshot of the data without having to change the Excel sheet. You also have the ability to export that snapshot using the Web Parts UI. What you cannot do is have Excel Services save that snapshot back to the document library.
Although you might be wondering about the usefulness of this feature set because of its immutability limitation, we've actually found this restriction to be acceptable because the core business requirement of exposing an Excel sheet through the browser is often equivalent to the requirement of providing "one version of the truth." If any user connecting to the document library using a Web browser were able to edit the sheet, the sheet's credibility would become questionable. Of course, the read-only limitation does not meet the needs of all BI scenarios. However, we find that it does have its place for some clients, usually those who want to present a quick view of one or more business metrics to a large number of users.
Introductory Sample Excel Services Worksheet We’ll walk you through a couple of examples so that you can better understand the core functionality of Excel Services. In these examples, we’re intentionally focusing on the functionality of Excel Services itself, rather than trying to showcase Excel Services being used with SSAS data. To that end, we’ll just use relational source data, rather than using multidimensional or data mining structures as source data. You can, of course, use any data source type that is supported in Excel as a data source for Excel Services. We’re working with a basic server farm running Office SharePoint Server 2007 that includes the following: ■■
A Shared Service Provider (SSP) has been set up on port 4000.
■■
A central administration site has been provisioned on port 40000.
■■
A front-end Web site has been provisioned on port 80.
■■
Additions to My Sites have been configured to be created on port 4001.
The preceding port numbers and components, such as My Sites, are not necessary for Excel Services. For this example, the full version of Office SharePoint Server 2007, rather than
Chapter 25
SQL Server Business Intelligence and Microsoft Office SharePoint Server 2007
727
Windows SharePoint Services, is required. This is because only Office SharePoint Server 2007 contains the MySites template. Also, only Office SharePoint Server 2007 contains the SSP container for system settings. And, as mentioned, only Office SharePoint Server 2007 (and not Windows SharePoint Services) contains the Excel Services feature set itself. In addition to the details listed, we have created a site collection on port 80 using the blank site collection. The aim of this exercise is to author a simple Excel sheet and render it through Excel Services. The following steps lead you through this exercise: 1. Create a place in the front-end site to store Excel sheets. This is as simple as creating a document library that holds Excel 2007 sheets. The document library is called sheets. 2. Configure the SSP to allow a certain document library to be used in Excel Services. 3. Go to the SSP for your server or farm. 4. Locate the Excel Services Settings section, and click Trusted File Locations. 5. Click Add Trusted File Location. 6. Type http:///sheets as the address of the document library that will hold your Excel sheets. Also, indicate that you intend to trust it to be used with Excel Services. 7. Leave the rest of the default settings as they are, and click OK. 8. Author and publish an Excel sheet that will get rendered in Excel Services. 9. Start Excel 2007, and author a sheet as shown in Figure 25-2. Note that cell B5 is a formula.
Figure 25-2 A sample Excel sheet
10. Click the Microsoft Office Button, and then click Publish in the left pane. This opens the Publish submenu. From there, click Excel Services to publish your workbook to Excel Services, as shown in Figure 25-3. If you're unable to find an Excel Services submenu item on the Publish menu in your Excel application, chances are that you're not running Office Professional or Ultimate. In that case, you can simply upload the Excel sheet to the document library. For certain functions, such as parameterized Excel sheets, you need to have Office Professional or Ultimate. For a comparison of Office edition features, see http://office.microsoft.com/en-us/suites/FX101757671033.aspx.
Figure 25-3 Publishing to Excel Services
11. In the Save As dialog box that appears, type http://<server name> as the save location, and save the sheet to the sheets document library you created earlier. Save the sheet as MyExpenses.xlsx.
12. Next, you need to edit the front-end site so that it can display Excel sheets in the browser using out-of-the-box Web Parts.
13. Before you can use the Excel Services Web Parts, you need to enable them. You do this by enabling the Office SharePoint Server 2007 Enterprise Site Collection features in the Site Collection Features section of your port 80 site collection in Office SharePoint Server 2007 by choosing Site Actions, Site Settings, and then Site Features from the main (or top) page of your portal.
14. Browse to the port 80 site collection you created earlier, and choose Site Actions, Edit Page to edit the home page.
15. Click Add A Web Part in the left pane, and click to add the Excel Web Access Web Part. The Web Part prompts you to select a workbook. Click the link shown in the prompt to open the tool pane.
16. In the tool pane, locate the workbook text box, and enter the path to the MyExpenses.xlsx workbook you uploaded earlier in your sheets document library. You can also browse to it by clicking the button next to the text box.
17. Click OK, and then click the Exit Edit Mode link to return to the viewing mode. You should now see the workbook rendered in the browser, as shown in Figure 25-4.
Figure 25-4 The Excel sheet running under Excel Services
In this first walkthrough, we listed the steps you need to take to publish a simple read-only Excel workbook to Excel Services in Office SharePoint Server 2007. In addition to performing simple publishing tasks, you can enable parameters. We’ll take a detailed look at how to do that next.
Publishing Parameterized Excel Sheets
We find that a more common business requirement for BI projects than publishing a simple Excel workbook is the need to publish parameterized workbooks. This is easy to accomplish. The first thing you need to do is edit your Excel sheet by adding a pie chart to it, as shown in Figure 25-5. To quickly create a pie chart using source data from an open workbook, click any section (cell or group of cells) containing data or labels, click Insert on the Ribbon, and then click the Pie (chart) button to create the type of pie chart you want to add.
Figure 25-5 Adding a pie chart to your Excel sheet
Next, let’s assume you need to include information to enable users to edit the value of cell B3, which is the expense for gas, and view an updated pie chart. To provide this capability, you need to give B3 a defined name, such as GasExpense. To do this, select the B3 cell, click the Formulas tab on the Ribbon, and click the Define Name button. Define a new name as shown in Figure 25-6.
Figure 25-6 Defining a name for the cell that shows gas expense
Next, republish the Excel sheet as you did earlier (using the steps described in the previous section), with one difference. This time, click the Excel Services Options button in the Save As dialog box as shown in Figure 25-7.
Figure 25-7 The Excel Services Options button
In the Excel Services Options dialog box, click the Parameters tab, click the Add button, and select GasExpense to add it as a parameter. When you publish the sheet to Excel Services and render it in the browser, the end user will be able to change the value of the gas expense parameter using the GasExpense text box found in the Parameters task pane, as shown in Figure 25-8. To verify the functionality, you simply enter a new value for GasExpense. (In this example, we entered a lower expense of 20.) Then click the Apply button to refresh the pie chart.
As we mentioned earlier, this update does not affect the base Excel sheet stored in the document library; that sheet is immutable when using Excel Services. The user's changes, including his parameters, are lost as soon as he closes the browser or his session times out. If the user wants to export a snapshot of his changes, he can do so through the Internet Explorer toolbar by choosing Open and then Open Snapshot on the Excel menu. Click the Excel Services Options button to see what other options are available. You can choose to publish the entire workbook, specific sheets, or even individual charts and ranges.
Figure 25-8 Editing Excel Services parameters
Before we leave the subject of Excel Services core functionality, we’ll remind you that Excel supports SSAS OLAP cube source data. This, of course, can include calculations and key performance indicators (KPIs). For example, one business scenario that we’ve been asked to support is a dashboard-like display of OLAP cube KPIs via Excel Services. We’ve also been asked to simply implement Excel Services to centralize storage of key workbooks that had been previously scattered across users’ desktop computers.
Excel Services: The Web Services API
As we discussed in Chapter 23 and Chapter 24, you can use VSTO to extend Excel programmatically. You might wonder whether (and how) you can programmatically extend Excel Services. To do this, you can work with the Excel Web Services API. This allows external applications to aggregate Excel Services information over standard ASMX calls. To work with this API, you start by accessing the Web service endpoint. The ASMX endpoint is exposed at http://<server name>/_vti_bin/ExcelService.asmx. ASMXs are classic Web services that can be used with Windows Communication Foundation (WCF) using the default basicHttpBinding.
Note If you're unfamiliar with calling WCF services from a client application, see the walkthrough on MSDN titled "How to: Create a Windows Communication Foundation Client" at http://msdn.microsoft.com/en-us/library/ms733133.aspx.
As an example, the Excel sheet mentioned earlier in this section can be calculated on the server side, and a WCF proxy can be created and used as follows:

ExcelService.ExcelServiceSoapClient client = new ExcelService.ExcelServiceSoapClient();
client.ClientCredentials.Windows.AllowedImpersonationLevel =
    System.Security.Principal.TokenImpersonationLevel.Impersonation;
ExcelService.Status[] outStatus;
// Open the workbook stored in the document library and start a session.
string sessionID = client.OpenWorkbook(docLibAddress + excelFileName, "en-US", "en-US", out outStatus);
// rc is an ExcelService.RangeCoordinates instance describing the range to calculate.
outStatus = client.Calculate(sessionID, "Sheet1", rc);
// Retrieve the value of the calculated cell (row 5, column 2 in this example).
object o = client.GetCell(sessionID, "Sheet1", 5, 2, false, out outStatus);
In the preceding code snippet, docLibAddress is a string that contains the path to the document library that stores your Excel sheets, and excelFileName is a string that contains the actual file name. The important thing to remember here is that with Web services in .NET 2.0, by default your Windows identity was propagated all the way to the server running the Web service. Office SharePoint Server 2007 expects to see your Windows identity when you call a Web service. When using WCF, however, security is configurable, and it's rightfully made anonymous by default. If your business requirements are such that you'd like to revert from WCF-style authentication to ASMX-style implicit Windows authentication, you have to add the following security section to the basicHttpBinding configuration in your application's configuration file (usually named app.config):

<security mode="TransportCredentialOnly">
  <transport clientCredentialType="Ntlm" proxyCredentialType="None" realm="" />
  <message clientCredentialType="UserName" algorithmSuite="Default" />
</security>
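The same proxy can also drive the parameterized workbook from the earlier walkthrough. The following minimal sketch reuses the client, outStatus, docLibAddress, and excelFileName variables from the preceding snippet; it assumes the GasExpense defined name lives on Sheet1 of MyExpenses.xlsx, as in that example, and the new value of 20 is simply illustrative.

// Minimal sketch: set the GasExpense parameter by its defined name,
// recalculate, and read back the formula cell. Reuses variables from the
// preceding snippet; the value 20 is illustrative.
string paramSession = client.OpenWorkbook(docLibAddress + excelFileName, "en-US", "en-US", out outStatus);
outStatus = client.SetCellA1(paramSession, "Sheet1", "GasExpense", 20);
outStatus = client.CalculateWorkbook(paramSession, ExcelService.CalculateType.Recalculate);
// Read the recalculated formula cell (row 5, column 2, as in the earlier snippet).
object newTotal = client.GetCell(paramSession, "Sheet1", 5, 2, true, out outStatus);
client.CloseWorkbook(paramSession);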
A Real-World Excel Services Example
So far, the examples we've presented have been simple ones that served well in explaining the basics of Excel Services. With the basics behind us, we'll next work through a more complex example that targets the Northwind database and presents an interactive PivotTable using infrastructure provided by Excel Services. Note that you can also connect to SSAS data by clicking From Other Sources on the Data tab of the Ribbon in Excel and then clicking From Analysis Services in the drop-down list.
Note You can download the Northwind sample database from http://www.microsoft.com/downloads/details.aspx?FamilyID=06616212-0356-46A0-8DA2-EEBC53A68034&displaylang=en.
You can use the following steps to walk through this exercise:
1. Start Excel 2007, and create a new workbook.
2. Click the Data tab on the Ribbon.
3. Click From Other Sources and then click From SQL Server in the drop-down list, as shown in Figure 25-9.
Figure 25-9 Excel data sources
After you click From SQL Server, the Data Connection Wizard opens. The Connect To Data Source page asks you for the server name and login credentials. On the next page of the wizard, select the sample database and table. We are using the Northwind database, and specifically the Orders table. On the last page of the Data Connection Wizard, you need to fill in the .odc file name and, optionally, a description, a friendly name, and search keywords. To make the data connection easily shareable, save the .odc file in a Data Connection document library on the Office SharePoint Server 2007 front-end port 80 site collection. In our example, we saved the results in http://<server name>/dataconnections/NorthwindOrders.odc.
To save the .odc file in this location, click the Browse button in the File Name section of the Save Data Connection File And Finish page of the Data Connection Wizard. This opens the File Save dialog box, where you'll select the Office SharePoint Server 2007 Data Connection library location from the list on the left. If this location does not appear, you'll have to enter it manually.
Note Office SharePoint Server 2007 includes a Data Connection document library as one of the default templates in the Report Center group of templates.
After you save the data connection file, you will be prompted to import the data into your Excel sheet and the Import Data dialog box will open. It is set by default to import your data into Excel in a tabular (table) format. Choose the PivotTable Report option to import the data as a PivotTable, as shown in Figure 25-10. Then click OK.
Figure 25-10 Importing information into Excel
After you click OK, the PivotTable Field List opens. There you will be presented with a list of fields that you can add to your PivotTable by clicking the box next to each field's name. At the bottom of the PivotTable Field List, there are four areas that you can use to lay out your PivotTable data. Next, lay out your PivotTable report as shown in Figure 25-11 by clicking fields in the Choose Fields To Add To Report section of the PivotTable Field List and dragging those fields to one of the four areas at the bottom of the PivotTable Field List window: Report Filter, Column Labels, Row Labels, or Values. We've used ShipCountry as a report filter, ShipRegion as a row label, and Count Of ShippedDate as the displayed data aggregate value.
Figure 25-11 Setting up your PivotTable
After you’ve set a filter on the ShipCountry value to USA, publish the Excel sheet to Excel Services. The rendered sheet will look like Figure 25-12.
Figure 25-12 The PivotTable in a browser
After you publish your report to Office SharePoint Server 2007 Excel Services and view the page hosting the Excel Services Web Parts in a browser, you can verify that PivotTable interactivity is built right into the Web-based interface. There, you'll be able to set filters on the columns, the rows, and the ShipCountry filter variable. As we have shown, the capability to host Excel-sourced reports in a Web-based UI, with back-end data from SSAS or other data sources, is quite powerful. You will note that we did not have to write a single line of code to accomplish this task. Note also that the Excel Services interface is exposed as a Web service, so if you want to extend its capabilities programmatically, you can do so by working with the publicly exposed methods of its API.
SQL Server Reporting Services with Office SharePoint Server 2007
So far in this chapter, we've talked about Excel Services being used as a BI tool with Office SharePoint Server 2007. Although you can certainly use Excel Services as your BI user portal, the limits of Excel might not match your business requirements. Excel Services allows the business user to achieve simple tasks and manage simple sets of data. In spite of Excel's ability to connect to external data sources, Excel Services targets one specific type of client only: users who are already comfortable working in some version of Excel. This user population typically consists of business analysts. Of course, there are exceptions to this characterization; however, we most often use Excel or Excel Services as one part of our client interfaces for BI projects.
As we saw in Chapters 20 through 22, SSRS is a sophisticated and powerful end-user BI client tool. Office SharePoint Server 2007 can integrate closely with SSRS and render reports that you create in SSRS inside the information worker's portal.
Configuring SQL Server Reporting Services with Office SharePoint Server 2007
SSRS can work with Office SharePoint Server 2007 in two configuration modes: native mode or SharePoint integrated mode. In either mode, the SSRS reports are authored using the Business Intelligence Development Studio (BIDS) SSRS template or Report Builder. They are then rendered in the browser inside an Office SharePoint Server 2007 UI. It is important that you understand a bit more about the implications of using these two configuration modes as you consider using SSRS in Office SharePoint Server 2007 as a client for your BI solution. The major difference between native mode and SharePoint integrated mode is that SharePoint integrated mode lets you deploy and manage both the reports and the relevant data connections in SharePoint document libraries. This reduces the administrative overhead for your solution.
If you choose to use SharePoint integrated mode, you first need to configure SSRS to use this mode rather than the default configuration for SSRS, which is native mode. Also, to use SharePoint integrated mode, SSRS must be installed on the same physical machine as Office SharePoint Server 2007. To set up SSRS in SharePoint integrated mode, on the Start menu, click the Reporting Services Configuration Manager link. Navigate to the Web Service URL section and set up a virtual directory called ReportServer on a port other than the one Office SharePoint Server 2007 is using. This is important because SQL Server 2008 Reporting Services does not use Internet Information Services (IIS), and it cannot natively share a port with Office SharePoint Server 2007.
Note By using stsadm.exe –exclude, you can configure Office SharePoint Server 2007 to share (or exclude) specific URLs, such as that of an SSRS instance. As a best practice, we generally use separate ports when we host both SSRS and Office SharePoint Server 2007 on the same server.
Next, in the Database section, create a new database with a unique name and then select the desired mode (native or SharePoint integrated). When both Office SharePoint Server 2007 and SSRS have been selected as client tools for a BI project, we use SharePoint integrated mode more often than native mode because of the simplified administration it provides. This simplification also means fewer dedicated metadata tables for SSRS: in SharePoint integrated mode, SSRS stores its metadata in tables inside the configured Office SharePoint Server 2007 metadata databases rather than in separate SSRS-specific databases.
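If you want to confirm the configured mode programmatically, for example as part of an environment audit, the SSRS WMI provider exposes this setting. The following is a minimal sketch under stated assumptions: the instance name (MSSQLSERVER) and the DatabaseName and IsSharePointIntegrated property names are our recollection of the SQL Server 2008 WMI namespace, so verify them against your own installation before relying on the code.

// Minimal sketch: query the SSRS WMI provider for the current server mode.
// Instance name and property names are assumptions; verify on your server.
using System;
using System.Management;

class CheckReportServerMode
{
    static void Main()
    {
        ManagementScope scope = new ManagementScope(
            @"\\localhost\root\Microsoft\SqlServer\ReportServer\RS_MSSQLSERVER\v10\Admin");
        ObjectQuery query = new ObjectQuery(
            "SELECT * FROM MSReportServer_ConfigurationSetting");

        using (ManagementObjectSearcher searcher =
            new ManagementObjectSearcher(scope, query))
        {
            foreach (ManagementObject setting in searcher.Get())
            {
                Console.WriteLine("Report server database: {0}", setting["DatabaseName"]);
                Console.WriteLine("SharePoint integrated:  {0}",
                    setting["IsSharePointIntegrated"]);
            }
        }
    }
}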
The next step is to create a Report Manager URL. We created http://<servername>:10000/Reports for our example. It might not be obvious from the UI, but a Report Manager virtual directory is not created for you by default. This is indicated by the message shown in Figure 25-13. The reason for this is to give you greater control over the particular URL that is used for SSRS. After you've entered your desired URL, you must click Apply to create the Report Manager instance at this location. This last step is optional; however, we usually implement SSRS and Office SharePoint Server 2007 integration using this independent URL option for SSRS.
Figure 25-13 Configuring the Report Manager URL
In the next step, you create a simple report and deploy it so that you can see it displayed in Office SharePoint Server 2007.
Authoring and Deploying a Report

To create a sample report, open BIDS and author a new Report Server project named MyReports under the Business Intelligence Projects category, as shown in Figure 25-14. In the Shared Data Sources section, add a new data source called Northwind, and use it to connect to the Northwind database. Next, in the Reports section, right-click and choose Add, New Item, Report. Drag a Table data region from the Toolbox onto the report design surface. When prompted to provide a data source, click the link, and then choose the existing Northwind data source, as shown in Figure 25-15. Click OK.
Figure 25-14 Creating a Report Server project
Figure 25-15 Picking the proper data source
Next you need to get some data to display on your report. To do this, type a sample query (targeting the Customers table). This creates the necessary dataset. Then format the results to create a report as shown in Figure 25-16. For our example, we used the following query:

SELECT CustomerID, ContactName, Address, City, PostalCode, Country
FROM Customers
Figure 25-16 The report in design mode
For the last step, you need to specify the deployment settings. If you're using native mode, in the project settings dialog box, specify the target server URL as http://<servername>:10000/ReportServer. If you're using SharePoint integrated mode, you need to specify these values:

■ The target data source folder as a Data Connection document library on your SharePoint site
■ The target report folder as another document library on your SharePoint site
■ The target server URL as the SharePoint site itself
With the project settings established, go ahead and deploy the project by right-clicking the project name and clicking Deploy on the shortcut menu.
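Deploying from BIDS is all most projects need, but report publishing can also be scripted through the SSRS SOAP API, which is useful when reports are promoted between environments as part of a build. The following is a minimal sketch against the native-mode ReportService2005.asmx endpoint; the proxy namespace and class name, server URL, folder, and file path are assumptions for illustration. (In SharePoint integrated mode you would target the ReportService2006.asmx endpoint and pass document library URLs instead.)

// Minimal sketch: publish an .rdl file through the SSRS Web service (native mode).
// Proxy namespace/class, URLs, and paths are assumptions for illustration.
using System;
using System.IO;
using ReportServiceProxy; // generated Web reference to ReportService2005.asmx

class PublishReport
{
    static void Main()
    {
        ReportingService2005 rs = new ReportingService2005();
        rs.Url = "http://yourserver:10000/ReportServer/ReportService2005.asmx"; // assumption
        rs.Credentials = System.Net.CredentialCache.DefaultCredentials;

        // Read the report definition produced by BIDS.
        byte[] definition = File.ReadAllBytes(@"C:\MyReports\Report1.rdl");     // assumption

        // CreateReport returns any warnings raised while validating the definition.
        Warning[] warnings = rs.CreateReport("Report1", "/MyReports", true, definition, null);
        if (warnings != null)
        {
            foreach (Warning warning in warnings)
                Console.WriteLine(warning.Message);
        }
    }
}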
Using the Report in Office SharePoint Server 2007: Native Mode

If you want to use the report in native mode, accept the default SSRS configuration settings. For both SharePoint integrated mode and native mode, you'll want to display the report on a SharePoint Web page. To do this, you have several options. The simplest option is to use the Web Parts that are designed to host SSRS reports. These Web Parts ship with SQL Server 2008 and are copied locally when SQL Server is installed, but they are not enabled on a SharePoint site by default.
To make these Web Parts available for a particular SharePoint site, you must first enable the SSRS Web Parts for Office SharePoint Server 2007. These Web Parts are named Report Explorer and Report Viewer. Report Explorer returns a list of reports at a particular SSRS directory location. Report Viewer hosts the standard SSRS report viewer control in an IFrame-like container. To make these Web Parts available for authorized end users to add to pages on a SharePoint site, you (as an authorized site administrator) must run the following command line (the quotation marks are required because the path contains spaces):

stsadm.exe -o addwppack -filename "C:\Program Files\Microsoft SQL Server\100\Tools\Reporting Services\SharePoint\RSWebParts.cab"
This command activates the Web Parts. The Web Parts are included with a standard SQL Server 2008 installation, but they are not activated by default with the Enterprise edition of Office SharePoint Server 2007. End users must have appropriate permission to install (or add) any Web Parts to selected SharePoint Web site pages. Also, all Office SharePoint Server 2007 Web Parts require Code Access Security (CAS) permissions for activation. You can adjust the default settings for the CAS permissions depending on the functionality included in the SSRS reports that you plan to host on your Office SharePoint Server 2007 instance.

Note CAS in Office SharePoint Server 2007 is expressed as a collection of permission sets. Each permission set contains multiple individual permissions that are assigned to DLLs that meet certain criteria, such as a particular name and version. Permission sets include Full Trust as well as lesser (more granular) sets. You should refrain from adding Web Parts that require the Full Trust permission set because they can pose a security risk.
After you've made the SSRS Web Parts available as part of your Office SharePoint Server 2007 instance, you need to add them to a page on your portal. To do this, browse to your SharePoint Web site, and then put the selected page into edit mode. Next, add a Web Part to a selected area of the editable page. Select the Report Viewer Web Part from the Add Web Parts dialog box, as shown in Figure 25-17. Note that this dialog box also includes a Report Explorer Web Part. For the last step, you need to configure the connection from the Office SharePoint Server 2007 Report Viewer Web Part to your particular SSRS instance. To do this, you add information to the properties of the Web Part. In the properties section of the Report Viewer Web Part, set the Report Manager URL to http://<servername>:10000/Reports and the report path to /MyReports/Report1. You should now be able to view the report when running Office SharePoint Server 2007.
Figure 25-17 Report Viewer Web Part
Using the Report in Office SharePoint Server 2007: SharePoint Integrated Mode

Before you can use SSRS in SharePoint integrated mode, you first need to install the Microsoft SQL Server 2008 Reporting Services Add-in for Microsoft SharePoint Technologies on all involved Web front-end servers. SSRS can be installed on a single server, or its components can be scaled out to separate servers. After installing the Reporting Services add-in on the server running Office SharePoint Server 2007 where the front-end SSRS components are installed, you need to activate the Reporting Services Integration feature on both the Central Administration site of the Office SharePoint Server 2007 instance and the front-end Web site where you want to use the reports. In Central Administration, go to Application Management, and click Grant Database Access under Reporting Services, as shown in Figure 25-18. When you provide the relevant accounts with the appropriate database access, Office SharePoint Server 2007 prompts you to enter credentials for a user that has administrative rights on the domain. Make sure to enter the user name in the Domain\Username format.
Figure 25-18 Granting database access in Central Administration
Next, under Central Administration\Application Management\Reporting Services, click Manage Integration Settings. In the Reporting Services Integration window, provide the settings as shown in Figure 25-19.
Figure 25-19 Configuring Office SharePoint Server 2007 with SQL Server Reporting Services Integration settings
Next, in your front-end site, after having deployed the reports and activated the Reporting Services Integration feature, edit the home page and add the SQL Server Reporting Services Report Viewer Web Part, as shown in Figure 25-20. You can now configure this Web Part and point it to the report that you deployed to the document library earlier. The selected SSRS report is then rendered on the SharePoint page where you've added the Report Viewer Web Part.
Figure 25-20 Picking the SQL Server Reporting Services Report Viewer Web Part
Using the Report Center Templates

A set of template pages ships with a default Enterprise installation of Office SharePoint Server 2007. Each of these page types includes Web Parts that you might want to use to display some of your BI reports for authorized end users. The top-level page for this group is named Report Center. Report Center contains linked pages where you can store content of the following types: data connections, reports, KPIs, and more. Also, the top-level page includes custom menu items related to reports for end users, such as notifications. If you intend to use Office SharePoint Server 2007 as a host for BI reports implemented via SSRS, you might want to examine the built-in functionality in the Report Center templates to see whether it meets any of your project's business needs. Report Center includes, for example, a specialized type of SharePoint document library called a Reports Library. This template contains functionality specific to hosting *.rdl-based reports, such as integrated SSRS report uploading and display. The Office SharePoint Server 2007 Report Center also contains templates to help you quickly create BI dashboards. A BI dashboard commonly contains Web Parts that display various BI metrics. These often include KPIs (OLAP cube-based KPIs, Excel workbook-based KPIs, or KPIs hosted locally in Office SharePoint Server 2007), reports, and Excel workbooks or charts.
Although we sometimes use these templates, in our real-world experience we most often create custom pages and add the included Report Viewer and Report Explorer controls. We do this because most of our customers want greater customization of their BI portal than the Report Center templates provide. For more information on these templates, see the SQL Server Books Online topic "Understanding the Report Center and Dashboards in SharePoint 2007" at http://msdn.microsoft.com/en-us/library/bb966994.aspx.
PerformancePoint Server

Another Microsoft product that provides rich integration with SQL Server Analysis Services is PerformancePoint Server (PPS). Although in-depth coverage of PPS is beyond the scope of this book, we often choose PPS as one of our client tools in BI solutions, and we provided some coverage of writing MDX for it in Chapter 11, "Advanced MDX." We encourage you to explore online resources to investigate the integration capabilities PPS offers. A good starting point is http://office.microsoft.com/en-us/performancepoint/FX101680481033.aspx.
Summary

Microsoft SharePoint technologies have enjoyed increasingly wide adoption, driven in particular by the features added in the 2007 release. Many of our customers see the SQL Server 2008 BI integration capabilities (SSAS and SSRS) available in Office SharePoint Server 2007 as a critical driver of adoption of the SharePoint platform in their enterprise. Office SharePoint Server 2007 enables rich BI capabilities by using Excel Services or SQL Server Reporting Services, and it's reasonable to assume that you'll see further investments in this arena from both Microsoft and third-party vendors.
Index A AccessMode property, 449 account intelligence, configuring in Business Intelligence Wizard, 243, 246–247 AcquireConnections method, 581 actions, SSAS defined, 149, 233 drillthrough, 233, 236–238 regular, 233, 234–235 reporting, 233, 235–236 Add SourceSafe Database Wizard, 541–542 AddRow method, 582 administrative scripting, SSRS, 667–669 ADO.NET connection manager, 473 ADO.NET data flow destination, 485, 486 ADO.NET data flow source, 483, 497–498, 597 Agent. See SQL Server Agent Aggregate data flow transformation, 486, 487, 488 Aggregation Design Wizard, 271–273 aggregations Aggregation Design Wizard, 271–273 built-in types, 147 configuring, 262–263 creating designs manually, 277–278 defined, 9 and fact tables, 261 implementing, 270–278 key points, 271 main reason to add, 271 overview, 261–263 and query processing, 262 question of need for, 270–271 role of SQL Server Profiler, 275–277 in SQL Server cube store vs. Transact-SQL, 9 Usage-Based Optimization Wizard, 274–275 using with date functions, 324–326 viewing and refining, 262–263
Agile Software Development. See MSF (Microsoft Solution Framework) for Agile Software Development Algorithm Parameters dialog box, 367, 376–377, 378, 710 algorithms, data mining association category, 359 classification category, 358 clustering category, 359 configuring parameters, 367, 378 in data mining models, 45, 46, 46–47, 358 forecasting and regression category, 359 Microsoft Association algorithm, 391–393 Microsoft Clustering algorithm, 386–389 Microsoft Decision Trees algorithm, 381–383 Microsoft Linear Regression algorithm, 383 Microsoft Logistic Regression algorithm, 395–396 Microsoft Naïve Bayes algorithm, 376–381, 518 Microsoft Neural Network algorithm, 394–395 Microsoft Sequence Clustering algorithm, 389–390 Microsoft Time Series algorithm, 383–386 sequence analysis and prediction category, 359 supervised vs. unsupervised, 376 viewer types for, 369–370 ALTER MINING STRUCTURE (DMX) syntax, 366 Analysis Management Objects (AMOs), 31 Analysis Services. See SQL Server Analysis Services (SSAS) Analysis Services Processing task, 430, 530 analytical activities. See OLAP (online analytical processing) Ancestors MDX function, 319–320 Application class, 596, 599 applications, custom, integrating SSIS packages in, 596–600
ascmd.exe tool, 157 ASMX files, 732 assemblies compiled, using with SSAS objects, 196–197 custom, adding to SSRS reports, 647–649 custom, creating, 197 default, in SSAS, 197 association algorithms, 359 Association Wizard, 48 asynchronous data flow outputs, 459 asynchronous transformation, 583–586 attribute hierarchies, in OLAP cube design, 206–207 attribute ordering, specifying in Business Intelligence Wizard, 244, 250 attribute relationships, in BIDS, 139, 205, 207–209, 223 Audit transformation, 524 auditing. See also SQL Server Profiler added features in SQL Server 2008, 111 using SQL Server Profiler, 109–110 authentication credential flow in SSRS reports, 103 requesting access to Report Server, 610–611 AverageOfChildren aggregate function, 147
B background processing, for reports and subscriptions, 612 backups and restores overview, 106 for SQL Server Analysis Services, 106–107 for SQL Server Integration Services, 107–108, 112 for SQL Server Reporting Services, 108 Barnes and Noble, 28
BI solutions BI solutions. See also Business Intelligence Development Studio (BIDS) case studies, 27–33 common challenges, 54–56 common terminology, 11–15 complete solution components, 50–54 customizing data display in SQL Server 2008, 3 defined, 3 development productivity tips, 70 in law enforcement, 29 localization of data, 29 measuring solution ROIs, 56–58 MSF project phases, 65–71 multiple servers for solutions, 4 and Office SharePoint Server, 723–745 process and people issues, 61–83 project implementation scope, 28 query language options, 23–25 relational and non-relational data sources, 22–23 reporting interfaces, 3 role of Microsoft Excel, 36–37, 43–50 sales and marketing, 29 schema-first vs. data-first approaches to design phase, 130 security requirements for solutions, 95–106 skills necessary for projects, 72–76 software life cycle, 28 solution core components, 16–20 solution optional components, 21–23 testing project results, 70–71 top 10 scoping questions, 30 visualizing solutions, 34–36 BIDS. See Business Intelligence Development Studio (BIDS) BIDS Helper tool, 255, 490, 494, 510 Biztalk Server, 22 Boolean data type, 363 BottomCount MDX statement, 311 breakpoints, inserting, 505–506 build, defined, 259 building phase, MSF, 68–70 business intelligence (BI). See also Business Intelligence Development Studio (BIDS) case studies, 27–33 common challenges, 54–56 common terminology, 11–15
complete solution components, 50–54 customizing data display in SQL Server 2008, 3 defined, 3 development productivity tips, 70 in law enforcement, 29 localization of data, 29 measuring solution ROIs, 56–58 MSF project phases, 65–71 multiple servers for solutions, 4 and Office SharePoint Server, 723–745 process and people issues, 61–83 project implementation scope, 28 query language options, 23–25 relational and non-relational data sources, 22–23 reporting interfaces, 3 role of Microsoft Excel, 36–37, 43–50 sales and marketing, 29 schema-first vs. data-first approaches to design phase, 130 security requirements for solutions, 95–106 skills necessary for projects, 72–76 software life cycle, 28 solution core components, 16–20 solution optional components, 21–23 testing project results, 70–71 top 10 scoping questions, 30 visualizing solutions, 34–36 Business Intelligence Development Studio (BIDS). See also SQL Server Analysis Services (SSAS) BIDS Helper tool, 255, 490, 494, 510 compared with Visio for creating OLAP models, 133 as core tool for developing OLAP cubes and data mining structures, 16, 40, 157 creating new SSIS project templates by using New Project dialog box, 464–465 creating or updating SSAS objects, 186–188 creating reports, 612–622 creating SSIS packages, 463–495 data mining interface, 360–375 defined, 155 Dependency Network view, 47 deploying reports to SSRS, 624
deploying SSIS packages, 553–556 Deployment Progress window, 41 development tips, 70 disconnected instances, 259–261 Error List window, 31 exploring dimension table data, 123 exploring fact table data, 120 MDX Query Designer, 628–631 New Cube Wizard, 134 OLAP cubes, adding capabilities, 225–255 OLAP cubes, using to design, 183–223 online vs. offline mode, 184–186 opening sample databases, 39–43 overview, 183–186 as primary development tool for SSIS packages, 20, 439–440, 463–495 processing options for cubes and dimensions, 287–291 relationship to Visual Studio, 16–17, 22, 41, 463 Report Data window, 635–638 resemblance to Visual Studio interface, 157 role designer, 195–196 running on x64 systems, 91 Solution Explorer window, 40, 46, 184, 186–188 source control considerations, 113 SSRS Toolbox, 621–622, 638 working with SSAS databases in connected mode, 261 working with two instances open, 225 Business Intelligence Wizard accessing, 243 Create A Custom Member Formula, 244, 251 Define Account Intelligence, 243, 246–247 Define Currency Conversion, 244, 251–254 Define Dimension Intelligence, 243, 250 Define Semiadditive Behavior, 244, 250 Define Time Intelligence, 243, 245 Specify A Unary Operator, 244, 248–250 Specify Attribute Ordering, 244, 250 ByAccount aggregate function, 147
C cache scopes, for queries, 326 CacheMode property, 364 Calculated Columns sample package, 487 calculated measures, 148 calculated members creating in Business Intelligence Wizard, 318, 320 creating in cube designer, 239–241 creating in query designer, 631 creating using WITH MEMBER statement, 307 defined, 175, 307 global vs. local, 631 permanent, creating using BIDS interface, 334–335 permanent, creating using MDX scripts, 335–336 pros and cons, 241 vs. stored measures, 298 Calculations tab, cube designer, 201, 239–242, 334–335 Capability Maturity Model Integration (CMMI), 65 CAS (code access security), 648–649 Cash Transform data flow transformation, 486 change data capture (CDC), 524, 531 Chart control, 638, 643 checkpoints, in SSIS packages configuring, 506 defined, 506 writing, 507 child tables, relationship to parent table, 5 Children MDX function, 300, 316, 321 Choose Toolbox Items dialog box, 591 classification algorithms, 358 classification matrix, 415–416 Clean Data Wizard, 705, 706–707 cloud-hosted data mining, 720–721 Cluster DMX function, 425 Cluster Wizard, Microsoft Visio, 717–718 ClusterDistance DMX function, 425 clustering algorithms, 359 ClusterProbability DMX function, 425 CMMI (Capability Maturity Model Integration), 65 code access security (CAS), 648–649 CodePlex Web site, 37–38, 86, 157
columns in dimension tables, 121–122 in fact tables, 118–119, 146 variable-width, in data flow metadata, 456–457 command-line tools ascmd.exe tool, 157 DTEXEC utility, 440 DTEXECUI utility, 440–441 DTUTIL utility, 441 installed with SQL Server 2008, 157 rsconfig.exe tool, 604 rs.exe tool, 609 SQLPS.exe tool, 157 CommandText property, 598 community technology preview (CTP) version, SQL Server 2008, 40 ComponentMetaData property, 580 components. See also Script component compared with tasks, 444, 567–568 custom, in SSIS, 587–588 destination, 485–486, 586–587 in SSIS package data flows, 444 transformation, 486–488 Configuration Manager, Reporting Services, 102, 108, 155, 607, 609, 737 Configuration Manager, SQL Server, 94, 155, 157–158 Configuration Manager, SSRS. See Configuration Manager, Reporting Services confusion matrix. See classification matrix connection managers adding to packages, 473 ADO.NET, 473 custom, 588, 594 defined, 468 Flat File, 474 inclusion in Visual Studio package designers, 468 ODBC, 473 OLE DB, 473 overview, 448–450 Raw File, 474 specifying for log providers, 502 types, 473–474 using in Script components, 580–581 using within Visual Studio, 473–474 Connections property, 571, 580 ConnectionString property, 581
constraints. See precedence constraints containers default error handling behavior, 499 generic group, 479 SSIS control flow, 478–479 content types Continuous, 362 Cyclical, 362 defined, 361 detecting in Data Mining Wizard, 402 Discrete, 361 Discretized, 362 Key, 362 Key Sequence, 362, 363 Key Time, 362, 363 Ordered, 362 support for data types, 363 Table, 362 Continuous content type, 362 control flow designer Connection Manager window in, 468 Data Flow task, 476, 477–478 Data Profiling task, 476 defined, 468 event handling, 500–501 Execute Process sample, 476–478 Execute Process task, 476, 477 Execute SQL tasks, 476, 476–477, 494 Foreach Loop containers, 476, 478 For Loop containers, 478 Sequence containers, 478 Task Host containers, 478 task overview, 476–478 Toolbox window in, 469 control flow, in SSIS packages building custom tasks, 591–593 configuring task precedence, 480–481 container types, 478–479 Data Profiling task, 510–513 event handling, 450–451 logging events, 504 Lookup sample, 528 overview, 442–444 Script task, 567–568 copying SSIS packages to deploy, 552–553 Count aggregate function, 147 counters. See performance counters Create A Custom Member Formula, Business Intelligence Wizard, 244, 251 CREATE KPI statement, 348
CreateNewOutputRows method CreateNewOutputRows method, 582 CRISP-DM life cycle model, 399–400, 409 cross validation, 417–418 CTP (community technology preview) version, SQL Server 2008, 40 cube browser, 41–42, 201 cube designer accessing Business Intelligence Wizard, 243 Actions tab, 201, 233–239 Aggregations tab, 201, 262–263, 275 Browser tab, 41–42, 201 Calculations tab, 201, 239–242, 334–335 Cube Structure tab, 201, 201–203 description, 201 Dimension Usage tab, 126–128, 134–135, 211–212, 215 KPIs tab, 201, 228, 345 opening dimension editor, 203–204 Partitions tab, 201, 264, 278 Perspectives tab, 201, 227 tool for building OLAP cubes, 198–204 Translations tab, 201 cube partitions defined, 263 defining, 265–266 enabling writeback, 285–286 overview, 263–264 for relational data, 268–269 remote, 270 specifying local vs. remote, 270 in star schema source tables, 268–269 storage modes, 270 and updates, 532 Cube Wizard building first OLAP cube, 218–223 Create An Empty Cube option, 199 Generate Tables In The Data Source option, 199, 200 launching from Solution Explorer, 218 populating Dimension Usage tab, 128 Use Existing Tables option, 198–199, 200 CUBEKPIMEMBER OLAP function, 683 CUBEMEMBER OLAP function, 683 CUBEMEMBERPROPERTY OLAP function, 683
CUBERANKEDMEMBER OLAP function, 683 cubes, OLAP adding aggregations, 263 assessing source data quality, 516–518 background, 13 as BI data structure, 13 BIDS browser, 41–42, 201 building in BIDS, 198–204 building prototypes, 50 building sample using Adventure Works, 37–39 configuring properties, 243–254 connecting to sample using Microsoft Excel, 43–45 as core of SQL Server BI projects, 115 core tools for development, 157 creating empty structures, 133 as data marts, 13 data vs. metadata, 258 in data warehouses, 11 defined, 9 and denormalization concept, 125 vs. denormalized relational data stores, 10 deploying, 254–255, 260–261 designing by using BIDS, 183–223 dimensions overview, 9–10, 204–210, 257–258 fact (measure) modeling, 146–147 first, building in BIDS, 218–223 Microsoft Excel as client, 671–684 modeling logical design concepts, 115–150 vs. OLTP data sources, 54 opening sample in Business Intelligence Development Studio, 39–43 overview of source data options, 115–116 partitioning data, 263–270 as pivot tables, 10–11 pivoting in BIDS browser, 42 presenting dimensional data, 138–142 processing options, 287–291 and ROI of BI solutions, 56–58 skills needed for building, 72, 74 star schema source data models, 116–125 UDM modeling, 9–10 updating, 530–533 using dimensions, 210–217 viewing by using SSMS Object Browser, 164–168 visualizing screen view, 10–11
CUBESET OLAP function, 683 CUBESETCOUNT OLAP function, 683 CUBEVALUE OLAP function, 683 currency conversions, configuring in Business Intelligence Wizard, 244, 251–254 CurrentMember MDX function, 232, 313 custom applications, integrating SSIS packages in, 596–600 custom foreach enumerators, 594–595 custom member formulas, creating in Business Intelligence Wizard, 244, 251 custom SSIS objects control flow tasks, 591–593 data flow components, 588–591, 593–594 deploying, 589–591 implementing user interfaces, 593, 594 overview, 587–588 registering assemblies in GAC, 590 signing assemblies, 589 customer relationship management (CRM) projects, skills needed for reporting, 75 Cyclical content type, 362
D Data Connection Wizard, 672 data dictionaries, 67 data flow designer advanced edit mode, 484 Calculated Columns sample package, 487 Connection Manager window in, 468 debugging Script components, 587 defined, 468 destination components, 485–486 error handling, 497–498 Execute Process sample, 482 overview, 482–483 paths, defined, 484 separate from control flow designer, 478 source components, 483–485 specifying Script components, 573–581 and SSIS data viewer capability, 488–489 Toolbox window in, 469 transformation components, 486–488
data regions, SSRS data flow engine, in SSIS asynchronous outputs, 459 basic tasks, 453–454 memory buffers, 454 metadata characteristics, 454–458 overview, 453–454 synchronous outputs, 458, 459 variable-width column issue, 456–457 data flow, in SSIS packages. See also data flow engine, in SSIS asynchronous component outputs, 459 custom components, 588–591, 593–594 error handling, 451–452 logging events, 504 Lookup sample, 529–530 overview, 444 Script component, 567–568 synchronous component outputs, 458, 459 data flow maps, 525 Data Flow task, 476, 477–478, 482 data lineage, 524 data marts, 13 data mining adding data mining models to structures using BIDS, 404–406 algorithms. See algorithms, data mining ALTER MINING STRUCTURE syntax, 366 Attribute Characteristics view, 378, 379 Attribute Discrimination view, 378, 380 Attribute Profiles view, 378 background, 14 BIDS model visualizers, 46–47 BIDS visualizer for, 18 building objects, 407 building prototype model, 50 building structures using BIDS, 401–404 cloud-hosted, 720–721 compared with OLAP cubes, 14, 396 content types, 361–363 core tools for development, 157 creating structures by opening new Analysis Services project in BIDS, 401–404 creating structures using SSAS, 18 data types, 361–363 defined, 14 Dependency Network view, 370–371, 378
Distribution property, 363 DMX query language, 24, 179–180 end-user client applications, 431 feature selection, 377–381 future for client visualization controls, 720 Generic Content Tree viewer, 371–372 getting started, 396 implementing structures, 399–431 importance of including functionality, 53–54 initial loading of structures and models, 533–534 installing add-ins to Microsoft Office 2007, 687–688 Microsoft Cluster viewer, 372 Microsoft Excel and end-user viewer, 356 Microsoft Office 2007 as client, 687–721 model viewers in BIDS, 46 Modeling Flags property, 363 object processing, 429–431 OLAP cubes vs. relational source data, 401 prediction queries, 419–426 problem of too much data, 55 processing models/objects, 407–409 queries, 535–537 Relationship property, 363 and ROI of BI solutions, 57–58 role of Microsoft Excel add-ins, 45–47 sample Adventure Works cube, 357 sample Adventure Works structure, 46 skills needed for building structures, 72, 74 software development life cycle for BI projects, 69 SQL Server Analysis Services 2008 enhancements, 148–149 and SQL Server Integration Services, 426–428 tools for object processing, 429–431 validating models, 409–418 viewing sample structures in BIDS, 43 viewing structures by using SSMS viewers, 164, 168–170 viewing structures using Microsoft Excel, 47–50 Data Mining Add-ins for Office 2007
downloading and installing, 47–48, 94 installing, 687–688 Table Analysis Tools group, 690–700 Data Mining Advanced Query Editor, 704 Data Mining Extensions. See DMX (Data Mining Extensions) query language Data Mining Model Training destination, 485, 534–535 Data Mining Query control flow task, 535–536 Data Mining Query data flow component, 427, 428 Data Mining Query Task Editor, 427–428, 536 Data Mining Query transformation component, 487, 536–537 data mining structure designer choosing data mining model, 365–368 handling nested tables, 364, 366 Mining Accuracy Chart tab, 360, 373–375, 417 Mining Model Prediction tab, 360, 375, 419, 424 Mining Model Viewer tab, 46, 360, 368–373, 408, 409 Mining Models tab, 360, 365–368, 404, 404–405 Mining Structure tab, 360, 364–365, 404 viewing source data, 364 Data Mining tab, Microsoft Excel Accuracy And Validation group, 712 comparison with Table Tools Analyze tab, 700–701 Data Modeling group, 708–712 Data Preparation group, 705–708 Management group, 701–702 Model Usage group, 702–705 Data Mining Wizard, 401–404, 405–406 Data Profiling task defined, 510 limitations, 512 list of available profiles, 512–513 new in SQL Server 2008, 478 profiling multiple tables, 513 viewing output file, 513 when to use, 510 data regions, SSRS, defined, 638–655 Tablix data region, defined, 639–642
Data Source Properties dialog box Data Source Properties dialog box, 614–616 data source views (DSVs) compared with relational views, 161 creating, 199, 201 defined, 190 examining, 190–192 getting started, 195 making changes to existing tables, 193–194 making changes to metadata, 192–193 overview, 190–191 as quick way to assess data quality, 518 required for building OLAP cubes, 199 in SSIS, 466, 467 data storage containers, skills for building, 72 data stores, OLAP. See also cubes, OLAP denormalized, 8 as source for decision support systems, 13–14 data stores, OLTP query challenges, 6–7 reasons for normalizing, 5–6 relational vs. non-relational, 5 Data Transformation Services (DTS) comparison with SSIS, 446, 463–464 relationship to SSIS, 437–438 in SQL Server 2000, 546 data types Boolean, 363 content types supported, 363 date, 363 defined, 361 detecting in Data Mining Wizard, 403 double, 363 long, 363 text, 363 data viewers, SSIS, 488–489, 506 data visualization group, Microsoft Research, 34, 83, 720 data vs. metadata, 258 data warehouses background, 12 compared with OLAP, 12 data marts as subset, 13 defined, 11 Microsoft internal, case study, 28 database snapshots, 507
DatabaseIntegratedSecurity property, 668 data-first approach to BI design, 130 DataReader destination, 485 DataReader object, 598 dataset designer, 618 Dataset Properties dialog box, 618 date data type, 363 date functions, 321–326 debugging SSIS packages, 471–472, 505–506 SSIS script tasks, 572 using data viewers, 488–489, 506 decision support systems, 13–14 decision tables fast load technique, 528 loading source data into, 525–526 Decision Tree view, Microsoft Visio, 716 Declarative Management Framework (DMF) policies, 95 DefaultEvents class, 596 Define Account Intelligence, Business Intelligence Wizard, 243, 246–247 Define Currency Conversion, Business Intelligence Wizard, 244, 251–254 Define Dimension Intelligence, Business Intelligence Wizard, 243, 250 Define Relationships dialog box, 212–213, 215–216, 216 Define Semiadditive Behavior, Business Intelligence Wizard, 244, 250 Define Time Intelligence, Business Intelligence Wizard, 243, 245 degenerate dimension, in fact tables, 119 denormalization and OLAP cube structure, 125 in OLAP data stores, 8 Dependency Network view, Microsoft Visio, 714–715 Deploy option, BIDS Solution Explorer, 260 deploying code for custom objects, 589–591 reports to SSRS, 623–624 role and responsibility of release/ operations managers, 83 SSIS packages, 441, 461–462, 546–558 Deployment Progress window, 41 Deployment Utility, 556–558 Deployment Wizard, 155 derived measures, 148
Descendants MDX function, 318–319, 321 Description SSIS variable property, 491 destination components data flow designer, 485–486 Script-type, 586–587 developers IT administrators vs. traditional developers, 81 keeping role separate from tester role, 81 manager’s role and responsibility on development teams, 79–81 responsibility for performing administrative tasks, 160, 181 SSAS graphical user interface for creating objects, 154 types needed for BI projects, 80 development teams forming for BI projects, 76–83 optional project skills, 74–76 required project skills, 72–74 role and responsibility of developer manager, 79–81 role and responsibility of product manager, 78 role and responsibility of program manager, 79 role and responsibility of project architect, 78–79 role and responsibility of release/ operations manager, 83 role and responsibility of test manager, 81–82 role and responsibility of user experience manager, 82–83 roles and responsibilities for working with MSF, 76–83 source control considerations, 111–113 development tools, conducting baseline survey, 86 deviation analysis, 360 DimCustomer dimension table example, 122, 123 DimCustomer snowflake schema example, 134 dimension designer accessing Business Intelligence Wizard, 243 Attribute Relationships tab, 141 Dimension Structure tab, 139 dimension editor Attribute Relationships tab, 205, 207–209, 223 Browser tab, 205 Dimension Structure tab, 205–207
event handler designer opening from cube designer, 203–204 overview, 205 Translations tab, 205, 209 dimension intelligence, configuring in Business Intelligence Wizard, 243, 250 Dimension Processing destination, 485 dimension structures, defined, 139 dimension tables data vs. metadata, 258 DimCustomer example, 122, 123 exploring source table data, 123 as first design step in OLAP modeling, 131 generating new keys, 122, 146 pivot table view, 123 rapidly changing dimensions, 144 slowly changing dimensions, 142–144 space issue, 124 for star schema, 117–118, 121–125, 194 table view, 123 types of columns, 121–122 updating, 532–533 Dimension Usage tab, in cube designer options, 211–212 for snowflake schema, 135, 215 for star schema, 126–127, 134–135 using Cube Wizard to populate, 128 dimensions adding attributes, 222–223 adding to OLAP cube design using Cube Wizard, 220–221 combining with measures to build OLAP cubes, 210–214 configuring properties, 243–254 creating using New Dimension Wizard, 221 data vs. metadata, 258 enabling writeback for partitions, 285–286 hierarchy building, 138–139 non-star designs, 215–217 presenting in OLAP cubes, 138–142 processing options, 287–291 querying of properties, 329–332 rapidly changing, 144, 284 relationship to cubes, 257–258 role in simplifying OLAP cube complexity, 204–205 slowly changing, 142–144
as starting point for designing and building cubes, 204–205 Unified Dimensional Model, 138 writeback capability, 145 disconnected BIDS instances, 259–261 Discrete content type, 361 Discretized content type, 362 Distinct Count aggregate function, 147 Distribution property, 363 .dll files, 590 DMF (Declarative Management Framework) policies, 95 DMX (Data Mining Extensions) query language adding query parameter to designer, 634–635 ALTER MINING STRUCTURE syntax, 366 background, 24 building prediction queries, 419–421 Cluster function, 425 ClusterDistance function, 425 ClusterProbability function, 425 defined, 24 designer, defined, 627 designer, overview, 617 including query execution in SSIS packages, 535–537 Predict function, 421, 423–424, 425 PredictHistogram function, 425 prediction functions, 423–426 PREDICTION JOIN syntax, 422, 427 prediction query overview, 421–423 PredictProbability function, 425 PredictProbabilityStDev function, 425 PredictProbabilityVar function, 425 PredictStDev function, 425 PredictSupport function, 425 PredictTimeSeries function, 425 PredictVariance function, 425 queries in SSIS, 426–428 querying Targeted Mailing structure, 633–635 RangeMax function, 425 RangeMid function, 425 RangeMin function, 425 switching from MDX designer to DMX designer, 633 templates, 179–180 ways to implement queries, 426
domain controllers, conducting baseline survey, 85 double data type, 363 drillthrough actions, 233, 236–238, 372–373 DROP KPI statement, 348 .ds files, 108 .dsv files, 108 DSVs. See data source views (DSVs) DTEXEC utility, 440 DTEXECUI utility, 440–441 DTLoggedExec tool, 600 DTS. See Data Transformation Services (DTS) Dts object, 571–572 DtsConnection class, 597 .dtsx files, 442, 540, 547 DTUTIL utility, 441, 558
E Enable Writeback dialog box, 286 end users decision support systems for communities, 13–14 reporting interface considerations, 51 viewing BI from their perspective, 31–50 Enterprise Manager. See SQL Server Enterprise Manager enumerators, custom, 588, 594–595 envisioning phase, MSF, 65–67 Error and Usage Reporting tool, 155 error conditions, in BIDS, 259 error handling, in SSIS, 451–452, 497–499 ETL (extract, transform, and load) systems background, 14–15 as BI tool, 515–516 defined, 14–15 importance of SSIS, 52–53 for loading data mining models, 533–537 for loading OLAP cubes, 516–530 security considerations, 97–98 skills needed, 73, 76 SSIS as platform for, 435–462 for updating OLAP cubes, 530–533 EvaluateAsExpression SSIS variable property, 491 event handler designer Connection Manager window in, 468 defined, 468 Toolbox window in, 469
event handling, in SSIS event handling, in SSIS, 450–451, 499–501 events, logging, 501–505 Excel. See also Excel Services adding Data Mining tab to Ribbon, 47–48, 50 adding Table Tools Analyze tab to Ribbon, 47–48, 419 Associate sample workbook, 48–49 as client for SSAS data mining structures, 101 as client for SSAS OLAP cubes, 100–101 configuring session-specific connection information, 101 connecting to sample SSAS OLAP cubes, 43–45 creating sample PivotChart, 679–680 creating sample PivotTable, 678–679 Data Connection Wizard, 672 Data Mining add-ins, 73, 361, 368, 419 data mining integration functionality, 689–690 Data Mining tab functionality, 700–712 Dependency Network view for Microsoft Association, 48–49 extending, 683–684 Import Data dialog box, 674 Offline OLAP Wizard, 681–683 as OLAP cube client, 671–684 OLAP functions, 683 as optional component for BI solutions, 21 PivotTable interface, 675–677 PivotTables as interface for OLAP cubes, 10–11 popularity as BI client tool, 723–724 Prediction Calculator, 419 role in understanding data mining, 45–47 security for SSAS objects, 100–101 skills needed for reporting, 73, 75 trace utility, 101 viewing data mining structures, 47–50 viewing SSRS reports in, 649–650 Excel Calculation Services, 724 Excel data flow destination, 485 Excel data flow source, 483 Excel Services. See also Excel basic architecture, 724–725 complex example, 733–736
extending programmatically, 732–736 immutability of Excel sheets, 726 overview, 724 publishing parameterized Excel sheets, 729–732 sample worksheets, 726–729 and Web Services API, 732–736 ExclusionGroup property, 579 Execute method, 592, 596, 597 Execute Package task, 478 Execute Process sample control flow tasks in, 476–478 data flow designer, for SSIS package, 482–483 installing, 474–475 Execute SQL tasks, 476, 476–477, 494 ExecuteReader method, 597, 598 Explore Data Wizard, 705–706 Expression And Configuration Highlighter, 494 Expression dialog box, 637 Expression SSIS variable property, 491–492 expressions adding to constraints, 480–481 in dataset filters, 637 in SSIS, 447, 493–494 Expressions List window, 494 extracting. See ETL (extract, transform, and load) systems extraction history, 524
F fact columns, in fact tables, 118–119, 146 fact dimension (schema), 216 fact modeling, 146–147 fact rows, in fact tables, 261 fact tables in Adventure Works cube, 211 data vs. metadata, 258 degenerate dimension, 119 exploring source table data, 120, 146–147 FactResellerSales example, 119 fast load technique, 527, 528 loading initial data into, 527–530 multiple-source, 211 OLAP model design example, 131–132 pivot table view, 120 for star schema, 117–118, 118–121 storage space concern, 121 types of columns, 118 updating, 532
FactResellerSales fact table example, 119 fast load technique for loading initial data into dimension tables, 528 for loading initial data into fact tables, 527 for updating fact tables, 532 File deployment, for SSIS packages, 547 file servers, conducting baseline survey, 85 files-only installation, SSRS, 607 Filter MDX function, 305–307, 308 filtering creating filters on datasets, 637 in data mining models, 366 source data for data mining models, 404–405 firewalls conducting baseline survey, 85 security breaches inside, 97 First (Last) NonEmpty aggregate function, 147 FirstChild aggregate function, 147 Flat File connection manager, 474 Flat File data flow destination, 485 Flat File data flow source, 483 For Loop containers, SSIS, 478 foreach enumerators, custom, 594–595 Foreach Loop containers, 476, 478 ForEachEnumerator class, 594 forecasting algorithms, 359 FREEZE keyword, 342 functions, 299–307, 326 Fuzzy Grouping transformation, 517–518 Fuzzy Lookup transformation, 521
G Gauge control, 622, 638 GetEnumerator method, 594 global assembly cache (GAC), registering assembly files, 590 grain statements, 128–129 granularity, defined, 128 GUI (graphic user interface), SQL Server 2008 need for developers to master, 69–70 for SSAS developers, 154
H Head MDX function, 316 health care industry, business intelligence solutions, 27–28 hierarchical MDX functions, 316–320 HOLAP (hybrid OLAP), 267, 279 holdout test sets, 403 HTTP listener, 607, 607–609
templates for, 229–232 viewing in Adventure Works cube, 228, 229 Key Sequence content type, 362, 363 Key Time content type, 362, 363 Kimball, Ralph, 12 KPIs. See key performance indicators (KPIs)
I
L
IDBCommand interface, 597 IDBConnection interface, 597 IDBDataParameter interface, 597 IIf MDX function, 337–338 IIS (Internet Information Services) conducting baseline survey, 86 not an SSRS requirement, 606, 608, 610 noting version differences, 86 Image data region, 638–655 Import Data dialog box, 674 Import/Export Wizard defined, 155 role in developing SSIS packages, 439 Inmon, Bill, 12 Input0_ProcessInputRow method, 577, 578 Integration Services. See SQL Server Integration Services (SSIS) IntelliSense, 633, 637 Internet Information Services conducting baseline survey, 86 not an SSRS requirement, 606, 608, 610 noting version differences, 86 iteration in BI projects, 62 in OLAP modeling, 132
LastChild aggregate function, 147 LastChild MDX function, 314, 324 LastPeriods MDX function, 314, 324, 325 law enforcement, business intelligence solutions, 29 least-privilege security accessing source data by using, 96–97, 190 configuring logon accounts, 98 when to use, 70 life cycle. See software development life cycle lift charts, 410–413 linked objects, 285 list report item, 638–655 load balancing, 270 Load method, 596 loading. See ETL (extract, transform, and load) systems local processing mode, 653, 657 localization, 29 Log Events window, 503 log locations, 502–503 log providers custom, 588, 594 overview, 459–460 specifying connection manager, 502 logging for package execution, 501–505 question of how much, 504–505 SSIS log providers, 459–460, 502 viewing results, 503 logical modeling, OLAP design concepts, 115–150 logical servers and services conducting baseline survey, 86 considerations, 92–94 service baseline considerations, 94 long data type, 363 Lookup data flow transformation, 486 Lookup sample, using SSIS to load dimension and fact tables, 528–530
K key columns, in fact tables, 118 Key content type, 362 key performance indicators (KPIs) accessing from KPIs tab in cube designer, 201, 228, 345 client-based vs. server-based, 232 core metrics, 229 creating, 345–349 customizing, 231–232 defined, 15 defining important metrics, 55 metadata browser for, 229–231 nesting, 229 overview, 149, 228–233
M Maintenance Plan Tasks, SSIS, 471 many-to-many dimension (schema), 216–217 Matrix data region, 638–655 Max aggregate function, 147 MDX (Multidimensional Expressions) query language Ancestors function, 319–320 background, 23 BottomCount statement, 311 Children function, 300, 316, 321 core syntax, 296–305 creating calculated members, 307 creating named sets, 308, 338–340 creating objects by using scripts, 309 creating permanent calculated members, 333–336 in cube designer Calculations tab, 240, 241–242 CurrentMember function, 232, 313 date functions, 321–326 defined, 23 Descendants function, 318–319, 321 designer included in Report Builder, 644 Filter function, 305–307, 308 functions, 299–307, 326 Head function, 316 hierarchical functions, 316–320 IIf function, 337–338 Internet Sales example, 295, 297–299 and key performance indicators, 232 LastChild function, 314, 324 LastPeriods function, 314, 324, 325 Members function, 299–300, 308 native vs. generated, 225 object names, 296 opening query designer, 628 OpeningPeriod function, 322–323 operators, 297 Order function, 302–303 ParallelPeriod function, 232, 322 Parent function, 317–318 PeriodsToDate function, 333 query basics, 295 query designer, 617 query designer, defined, 627 query designer, overview, 628–631 query templates, 175–178 querying dimension properties, 329–332
MDX IntelliSense MDX, continued Rank function, 312–314 SCOPE keyword, 246 scripts, 341–343 setting parameters in queries, 631–633 Siblings function, 317 Tail function, 315, 330–331 TopCount function, 310 Union function, 320 using with PerformacePoint Server 2007, 352–354 using with SQL Server Reporting Services (SSRS), 349–351 warning about deceptive simplicity, 239 working in designer manual (query) mode, 629, 630–631 working in designer visual (design) mode, 629–631 Ytd function, 294 MDX IntelliSense, 633 measure columns, in fact tables, 118–119 Measure Group Bindings dialog box, 213–214 Measure Group Storage Settings dialog box, 278–279 measure groups in Adventure Works cube, 211 creating measures, 211 defined, 211 defining relationship to dimension data, 212–213 enabling writeback for partitions, 286 how they work, 211 relationship to source fact tables, 146–147 selecting for OLAP cubes, 219 measure modeling, 146–147 measures, calculated compared with derived, 148 Members MDX function, 299–300, 308 memory management, and SSRS, 665–666 metadata, data flow characteristics, 454–458 how SSIS uses, 458 variable-width column issue, 456–457 metadata vs. data, 258 Microsoft Association algorithm, 391–393, 400 Microsoft Baseline Security Analyzer, 95 Microsoft Biztalk Server, 22
Microsoft Clustering algorithm, 386–389, 400 Microsoft Decision Trees algorithm in classification matrix example, 415 defined, 400 in lift chart example, 412–413 overview, 381–383 in profit chart example, 414 for quick assessment of source data quality, 518 viewers for, 369–370 Microsoft Distributed Transaction Coordinator (MS-DTC), 508 Microsoft Dynamics, 22, 75 Microsoft Excel. See Excel Microsoft Excel Services. See Excel Services Microsoft Linear Regression algorithm, 383, 400 Microsoft Logistic Regression algorithm, 395–396, 400 Microsoft Naïve Bayes algorithm, 376–381, 399, 518 Microsoft Neural Network algorithm, 394–395, 400 Microsoft Office 2007 as data mining client, 687–721 installing Data Mining Add-ins, 687–688 optional components for BI solutions, 21–22 Microsoft Office SharePoint Server 2007. See Office SharePoint Server 2007 Microsoft PerformancePoint Server (PPS) integration with SQL Server Analysis Services, 745 as optional component for BI solutions, 22 skills needed for reporting, 75 using MDX with, 352–354 Microsoft Project Server, 22 Microsoft Research, 34, 83, 478, 720 Microsoft Security Assessment Tool, 95 Microsoft Sequence Clustering algorithm, 389–390, 400 Microsoft Solutions Framework (MSF). See also MSF (Microsoft Solutions Framework) for Agile Software Development Agile Software Development version, 63–65 alternatives to, 62 building phase, 68–70 defined, 62
deploying phase, 71 development team roles and responsibilities, 76–83 envisioning phase, 65–67 milestones, 62 planning phase, 67–68 project phases, 62, 65–71 role of iteration, 62 spiral method, 62, 64 stabilizing phase, 70–71 Microsoft SQL Server 2008. See SQL Server 2008 Microsoft SQL Server 2008, Enterprise edition. See SQL Server 2008, Enterprise edition Microsoft Time Series algorithm, 383–386, 400 Microsoft Visio. See Visio Microsoft Visual Studio. See Visual Studio Microsoft Visual Studio Team System (VSTS) integrating MSF Agile into, 64–65 reasons to consider, 22, 546 Team Foundation Server, 111–112 Microsoft Visual Studio Tools for Applications (VSTA) debugging scripts, 572 defined, 568 writing Script component code, 577–582 writing scripts, 570–572 Microsoft Visual Studio Tools for the Microsoft Office System (VSTO), 683–684 Microsoft Word as optional component for BI solutions, 21 viewing SSRS reports in, 649–650 milestones, in Microsoft Solutions Framework (MSF), 62 Min aggregate function, 147 mining model viewers, 46, 48–49, 356, 368, 408 mining structure designer choosing data mining model, 365–368 handling nested tables, 364, 366 Mining Accuracy Chart tab, 360, 373–375, 417 Mining Model Prediction tab, 360, 375, 419, 424 Mining Model Viewer tab, 46, 360, 368–373, 408, 409 Mining Models tab, 360, 365–368, 404, 404–405
OLAP cubes Mining Structure tab, 360, 364–365, 404 viewing source data, 364 Model Designer, SSRS, backing up files, 108 model training. See Data Mining Model Training destination modeling logical modeling, OLAP design concepts, 115–150 OLAP. See OLAP modeling OLTP modeling, 115, 137–138 physical, for business intelligence solutions, 4 Modeling Flags property, 363 MOLAP (multidimensional OLAP), 267, 279 MS-DTC (Microsoft Distributed Transaction Coordinator), 508 MsDtsSrvr.ini.xml, SSIS configuration file, 112 MSF (Microsoft Solutions Framework) for Agile Software Development background, 63–65 built into Microsoft Visual Studio Team System, 64–65 defined, 63 development team roles and responsibilities, 77 project phases, 65–71 suitability for BI projects, 64 MSF (Microsoft Solutions Framework) for Capability Maturity Model Integration (CMMI), 65 .msi files, 38 MSReportServer_ ConfigurationSetting class, 668 MSReportServer_Instance class, 668 multidimensional data stores. See cubes, OLAP Multidimensional Expressions. See MDX (Multidimensional Expressions) query language
N Name SSIS variable property, 492 named sets, 241, 308, 338–340 Namespace SSIS variable property, 492 natural language, 67 .NET API application comparison with SSIS projects in Visual Studio, 467 and compiled assemblies, 196–197
developer skills, 80, 81 skills needed for custom client reporting, 75 for SQL Server Integration Services, 442 using code in SSRS reports, 647–649 using to develop custom SSIS objects, 587–588, 588 network interface cards (NICs), conducting baseline survey, 85 New Cube Wizard, 134 New Table Or Matrix Wizard, 644–646 non-relational data, defined, 5 normalization implementing in relational data stores, 5 reasons for using, 6–7 view of OLTP database example, 5
O Object Explorer defined, 39 viewing SSAS objects from SSMS, 160 viewing SSIS objects from SSMS, 438, 439 object viewers, 164 ODBC connection manager, 473 Office 2007 as data mining client, 687–721 installing Data Mining Add-ins, 687–688 optional components for BI solutions, 21–22 Office SharePoint Server 2007 configuration modes for working with SSRS, 737–738 integrated mode, installing SSRS add-in for, 742–743 native mode, integration of SSRS with, 740–741 as optional component for BI solutions, 21–22 Report Center, 744–745 skills needed for reporting, 75 SQL Server business intelligence and, 723–745 SSRS and, 604, 736–745 template pages, 744–745 Windows SharePoint Services, 94 Offline OLAP Wizard, 681–683 OLAP (online analytical processing) characteristics, 8 compared with data mining, 14
compared with data warehousing, 12 defined, 8 Microsoft Excel functions, 683 modeled as denormalized, 8 when to use, 8 working offline in Microsoft Excel, 681–683 OLAP cubes adding aggregations, 263 assessing source data quality, 516–518 background, 13 as BI data structure, 13 BIDS browser, 41–42, 201 building in BIDS, 198–204 building prototypes, 50 building sample using Adventure Works, 37–39 configuring properties, 243–254 connecting to sample using Microsoft Excel, 43–45 as core of SQL Server BI projects, 115 core tools for development, 157 creating empty structures, 133 as data marts, 13 data vs. metadata, 258 in data warehouses, 11 defined, 9 and denormalization concept, 125 vs. denormalized relational data stores, 10 deploying, 254–255, 260–261 designing by using BIDS, 183–223 dimensions overview, 9–10, 204–210, 257–258 fact (measure) modeling, 146–147 first, building in BIDS, 218–223 Microsoft Excel as client, 671–684 modeling logical design concepts, 115–150 vs. OLTP data sources, 54 opening sample in Business Intelligence Development Studio, 39–43 overview of source data options, 115–116 partitioning data, 263–270 as pivot tables, 10–11 pivoting in BIDS browser, 42 presenting dimensional data, 138–142 processing options, 287–291 and ROI of BI solutions, 56–58 skills needed for building, 72, 74 star schema source data models, 116–125
OLAP modeling OLAP cubes, continued UDM modeling, 9–10 updating, 530–533 using dimensions, 210–217 viewing by using SSMS Object Browser, 164–168 visualizing screen view, 10–11 OLAP modeling compared with OLTP modeling, 115 compared with views against OLTP sources, 137–138 as iterative process, 132 naming conventions, 150 naming objects, 132 need for source control, 132 role of grain statements, 128–129 tools for creating models, 130–132, 149–150 using Visio 2007 to create models, 130–132, 133 OLE DB connection manager, 473 OLE DB data flow destination, 485 OLE DB data flow source, 483, 484 OLTP (online transactional processing) characteristics, 6 defined, 5 normalizing data stores, 5 querying data stores, 6–7 OLTP modeling compared with OLAP modeling, 115 compared with OLAP views, 137–138 OLTP table partitioning, 268–269 OnError event, 451 OnExecStatusChanged event handler, 500 OnInit method, 648 online analytical processing. See OLAP (online analytical processing) online transactional processing. See OLTP (online transactional processing) OnPostExecute event handler, 500, 500–501 OnProgress event handler, 500 OnVariableValueChanged event handler, 500 OpeningPeriod MDX function, 322–323 operating environment, conducting baseline survey, 86, 87–88 optional skills, for BI projects, 74–76 Oracle, 5, 19 Order MDX function, 302–303
Ordered content type, 362 Outliers Wizard, 706–707 overtraining, data model, 535
P Package Configuration Wizard, 549–550 Package Configurations Organizer dialog box, 548–549, 551–553 package designer adding connection managers to packages, 473 best design practices, 509–510 Connection Manager window in, 468 control flow designer, 468 data flow designer, 468 debugging packages, 471–472 event handler designer, 468 executing packages, 471–472 how they work, 470–472 navigating, 479 overview, 467–469 Toolbox window in, 469 viewing large packages, 479 Package Explorer, 468–469 Package Installation Wizard, 557–558 Package Store File deployment, for SSIS packages, 547 Package Store MSDB deployment, for SSIS packages, 547 packages, in SSIS adding checkpoints to, 506–507 adding to custom applications, 596–600 backups and restores, 107–108 best practices for designing, 509–510 configurations, 461 configuring transactions in, 507–508 connection managers, 448–450 control flow, 442–444 control flow compared with data flow, 444–445 creating with BIDS, 463–495 data flow, 444 debugging, 505–506 default error handling behavior, 498 defined, 20 deploying and managing by using DTUTIL utility, 441, 558 deployment options, 461–462, 546–558
developing in Visual Studio, 464–472 documentation standards, 525 Encrypt Sensitive With Password encryption option, 563 Encrypt Sensitive With User Key encryption option, 563 encrypting, 554–556 encryption issues, 563 error handling, 451–452 event handling, 450–451, 499–501 executing by using DTEXEC utility, 440 executing by using DTEXECUI utility, 440–441 expressions, 447 external configuration file, 548–552 file copy deployment, 552–553 handling sensitive data, 563 keeping simple, 508, 509 logical components, 442–452 physical components, 442 role of SSMS in handling, 438–439 saving results of Import/Export Wizard as, 439 scheduling execution, 558–559 security considerations, 97–98, 559–562 setting breakpoints in, 505–506 source control considerations, 112 as SSIS principal unit of work, 436, 438 and SSIS runtime, 452 tool and utilities for, 438–441 upgrading from earlier versions of SQL Server, 440 variables, 445–447 where to store, 98, 112 PacMan (SSIS Package Manager), 600 parallel processing, in SQL Server 2008, Enterprise edition, 269 ParallelPeriod MDX function, 232, 322 Parent MDX function, 317–318 parent tables, relationship to child tables, 5 Partition Processing destination, 485 Partition Wizard, 265–266 partitions, cube defined, 263 defining, 265–266 enabling writeback, 285–286 overview, 263–264 for relational data, 268–269 remote, 270 specifying local vs. remote, 270
in star schema source tables, 268–269 storage modes, 270 and updates, 532 partitions, table, 268–269 Pasumansky, Mosha, 340, 633 performance counters possible problems to document, 88 role in creating baseline assessment of operating environment, 87–88 Performance Visualization tool, 510 PerformancePoint Server integration with SQL Server Analysis Services, 745 as optional component for BI solutions, 22 skills needed for reporting, 75 using MDX with, 352–354 PeriodsToDate MDX function, 333 permissions, for SSRS objects, 103–104 perspectives compared with relational views, 227 defined, 149 overview, 227–228 phases, in Microsoft Solutions Framework (MSF), 62 physical infrastructure assessing servers needed for initial BI development environment, 88 conducting baseline survey, 85–87 planning for change, 85–89 physical modeling, for business intelligence solutions, 4 physical servers assessing number needed for initial BI development environment, 88–89 for business intelligence solutions, 4 conducting baseline survey, 85 considerations, 91–92 consolidation, 92 determining optimal number and placement for initial BI development environment, 89–94 development server vs. test server, 91 installing SQL Server, 90 installing SSAS, 90 installing SSRS, 90
target location options for deploying SSIS packages, 547–548 typical initial BI installation, 90 pie chart, adding to Microsoft Excel sheet, 729–731 PivotCharts, Microsoft Excel adding to workbooks, 679–680 creating views, 44 PivotTable Field List, 650–651, 652 PivotTables, Microsoft Excel connecting to sample cubes, 43–44 creating, 678–679 creating PivotChart views, 44 dimensional information, 678–679 formatting, 679 as interface for cubes, 10–11 overview, 675–680 ways to pivot view of data, 44 planning phase, MSF, 67–68 PostExecute method, 577, 578, 587 PostLogMessage method, 580 PPS. See Microsoft PerformancePoint Server (PPS) precedence constraints, 480–481 Predict DMX function, 421, 423–424, 425, 427 PredictHistogram DMX function, 425 prediction algorithms, 359 Prediction Calculator, Microsoft Excel, 419 prediction functions, 423–426 PREDICTION JOIN syntax, 422, 427 predictive analytics, 148–149, 355, 366, 426 Predictive Model Markup Language (PMML), 409 PredictProbability DMX function, 425 PredictProbabilityStDev DMX function, 425 PredictProbabilityVar DMX function, 425 PredictStDev DMX function, 425 PredictSupport DMX function, 425 PredictTimeSeries DMX function, 425 PredictVariance DMX function, 425 PreExecute method, 577, 587 proactive caching fine tuning, 283–284 notification settings, 282 overview, 279–282 Process Cube dialog box, 288–289 Process Progress dialog box, 407, 429–430
processing layer, security considerations, 97–98 processing time, 270 ProcessInput method, 583, 584–585, 585, 586 ProcessInputRow method, 584, 585, 586, 587. See also Input0_ ProcessInputRow method processors, multiple, 454, 523 ProClarity, 22 product managers job duties, 78 role and responsibility on development teams, 78 profit charts, 413–414 program managers job duties, 79 role and responsibility on development teams, 79 project architects, role and responsibility on development teams, 78–79 Project Real, 28 Project Server, 22 Propagate variable, 501 prototypes, building during MSF planning phase, 68 proxy accounts, 563 proxy servers, conducting baseline survey, 85
Q queries. See also DMX (Data Mining Extensions) query language; MDX (Multidimensional Expressions) query language cache scopes, 326 challenges in OLTP data stores, 6–7 creating in report designer, 616–618 creating named sets, 338–340 creating permanent calculated members, 333–336 manually writing, 54–55 MDX basics, 295 optimizing, 326 query browsers Cube filter, 175 Measure Group filter, 175 query designer multiple types, 616–618 for reports, 627–638 setting parameters in queries, 631–633
query languages, 23–25. See also DMX (Data Mining Extensions) query language; MDX (Multidimensional Expressions) query language; XMLA (XML for Analysis) query language Query Parameters dialog box, 632 query templates for DMX (Data Mining Extensions) query language, 179–180 execution process, 174–175 for MDX (Multidimensional Expressions) query language, 175–178 for XMLA (XML for Analysis) query language, 180 query tuning, 276
R RaiseChangedEvent SSIS variable property, 492 RangeMax DMX function, 425 RangeMid DMX function, 425 RangeMin DMX function, 425 Rank MDX function, 312–314 rapidly changing dimensions, 144 Raw File connection manager, 474 Raw File data flow destination, 485, 507 Raw File data flow source, 483, 484–485 RDBMS (relational database management systems) conducting baseline survey of servers, 85 defined, 5 SQL Server data as source data, 19 RDL. See Report Definition Language (RDL) .rdl files, 108, 112–113 .rds files, 108 ReadOnlyVariables property, 577 ReadWriteVariables property, 578 Recordset destination, 485 rectangle report item, 639–655 regression algorithms, 359 Re-label Wizard, 706, 707 relational data defined, 5 normalizing, 5 partitioning, 268–269 tables for denormalizing, 8 Relationship property, 363 ReleaseConnections method, 581 release/operations managers job duties, 83
role and responsibility on development teams, 83 remote partitions, 270 remote processing mode, 653, 656–657 Report Builder creating report, 644–646 defined, 19, 156 user interface, 643 version issues, 604, 643 Report Data window, 618 Report Definition Language (RDL) creating metadata, 635 defined, 24–25 role in report building, 621, 623, 624 version issues, 624 report designer adding report items to designer surface, 638–639 backing up files, 108 building tabular report, 619–620 creating queries, 616–618 fixing report errors, 621 illustrated, 614 opening, 614 previewing reports, 620 tabular reports, 619–620 types of report data containers, 618 using Tablix data region to build reports, 640–642 version enhancements, 19, 614 working with MDX query results, 635–643 Report Manager Web site, 604, 609 report models, 660–662 report processing modes, 653 report processing systems, 15. See also SQL Server Reporting Services (SSRS) Report Project Property Pages dialog box, 623–624 Report Properties dialog box, 647–648 Report Server Web service authentication for access, 610–611 authoring and deploying reports, 738–740 defined, 603–604 job manager, 612 overview, 609–610 Reporting Services. See SQL Server Reporting Services (SSRS) Reportingservicesservice.exe.config file, 108
reporting-type actions, 233, 235–236 reports. See also report processing systems; SQL Server Reporting Services (SSRS) adding custom code to, 647–651 authentication credential flow in SSRS, 103 building for SSRS, 627–646 cleaning and validating data for, 55 client interface considerations, 51 considering end-user audiences, 51 creating by using New Table Or Matrix Wizard, 644–646 creating in SSRS, 603–624 creating with BIDS, 612–622 defining data sources, 613–614 defining project-specific connections, 627 deploying to SSRS, 623–624 query designers for, 627–638 samples available, 622 setting parameters in queries, 631–633 Toolbox for, 621–622, 638 using Tablix data region to build, 640–642 viewing in Microsoft Excel, 650 viewing in Word, 649–650, 650 Reports Web site, 604 ReportViewer control features, 652–656 embedding custom controls, 652–656 entering parameter values, 656–657 security credentials, 657–658 required skills, for BI projects, 72–74 restores. See backups and restores ROLAP (relational OLAP) dimensional data, 284–285 in Measure Group Storage Settings dialog box, 279 overview, 267–268 roles, in SSAS, 195–196 .rptproj files, 108 rsconfig.exe tool, 604 rs.exe tool, 609 Rsmgrpolicy.config file, 108 Rsreportserver.config file, 108 Rssvrpolicy.config file, 108 runtime, SSIS, 452
S Sample Data Wizard, 705, 707–708 Save Copy Of Package dialog box, 554–556 scalability, and SSRS, 662–664 scaling out, in SQL Server 2008, Enterprise edition, 666–667 schema-first approach to BI design, 130 SCOPE keyword, MDX query language, 246, 341–342 Scope SSIS variable property, 492 Script component. See also Script Transformation Editor dialog box compared with Script task, 567–568 connection managers in, 580–581 as data source, 573 debugging, 587 destination-type, 586–587 selecting Transformation type option, 574 source-type, 582 synchronous and asynchronous transformation, 582–586 type options, 573–574 writing code, 577–581 Script task compared with Script component, 567–568 defined, 478 using to define scripts, 568–570 Script Task Editor dialog box, 568–570 Script Transformation Editor dialog box Connection Managers page, 580 Inputs And Outputs page, 576, 578–579, 579 Input Columns page, 574–576 opening, 574 scripting limitations, 587 Script task compared with Script component, 567–568 for SSRS administrative tasks, 667–669 ScriptLanguage property, 568 scripts, MDX, 341–343 security. See also least-privilege security best practices, 70, 564 BIDS, for solutions, 98–99 BIDS, when creating SSIS packages, 98
custom client considerations, 104–106 encrypting packages when deploying, 554–556 handling sensitive SSIS package data, 563 overview of SSIS package issues, 559–562 passing report credentials through ReportViewer, 657–658 proxy execution accounts for SSIS packages, 79 Security Assessment Tool, 95 security requirements in development environment, 70 overview, 95–106 Select Script Component type dialog box, 573–574 semiadditive behavior, configuring in Business Intelligence Wizard, 244, 250 sequence analysis algorithms, 359 Sequence containers, 478 servers. See logical servers and services; physical servers service level agreements (SLAs) availability strategies, 87 conducting baseline survey, 87 reasons to create in BI projects, 87 Service Principal Name (SPN), 159 Shared Data Source Properties dialog box, 613–614 SharePoint Server. See Office SharePoint Server 2007 Siblings MDX function, 317 signing assemblies, 589 skills, for BI projects optional, 74–76 required, 72–74 .sln files, 98 backing up, 108 Slowly Changing Dimension transformation, 533 Slowly Changing Dimension Wizard, SSIS, 143–144 slowly changing dimensions (SCD), 142–144 .smdl files, 108, 112–113 snapshots, database, 507 snowflake schema DimCustomer example, 134 on Dimension Usage tab of cube designer, 135, 215 overview, 134 when to use, 136–137 software development life cycle, 61–71
Solution Explorer, in BIDS, 40, 46, 184, 186–188 Solution Explorer, Visual Studio configuring SSAS object properties, 243 data sources and data source views, 466, 467 SSIS Packages folder, 466–467 viewing SSIS projects in, 466–467 solutions, defined, 98, 539 SOLVE_ORDER keyword, 343–344 source code control/source control systems, 540–542 source control, 111–113, 132 source data accessing by using least-privileged accounts, 96–97 cleaning, validating, and consolidating, 69 collecting connection information, 96–97 loading into decision tables, 525–526 non-relational, 5 performing quality checks before loading mining structures and models, 533–534 querying OLTP data stores, 6–7 relational, 5 structure names, 68 transformation issues, 519–523 source data systems, upgrading to SQL Server 2008, 89 Specify A Unary Operator, Business Intelligence Wizard, 244, 248–250 Specify Attribute Ordering, Business Intelligence Wizard, 244, 250 spiral method, 62, 64 SQL Server 2000, upgrading production servers for SSAS, SSIS, and SSRS, 90 SQL Server 2008 command-line tools installed, 157 complete installation, 156–157 as core component of Microsoft BI solutions, 16, 19 customizing data display, 3 Database Engine Tuning Advisor, 8 documenting sample use, 86 downloading and installing Data Mining Add-ins for Office 2007, 47–48 downloading and installing sample databases, 154 feature differences by edition, 37, 58
SQL Server 2008, continued installing sample databases, 37–41 installing samples in development environment, 86 minimum-installation paradigm, 153–154 new auditing features, 111 online transactional processing, 5–8 security features, 70 SQL Server 2008, Enterprise edition parallel processing, 269 scaling out, 666–667 SQL Server Agent, 558–564 SQL Server Analysis Services (SSAS). See also SQL Server Management Studio (SSMS); SQL Server Profiler aggregation types, 147 background, 16–17 backups and restores, 106–107 baseline service configuration, 157–159 BIDS as development interface, 16 BIDS as tool for developing cubes, 16 building sample OLAP cube, 37–39 considering where to install, 89 as core component of Microsoft BI solutions, 16 core tools, 153–181 creating data mining structures, 18 creating roles, 195–196 credentials and impersonation information for connecting to data source objects, 98–99 Cube Wizard, 128 data source overview, 188–190 data source views (DSVs), 190–195 database relationship between cubes and dimensions, 257–258 default assemblies, 197 defined, 16 deploying Adventure Works 2008 to, 41 Deployment Wizard, 155 dimension design in, 140–142 documenting service logon account information, 94 exploring fact table data, 120 installing, 153 installing multiple instances, 90 linked objects, 285 logon permissions, 98 mastering GUI, 69–70 performance counters for, 87
providers for star schema source data, 117 query designers, using for creating reports, 627–638 querying objects in SSMS, 170–175 reasons for installing SQL Server Database Engine Services with, 153 as requirement for OLAP BI solutions, 23 roles in, 195–196 scaling to multiple machines, 91 security considerations, 98–99 source control considerations, 112, 113 SSMS as administrative interface, 16 using compiled assemblies with objects, 196–197 using OLAP cubes vs. OLTP data sources, 54 viewing configuration options in SSMS, 93 viewing SSAS objects in Object Explorer, 160 viewing what is installed, 153–154 working on databases in BIDS in connected mode, 261 SQL Server Books Online, defined, 156 SQL Server Compact destination, 485 SQL Server Configuration Manager, 94, 155, 157–158 SQL Server Database Engine, 108 SQL Server Database Engine Tuning Advisor, 8, 156 SQL Server destination, 485 SQL Server Enterprise Manager, 463–464 SQL Server Error and Usage Reporting, defined, 155 SQL Server Installation Center, defined, 156 SQL Server Integration Services (SSIS) architectural components, 435–462 architectural overview, 436–438 backups and restores, 107–108, 112 BIDS as development interface, 16 BIDS as tool for implementing packages, 20 comparison with Data Transformation Services, 446, 463–464
considering where to install, 89 creating ETL packages, 55 in custom applications, 596–600 custom task and component development, 587–595 data flow engine, defined, 436 data flow engine, overview, 453–459 and data mining by DMX query, 426–428 data mining object processing, 430 defined, 20 documenting service logon account information, 94 error handling, 497–499 ETL skills needed, 76 event handling, 499–501 history, 437–438 as key component in Microsoft BI solutions, 20 log providers, 459–460 mastering GUI, 69–70 MsDtsSrvr.ini.xml configuration file, 112 .NET API overview, 442 object model and components, 442–452 object model, defined, 436 package as principal unit of work, 436, 438 performance counters for, 87 relationship to Data Transformation Services, 437 runtime, defined, 436 runtime, overview, 452 scaling to multiple machines, 91 scripting support, 567–587 security considerations for packages, 97–98 service, defined, 436 Slowly Changing Dimension Wizard, 143–144 solution and project structures, 539–540 source control considerations, 112 SSMS as administrative interface, 16 upgrading packages from earlier versions of SQL Server, 440 ways to check data quality, 516–518 SQL Server Management Studio (SSMS) backups of deployed SSAS solutions, 106–107 connecting to SSAS in, 160
as core tool for developing OLAP cubes and data mining structures in SSAS, 157 data mining object processing, 431 defined, 19, 155, 160 menus in, 161 object viewers, 164 opening query editor window, 295 processing OLAP objects, 163 querying SSAS objects, 170–175 role in handling SSIS packages, 438–439 verifying Adventure Works installation, 39 viewing configuration options available for SSAS, 93 viewing data mining structures, 164, 168–170 viewing dimensions, 163 viewing OLAP cubes, 164–168 viewing OLAP objects, 162–164 working with SSIS Service, 564–565 SQL Server Profiler as core tool for developing OLAP cubes and data mining structures in SSAS, 157 defined, 156 how query capture works, 172–174 overview, 171–172 role in designing aggregations, 275–277 using for access auditing, 109–110 SQL Server Reporting Services (SSRS) adding custom code to reports, 647–649 architecture, 603–605 authentication credential flow for reports, 103 background, 19 backups and restores, 108 BIDS as development interface, 16 building reports for, 627–646 command-line utilities, 604 Configuration Manager, 102, 108, 155, 607, 609, 737 configuring environment for report deployment, 623–624 configuring with Office SharePoint Server, 737–738 considering where to install, 89 as core component of Microsoft BI solutions, 16 creating reports, 603–624
defined, 19 deploying reports, 623–624 documenting service logon account information, 94 feature differences by edition, 606 installing add-in for Microsoft Office SharePoint Server, 742–743 installing and configuring, 606–612 integration with Office SharePoint Server, 604 mastering GUI, 69–70 performance counters for, 87 query design tools, 616–618 Report Builder, 19 and scalability, 662–664 scaling to multiple machines, 91 security decisions during installation and setup, 102–104 skills needed for reporting, 73, 75 source control considerations, 112–113 SSMS as administrative interface, 16 storing metadata, 604 using in SharePoint integrated mode, 742–743 using in SharePoint native mode, 740–741 using MDX with, 349–351 Web site interface, 19 Windows Management Instrumentation, 668–669 SQLPS.exe tool, 157 SSAS. See SQL Server Analysis Services (SSAS) SSAS Deployment Wizard, 155 SSIS. See SQL Server Integration Services (SSIS) SSIS Package Manager (PacMan), 600 SSIS Package Store, 552, 554, 564 SSIS Package Upgrade Wizard, 440 SSIS Performance Visualization tool, 510 SSIS Service, 564–565 SSMS. See SQL Server Management Studio (SSMS) SSRS. See SQL Server Reporting Services (SSRS) SSRS Web Services API, 658–659 stabilizing phase, MSF, 70–71 staging databases, when to use, 520–523, 524, 531 star schema comparison with non-star designs, 215–217
conceptual view, 125 for denormalizing, 30 dimension tables, 117–118, 121–125, 194 Dimension Usage tab, in cube designer, 126–127, 134–135 fact tables, 117–118, 118–121, 194 Microsoft changes to feature, 210–211 moving source data to, 525–530 for OLAP cube modeling, 116–125 on-disk storage, 116–117 physical vs. logical structures, 116–117 reasons to create, 126, 126–127 tables vs. views, 116–117 visualization, 117–118 storage area networks, 91 Subreport data region, 638–655 Sum aggregate function, 147 Synchronize command, to back up and restore, 107 synchronous data flow outputs, 458, 459 synchronous transformation, 583–586 SynchronousInputID property, 578–579 system variables, in SSIS, 445–446, 493
T Table content type, 362 Table data region, 638–655 table partitioning, defined, 269. See also OLTP table partitioning tables parent vs. child, 5 relational, for denormalizing, 8 Tablix container, defined, 622 Tablix data region, defined, 639–642 tabular report designer, 619–620 Tail MDX function, 315, 330–331 Task class, 591–592 Task Host containers, 478 tasks compared with components, 444, 445 custom, 587–588 default error handling behavior, 499 in SSIS package control flow, 442–444 taxonomies documenting, 67–68 role in naming of OLAP objects, 132
Team Foundation Server, 38, 540 teams. See development teams Template Explorer, 174, 422 test managers job duties, 81 keeping role separate from developer role, 81 role and responsibility on development teams, 81–82 testing. See stabilizing phase, MSF testing plans, 70–71 text data type, 363 Textbox data region, 638–655 This function, 342 time intelligence, configuring in Business Intelligence Wizard, 243, 245 Toolbox, SSIS adding objects to, 591 overview, 469 Toolbox, SSRS, in BIDS, 621–622, 638 Toolbox, Visual Studio, 652, 654 tools ascmd.exe tool, 157 BIDS Helper tool, 255, 490, 494, 510 downloading from CodePlex Web site, 37–38, 86, 157 DTEXEC utility, 440 DTEXECUI utility, 440–441 DTUTIL utility, 441 installed with SQL Server 2008, 157 rsconfig.exe tool, 604 rs.exe tool, 609 SQLPS.exe tool, 157 TopCount MDX function, 310 Trace Properties dialog box, 276 Tracer utility, Microsoft Excel, 101 transactional activities. See OLTP (online transactional processing) transactions, package, 507–508 Transact-SQL aggregate functions, 9 queries, 54–55 transformation components, 486–488 transformations, built-in, 578 transforming. See ETL (extract, transform, and load) systems translations for cube metadata, 225–226 SSAS, defined, 149
U UDM (Unified Dimensional Model), 9–10, 138 unary operators, specifying in Business Intelligence Wizard, 244, 248–250 Unified Dimensional Model (UDM), 9–10, 138 Union MDX function, 320 Upgrade Package Wizard, 487 upgrading SSIS packages from earlier versions of SQL Server, 440 URLs (uniform resource locators) enhanced arguments, 651 implementing access, 651–652 Usage-Based Optimization Wizard, 274–275 user experience managers, role and responsibility on development teams, 82, 82–83 user interfaces (UIs) role and responsibility of user experience managers, 82–83 skills needed for creating, 73, 75 utilities ascmd.exe tool, 157 BIDS Helper tool, 255, 490, 494, 510 downloading from CodePlex Web site, 37–38, 86, 157 DTEXEC utility, 440 DTEXECUI utility, 440–441 DTUTIL utility, 441 installed with SQL Server 2008, 157 rsconfig.exe tool, 604 rs.exe tool, 609 SQLPS.exe tool, 157
V Validate method, 592, 594 Value SSIS variable property, 492–493 ValueType SSIS variable property, 493 variables, in SSIS adding to packages, 490 Description property, 491 differences related to SSIS platform, 490–493 EvaluateAsExpression property, 491 Expression property, 491–492 Name property, 492 Namespace property, 492
opening Variables window, 490 overview, 445–447 properties, 491–493 RaiseChangedEvent property, 492 Scope property, 492 system, 493 Value property, 492–493 ValueType property, 493 ways to use, 494–495 variable-width columns, in data flow metadata, 456–457 Virtual PC, setting up test configurations, 37 virtualization, 4, 91 Visio adding Data Mining template, 47–48 data mining integration, 714–718 data mining integration functionality, 689–690 Data Mining toolbar, 715–718 as optional component for BI solutions, 21 using to create OLAP models, 130–132, 133 Vista. See Windows Vista, IIS version differences Visual SourceSafe (VSS) checking files in and out, 544–545 creating and configuring VSS database, 541–542 creating and configuring VSS user accounts, 542 History dialog box, 545 Lock-Modify-Unlock Model option, 541–542 overview, 540 storing solution files, 542–544 using Add SourceSafe Database Wizard, 541–542 Visual Studio. See also Solution Explorer, Visual Studio Adventure Works.sln file, 38, 40 embedding custom ReportViewer controls, 652–656 as location for SSIS package development, 464–472 relationship to BIDS, 16–17, 22, 41, 463 relationship to SQL Server Integration Services, 440 resemblance to BIDS interface, 157 signing custom object assemblies, 589 SSIS menu, 472 SSIS package designers, 467–472 Toolbox, 652, 654
usefulness in BI development, 86 viewing new SSIS project template in, 465–466 VSTO (Microsoft Visual Studio Tools for the Microsoft Office System), 683–684 VSTS (Visual Studio Team System) integrating MSF Agile into, 64–65 reasons to consider, 22, 546 Team Foundation Server, 111–112
W Warnings tab, in BIDS, 259 waterfall method, 62 Web Parts, 21–22, 740–741 Web servers, conducting baseline survey, 85 Web Services API, and Excel Services, 732–736 Windows Communication Foundation (WCF), 732–733 Windows Management Instrumentation (WMI), 668–669
Windows Reliability and Performance Monitor tool, 87 Windows Server 2003, IIS version differences, 86 Windows Server 2008 IIS version differences, 86 Performance Monitor counters, 523 Reliability and Performance Monitor tool, 87 virtualization improvements, 4 Windows SharePoint Services and Office SharePoint Server 2007, 94 Windows Vista, IIS version differences, 86 Windows-on-Windows 32-bit applications, 91 Word 2007 as optional component for BI solutions, 21 viewing SSRS reports in, 649–650 writeback defined, 98 overview, 145
storing changes to dimensions, 285–286
X XML data flow source, 483 XML for Analysis. See XMLA (XML for Analysis) query language XMLA (XML for Analysis) query language background, 24 defined, 24 query templates, 180 source control considerations, 113 using for data mining object processing, 431 viewing scripts, 164
Y Ytd MDX function, 294
About the Authors Several authors contributed chapters to this book.
Lynn Langit Lynn Langit is a developer evangelist for Microsoft. She works mostly on the West Coast of the United States; her home territory is southern California. Lynn spends most of her work hours doing one of two things: speaking to developers about the latest and greatest technology, or learning about new technologies that Microsoft is releasing. She has spoken at TechEd in the United States and Europe as well as at many other professional conferences for developers and technical architects. Lynn hosts a weekly webcast on MSDN Channel 9 called “geekSpeak.” She is a prolific blogger and social networker. Lynn’s blog can be found at http://blogs.msdn.com/SoCalDevGal. Prior to joining Microsoft in 2007, Lynn was the founder and lead architect of her own company. There she architected and developed business intelligence (BI) solutions and other .NET projects and trained technical professionals to build them. A holder of numerous Microsoft certifications—including MCT, MCITP, MCDBA, MCSD.NET, MCSE, and MSF—Lynn also has 10 years’ experience in business management. This unique background makes her particularly qualified to share her expertise in developing successful real-world BI solutions using Microsoft SQL Server 2008. This is Lynn’s second book on SQL Server business intelligence. In her spare time, Lynn enjoys sharing her love of technology with others. She leads Microsoft’s annual DigiGirlz day and camp in southern California. DigiGirlz is a free educational program that introduces high-school girls to careers in technology. Lynn also personally volunteers with a group of technical professionals who provide support to a team of local developers building and deploying an electronic medical records system (SmartCare) in Lusaka, Zambia. For more information about this project, go to http://www.opensmartcare.org.
Davide Mauri Davide wrote Chapters 15 and 16 in the SQL Server Integration Services (SSIS) section and kindly reviewed Lynn’s writing for the remainder of the SSIS chapters. Davide holds the Microsoft certifications MCP, MCAD, and MCDBA, is a Microsoft Certified Trainer (MCT), and is a Microsoft Most Valuable Professional (MVP) for SQL Server. He has worked with SQL Server since version 6.5, and his interests cover the whole platform, from the relational engine to Analysis Services, and from architecture definition to performance tuning.
Davide also has strong knowledge of XML, .NET, and object-oriented design principles, which gives him the vision and experience to handle the development of complex business intelligence solutions. Having worked as a Microsoft Certified Trainer for many years, Davide is able to share his knowledge with his co-workers, helping his team deliver high-quality solutions. He also works as a mentor for Solid Quality Mentors and speaks at many Italian-language and international BI events.
Sahil Malik Sahil Malik wrote Chapter 25 in the SQL Server Reporting Services (SSRS) section. Sahil, the founder and principal of Winsmarts, has been a Microsoft MVP and INETA speaker for many years. He is the author of many books and numerous articles. Sahil is a consultant and trainer who delivers training and talks at conferences internationally.
Kevin Goff Kevin wrote Chapters 10 and 11 on MDX in the SQL Server Analysis Services (SSAS) section.
John Welch John Welch was responsible for the technical review of this book. John is Chief Architect with Mariner, a consulting firm specializing in enterprise reporting and analytics, data warehousing, and performance management solutions. He has been working with business intelligence and data warehousing technologies for six years, with a focus on Microsoft products in heterogeneous environments. He is a Microsoft MVP, an award given to him for his commitment to sharing his knowledge with the IT community. John is an experienced speaker, having given presentations at Professional Association for SQL Server (PASS) conferences, the Microsoft Business Intelligence conference, Software Development West (SD West), the Software Management Conference (ASM/SM), and others. John writes a blog on business intelligence topics at http://agilebi.com/cs/blogs/bipartisan. He writes another blog focused on SSIS topics at http://agilebi.com/cs/blogs/jwelch/. He is also active in open source projects that help make the development process easier for Microsoft BI developers, including BIDS Helper (http://www.codeplex.com/bidshelper), an add-in for Business Intelligence Development Studio that adds commonly needed functionality to the environment. He is also the lead developer on ssisUnit (http://www.codeplex.com/ssisUnit), a unit testing framework for SSIS.